You have accessed and collected all the relevant data. With the data at hand, you are now ready to start thinking about the data analysis and interpretation. However, before diving into the analysis, it is crucial to obtain an overview of the collected data and prepare it by cleaning and pre-processing.
Pre-processing data is essential for an accurate and insightful data analysis. How well you clean and transform the data has a huge impact on the quality of the result, since incorrect and inconsistent data leads to false conclusions. The steps and techniques for data cleaning will vary from dataset to dataset. However, this section will provide an overview of the most important steps.
Pre-processing data is a very technical step that requires a profound knowledge of statistics and relevant software. While it isn’t the role of policymakers to perform this step, it’s important to understand what needs to be done and why. At the end of this section, you should feel confident to join technical meetings with data analysts, make meaningful and realistic data requests and promote structural change that facilitates the pre-processing of data.
Before you start cleaning and pre-processing data, get to know the dataset. By examining the data, you can identify its structure, format and the types of variables present. This also helps you spot data quality issues and determine appropriate strategies for handling various aspects of the data. It ensures that subsequent pre-processing steps are conducted effectively and that the resulting data is suitable for the desired analysis or modelling tasks.
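As an illustration, a few lines of pandas code are usually enough for this first overview. The following is a minimal sketch, assuming the dataset is available as a CSV file; the file name and the choice of library are illustrative:

```python
import pandas as pd

# Load the dataset (the file name is illustrative)
df = pd.read_csv("life_expectancy.csv")

# Structure and format: number of rows and columns, column names, data types
print(df.shape)
df.info()

# A first look at the actual values
print(df.head())

# Summary statistics for numeric columns and the number of missing values per column
print(df.describe())
print(df.isna().sum())
```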
Once you’re familiarized with the dataset, you can begin the process of pre-processing the data. This typically involves the following five steps:
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies and inaccuracies in the dataset:
Duplicate records or observations can occur when data is collected from multiple sources, scraped from the web or obtained from different clients or departments. Data cleaning involves identifying and removing these duplicates to avoid bias in the analysis or modelling process.
Returning to our example of exploring the factors that influence life expectancy, you discovered the presence of duplicate records. To address this issue, you now remove duplicates and double-check you haven’t missed any:
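A minimal sketch of how this can look in pandas, continuing with the illustrative DataFrame df from above:

```python
# Count duplicate rows before removing them
print(f"Duplicate rows found: {df.duplicated().sum()}")

# Drop exact duplicates, keeping the first occurrence of each record
df = df.drop_duplicates(keep="first")

# Double-check that no duplicates remain
assert df.duplicated().sum() == 0
```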
The next step in data cleaning focuses on addressing structural errors. These errors can occur due to issues during measurement, data transfer or general carelessness in data management. For instance, the variable date might have entries that put the month first and others that put the day first. Structural errors also include typos or inconsistent capitalization in categorical features. To identify and fix these errors, data analysts typically use summary tables of all entries in a column, which show useful information such as the number of missing values or the mean value of a numeric column. Visualizations can also help to spot inconsistencies or irregularities (see more on how to “visualise data”).
In the first section, “How to get started”, we already spotted structural errors in the spelling of Afghanistan. Being aware of these errors, you can now clean the dataset by correcting typos and inconsistent capitalization, ensuring the accuracy of the analysis:
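A sketch of how such corrections might look in pandas, assuming the country names are stored in a column named "Country"; the specific misspelled variants shown are illustrative:

```python
# Remove stray whitespace and harmonize capitalization in the country column
df["Country"] = df["Country"].str.strip().str.title()

# Correct remaining typos explicitly (the misspelled variants are illustrative)
df["Country"] = df["Country"].replace({"Afganistan": "Afghanistan",
                                       "Afghanstan": "Afghanistan"})

# Verify: each country should now appear under a single, consistent name
print(df["Country"].value_counts())
```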
Missing values are a common issue in datasets and can arise for various reasons, such as data entry errors, equipment malfunctions or survey non-responses. Before deciding how to handle this issue, you should explore what data is missing and why.
To get started, ask the following questions:
Common ways to deal with missing values:
In our example dataset, the employee of the Ministry of Health discovered that some rows contain missing values. To prevent potential biases that could affect our subsequent analysis, we decide to remove those rows entirely.
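A minimal sketch of how this can be done with pandas:

```python
# Inspect how many values are missing per column
print(df.isna().sum())

# Remove all rows that contain at least one missing value
rows_before = len(df)
df = df.dropna()
print(f"Removed {rows_before - len(df)} rows with missing values")
```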
Find a step-by-step guide for handling missing values in Python here.
Outliers are data points that are significantly higher or lower than the nearby data points and the general pattern of the dataset or graph. These extreme values can disrupt the overall trend or distribution of the data, and they may indicate potential errors, anomalies or unique observations. It's important to identify and handle outliers appropriately during data analysis to ensure accurate and reliable results.
Common methods to detect outliers include Z-Score, Standard Deviation Method, Boxplot Method and Isolation Forest. The choice of method depends on factors such as data distribution, dimensionality, presence of noise and the specific requirements of the analysis. It's often a good idea to experiment with different methods and evaluate their performance on the specific dataset at hand.
You can treat outliers either by deleting them, when you’re sure they’re incorrect and insignificant, or by replacing them, for instance with the mean value or a predictive model (see above on how to handle missing values).
Sometimes outliers are extreme values that are nonetheless valid and genuine observations. In such cases, it can make sense to keep the outlier and understand the reasons behind its value, as this may uncover hidden knowledge that improves your analysis. However, handling outliers is quite complex and requires a data expert.
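As an example of one of the detection methods mentioned above, the following sketch flags outliers using the z-score. The column name and the conventional threshold of three standard deviations are illustrative choices:

```python
# Compute z-scores: how many standard deviations each value lies from the mean
col = "Life expectancy"  # illustrative column name
z_scores = (df[col] - df[col].mean()) / df[col].std()

# Flag values that lie more than 3 standard deviations from the mean
outliers = df[z_scores.abs() > 3]
print(f"{len(outliers)} potential outliers found")

# Option 1: drop them if they are clearly erroneous
# df = df[z_scores.abs() <= 3]

# Option 2: replace them, e.g. with the mean of the remaining observations
# df.loc[z_scores.abs() > 3, col] = df.loc[z_scores.abs() <= 3, col].mean()
```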
Data transformation is the process of changing the format, structure or values of data to make it more suitable for analysis, modeling or other data processing tasks. It usually involves applying various operations or functions to the data to achieve the desired transformation. Below are two standard types of data transformation.
Standardizing units involves converting observations to a consistent unit of measure, whereas normalizing observations describes the process of scaling measurements to a common reference or range, typically to a scale between 0 and 1 or -1 and 1. This process enables the comparability of variables. By bringing all measurements to the same scale, analysts can accurately understand relationships and patterns within the data. Furthermore, it can significantly impact the performance of statistical models and machine learning algorithms. Some algorithms, such as regression models or neural networks, produce better results when the scale and magnitude of the input variables are standardized.
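A sketch of both rescaling approaches on a single numeric column, again using an illustrative column name (libraries such as scikit-learn offer equivalent helpers, e.g. MinMaxScaler and StandardScaler):

```python
col = "Life expectancy"  # illustrative column name

# Min-max normalization: rescale values to the range 0 to 1
df[col + "_normalized"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Z-score standardization: rescale to mean 0 and standard deviation 1
df[col + "_standardized"] = (df[col] - df[col].mean()) / df[col].std()
```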
An analysis may require a new variable. This can be the case if, for example, averages or other values calculated from the existing variables are of interest for the analysis.
Or, in some cases, the relationship between a variable and the outcome variable may not follow the pattern that the model you’d like to apply assumes. For instance, the model might assume a linear relationship between two variables, while the observed relationship is curved or U-shaped. By squaring a variable or taking its logarithm or root, you can account for the non-linear relationship so that a linear regression model becomes more precise. Note that this changes the interpretation of the variable’s effect in a model. For more information on interpreting data analyses, see the section on how to “analyse and interpret data”.
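A sketch of creating such transformed variables, assuming an illustrative column "GDP" whose relationship with life expectancy is curved rather than linear:

```python
import numpy as np

# New variable: logarithm of GDP to linearize a curved relationship
# ("GDP" is an illustrative column name; the logarithm requires strictly positive values)
df["GDP_log"] = np.log(df["GDP"])

# New variable: squared term to capture a U-shaped relationship
df["GDP_squared"] = df["GDP"] ** 2
```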
Data reduction refers to the process of reducing the size or complexity of data while preserving its essential information. Data reduction is commonly used when dealing with datasets that are large, unwieldy or contain redundant or irrelevant information. The objective is to simplify the data without losing critical insights or compromising the accuracy of analyses. The reduction in data volume directly translates to lower storage requirements, which in turn leads to cost savings in hardware, infrastructure and maintenance. Additionally, statistical models for data analysis perform better if irrelevant variables, which don’t help to explain a certain effect, are omitted.
Data reduction techniques can include:
This process is a part of data analysis that involves selecting a subset of relevant features (another word for variables) from a larger set of available features. The objective is to reduce the complexity or size of the dataset and to enhance the interpretability and performance of your model by focusing on the most informative and discriminative features. There are different statistical techniques for determining which features carry the most information.
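One simple, illustrative technique is to rank features by how strongly they correlate with the outcome variable and keep only the strongest ones; libraries such as scikit-learn provide more sophisticated alternatives (e.g. SelectKBest). The column names and the number of features kept are assumptions for the example:

```python
# Rank numeric features by the absolute strength of their correlation
# with the outcome variable (column names are illustrative)
target = "Life expectancy"
correlations = df.corr(numeric_only=True)[target].drop(target).abs()

# Keep, for example, the five most informative features
selected_features = correlations.sort_values(ascending=False).head(5).index.tolist()
print(selected_features)
```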
Data sampling is a statistical technique for selecting a subset of data from a larger population or dataset. By carefully choosing a representative sample, data scientists or researchers can reduce the overall size of the data while still capturing meaningful information about the entire population. This subset can then be used to draw accurate conclusions, make inferences or perform analyses that are representative of the larger dataset, saving computational resources and time.
There are two main types of sampling methods that can be utilized: probability sampling, in which every observation has a known chance of being selected (for example, simple random or stratified sampling), and non-probability sampling, in which observations are chosen based on convenience or expert judgement.
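As a small illustration of probability sampling with pandas, the following sketch draws a simple random sample and a stratified sample based on the "Status" column; the 10 per cent sample fraction is an illustrative choice:

```python
# Simple random sample: 10% of all rows, reproducible via random_state
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sample: 10% within each category of "Status", so the share of
# "Developed" and "Developing" countries is preserved in the sample
stratified_sample = df.groupby("Status").sample(frac=0.1, random_state=42)

print(len(random_sample), len(stratified_sample))
```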
Data subsetting involves extracting specific portions of a dataset based on certain criteria or conditions. This process allows us to focus on particular rows or columns of the data or isolate observations that possess specific characteristics. Subsetting is commonly performed during the data exploration phase to examine the relationship between variables, especially to identify potential interactions. It also enables us to extract subsets of data for separate analyses or to perform specific operations on them. Essentially, data subsetting helps us narrow down our data to relevant portions that are of interest for further investigation or analysis.
During the exploratory data analysis phase, you discovered a variable named "Status" that categorizes countries as either "Developed" or "Developing". However, for your analysis, you’re only concerned with developing countries. Therefore, you decide to create a subset of your data that includes information exclusively on developing countries. This subset will allow you to focus specifically on the relevant observations that meet your criteria of interest.
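A sketch of creating this subset, using the "Status" column described above:

```python
# Keep only the rows where the country is classified as "Developing"
developing = df[df["Status"] == "Developing"].copy()
print(f"{len(developing)} of {len(df)} observations relate to developing countries")
```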
Data integration is the process of combining and consolidating data from multiple sources into a unified and consistent view. It involves bringing together data from different systems, databases, files or formats, and transforming and harmonizing it to create a comprehensive and reliable dataset.
The first step in data integration is to harmonize the different datasets. The goal is to ensure that data collected from various systems or sources can be combined, compared and analysed effectively, eliminating discrepancies and inconsistencies. Harmonizing the data involves transforming, cleaning and reformatting it so that it adheres to a common set of standards and definitions. From the previous sections, you should already know what cleaning and transforming data entails. So, you only need to decide which standards all data sources should be aligned to. For instance, every dataset might have a column with the variable “Date”. You need to decide which format the variable “Date” should take (e.g., DD-MM-YYYY, MM-DD-YYYY or MM-DD-YY) and then transform all “Date” columns in every dataset to this format.
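A sketch of harmonizing such a “Date” column across two small, purely illustrative datasets (df_a and df_b, with dummy values and source formats assumed for the example), so that both end up in the agreed DD-MM-YYYY format:

```python
import pandas as pd

# Two small illustrative datasets with different date formats (dummy values)
df_a = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                     "Date": ["01-15-2015", "03-20-2015"]})      # MM-DD-YYYY
df_b = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                     "Date": ["2015-01-15", "2015-03-20"],       # YYYY-MM-DD
                     "Schooling": [10.1, 14.2]})                 # illustrative extra column

# Parse the dates from their original formats
df_a["Date"] = pd.to_datetime(df_a["Date"], format="%m-%d-%Y")
df_b["Date"] = pd.to_datetime(df_b["Date"], format="%Y-%m-%d")

# Store both columns in the agreed standard format DD-MM-YYYY
df_a["Date"] = df_a["Date"].dt.strftime("%d-%m-%Y")
df_b["Date"] = df_b["Date"].dt.strftime("%d-%m-%Y")
```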
After harmonizing your datasets, you might want to merge them, that is, combine multiple datasets into a single dataset based on a common attribute or key. For instance, the country name could be the attribute by which you want to merge: two datasets containing different information each have a column with country names, so you can merge the information on, say, “Afghanistan” from both datasets into one larger dataset. Merging data is commonly used to consolidate information from different sources into a more comprehensive and unified dataset, which facilitates analyses and reporting.
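Continuing with the two illustrative datasets df_a and df_b defined above, a merge on the country name could look like this:

```python
# Combine both datasets into one, matching rows by country name
# (how="inner" keeps only countries that appear in both datasets;
#  columns present in both, such as "Date", are suffixed with _x and _y)
merged = pd.merge(df_a, df_b, on="Country", how="inner")
print(merged.head())
```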
Once the data has been cleaned and pre-processed, a decision must be made on how and where best to store the data.
When you’re dealing with large amounts of data from various sources, it makes sense to develop a database. Databases provide a structured way to store, retrieve and adjust information. Creating a database involves several steps, and the specific process may vary depending on the database management system (DBMS) you’re using.
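As a small illustration of what storing the cleaned data in a database can look like, the following sketch writes the DataFrame to a local SQLite database; in practice, the choice of DBMS, file location and table name would follow your institution’s guidelines and are assumptions here:

```python
import sqlite3
import pandas as pd

# Create (or open) a local SQLite database file and write the cleaned data
# (file and table names are illustrative)
connection = sqlite3.connect("life_expectancy.db")
df.to_sql("life_expectancy_clean", connection, if_exists="replace", index=False)

# Read it back to confirm the table was stored correctly
check = pd.read_sql("SELECT COUNT(*) AS rows FROM life_expectancy_clean", connection)
print(check)
connection.close()
```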
There are different options for storing and accessing the data. Data can be stored on personal devices (such as hard drives, external disks or USB drives), in cloud storage or on Network Attached Storage (NAS). Typically, the institution you’re working for has clear guidelines on which storage option to use, and personal devices will probably be the exception.
Here are some practical considerations when deciding between cloud storage and NAS:
Cloud storage involves storing data on remote servers managed by a third-party provider. The data is accessed over the internet, and the provider is responsible for the storage infrastructure and maintenance. The main advantages are convenient data access from anywhere with an internet connection and scalability without additional hardware investments. The main disadvantages include dependence on stable internet connectivity, data privacy concerns from relying on third-party providers, loss of data sovereignty and expensive service contracts.
NAS involves storing data on a dedicated storage device that is connected to a local network. It provides shared storage accessible by multiple devices within the network. Typically, public institutions have such networks already set up. The main advantages are complete ownership of the data infrastructure and centralized data management. However, typical drawbacks include limited accessibility outside the local network as well as security, maintenance and administration costs. Also, scaling NAS storage capacity can be more complex and may require investing in additional hardware.
To recap, the five pre-processing steps covered in this section are data cleaning, data transformation, data reduction, data integration and data storage.
Before moving on, ensure that all pre-processing steps are properly documented. This will help others to understand the reasoning behind each change and ensure reproducibility. With that, you can move to the section “Analyse and Interpret Data”.