Analysing data is complex, and there are many factors to consider. This section provides an overview of the main considerations and common mistakes. You will also gain a better understanding of what to look out for when interpreting data, so that the conclusions you draw are correct and robust.
At the end of this section, you should be able to judge which data analysis is appropriate for your set of analytical questions. You'll also be able to weigh the accuracy and interpretability of your desired analytical outcome, and when interpreting data, you'll be aware of common mistakes and limitations.
Policymakers are usually not doing the data analysis themselves. Rather, they need to make meaningful and realistic requests to their technical teams. Therefore, this section only provides an overview of the main considerations and common mistakes when analysing and interpreting data.
Before starting any type of analysis, you should be very clear about the set of questions that you’d like to answer. You start with the problem you defined (see “Define your Problem Statement”) and make a list of everything you need to know to have a better understanding of your problem and solution. A trick that might help you to come up with a list of questions is to think about the counterfactual of the problem that you’re trying to solve and consider marginalized groups that are often left out of the discussion.
Imagine you work in the Ministry of Health, and you’ve been asked to revise a policy regarding healthcare expenditures. To make a more informed decision about how healthcare expenditures should be designed, you focus on different diseases, lifestyles, and socioeconomic circumstances and how they affect life expectancies. You come up with a long list of questions that your data analysis should answer, for example: How do specific diseases affect life expectancy? How do lifestyle factors such as smoking or alcohol consumption relate to life expectancy? How do socioeconomic circumstances such as income and education shape life expectancy across regions and groups?
Descriptive data analysis is a statistical method used to summarize and describe the main characteristics of a dataset. It helps in understanding the key features, trends and patterns within the data without making any inferences or generalizations about a larger population. Oftentimes a descriptive analysis is completely sufficient to make relevant conclusions.
Let's consider the example of using life expectancy data. Suppose we have a dataset that includes information about the life expectancy of individuals over a certain period of time. The dataset might include variables such as year recorded, life expectancy, diseases, alcohol consumption, etc.
By performing descriptive analyses such as summary statistics, frequency distributions and simple cross-tabulations on the life expectancy data, you gain valuable insights into the average life expectancy, its distribution, trends over time and potential associations with other variables.
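As a rough illustration, here is a minimal sketch of what such a descriptive analysis could look like in Python with pandas. The file name (life_expectancy.csv) and the column names (year, life_expectancy, alcohol_consumption) are assumptions made up for this example:

```python
import pandas as pd

# Load a hypothetical dataset (file and column names are assumptions for this sketch)
df = pd.read_csv("life_expectancy.csv")

# Summary statistics: mean, median, spread and range of life expectancy
print(df["life_expectancy"].describe())

# Trend over time: average life expectancy per year
print(df.groupby("year")["life_expectancy"].mean())

# A simple association check: correlation with alcohol consumption
print(df["life_expectancy"].corr(df["alcohol_consumption"]))
```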
Statistical learning refers to a set of approaches to automatically learn patterns and relationships within data to make predictions or uncover insights. Prediction and inference are two fundamental concepts in statistical analysis and have different goals and methods. Understanding the difference between the two will help you to set up the appropriate analysis for your set of questions.
For example, in the case of life expectancy data, you might want to infer whether there’s a significant relationship between factors like education level and life expectancy at the population level. By using statistical techniques such as hypothesis testing or confidence intervals, you can draw conclusions about the relationship between these variables.
The goal of inference is to produce a better understanding of underlying patterns and relationships within the data and make broader statements about the population based on the sample data.
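To make this concrete, the sketch below shows one common way an analyst might test such a relationship: an ordinary least squares regression using statsmodels. The file and column names (life_expectancy.csv, education_years, life_expectancy) are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset with one row per individual or region (column names are assumptions)
df = pd.read_csv("life_expectancy.csv")

# Fit a simple linear model: life expectancy as a function of years of education
model = smf.ols("life_expectancy ~ education_years", data=df).fit()

# The summary reports the estimated coefficient, its confidence interval and p-value,
# which is the basis for an inferential statement about the population
print(model.summary())
```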
For example, you could use a predictive model like regression analysis to develop a model that predicts life expectancy based on these factors. The model would be trained using historical data from different communities where we know both the input variables (age, gender, education, healthcare access) and the corresponding life expectancy values. Once the model is trained, you can use it to predict the life expectancy of new communities given their characteristics and forecast how the life expectancy will change in the future.
The goal of prediction is to provide accurate estimates or forecasts for future observations. It focuses on making specific predictions for individual cases within the dataset, rather than drawing general conclusions about the population.
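A minimal sketch of such a predictive workflow with scikit-learn might look as follows; the file name, feature columns and the values for the new community are invented for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data; the file and column names are assumptions for this sketch
df = pd.read_csv("communities.csv")
features = ["median_age", "share_female", "education_years", "healthcare_access"]
X, y = df[features], df["life_expectancy"]

# Hold out part of the data to check how well the model predicts unseen communities
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen communities:", model.score(X_test, y_test))

# Predict life expectancy for a new community given its characteristics (invented values)
new_community = pd.DataFrame([[38.0, 0.51, 12.5, 0.8]], columns=features)
print("Predicted life expectancy:", model.predict(new_community)[0])
```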
The trade-off between model complexity and interpretability refers to the fact that as a statistical or machine learning model becomes more complex and sophisticated, it tends to be less interpretable or understandable by humans.
Suppose you want to develop a model to predict life expectancy based on factors such as age, gender, education level, income and healthcare access. To illustrate the trade-offs between different models, let’s consider two options at opposite ends of the complexity spectrum: a simple linear regression and a very complex neural network.
Linear regression: Linear regression is a simple and interpretable model. It assumes a linear relationship between the input variables and target variable (life expectancy). The model estimates the contribution and impact of each variable on life expectancy. It’s relatively easy to interpret these relationships and understand how changes in the input variables affect the predicted life expectancy.
Neural network: Neural networks are highly complex and non-linear. They consist of multiple layers of interconnected nodes (neurons) and can capture intricate relationships and patterns within the data. These models are capable of learning complex representations and interactions between variables, which can potentially improve the accuracy of predictions. However, as the complexity increases, it becomes harder to interpret how the model arrives at its predictions. The relationship between the input variables and the output (life expectancy) is often obscured within the numerous layers and weights of the neural network, making it challenging to understand the underlying factors contributing to the prediction.
Linear regression: In the case of linear regression, the model provides interpretable coefficients for each input variable. For example, if the coefficient for education level is positive, it suggests that higher education is associated with increased life expectancy. This interpretability allows you to draw meaningful conclusions and make informed decisions based on the model's outputs.
Neural networks: On the other hand, neural networks often lack interpretability. The multiple layers and complex interactions make it difficult to explain why the model arrived at a particular prediction. The weights assigned to different variables, or the internal representations learned by the neural network, may not have a direct and intuitive interpretation. Consequently, it becomes challenging to gain insights into the factors that drive the predictions, limiting our ability to interpret and trust the model's outputs.
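The contrast can be illustrated with a small sketch: a linear regression exposes one coefficient per input variable that can be read and explained, while even a modest neural network holds thousands of weights with no direct interpretation. The dataset and column names below are assumptions for this example:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Hypothetical dataset; file and column names are assumptions for this sketch
df = pd.read_csv("communities.csv")
features = ["median_age", "education_years", "income", "healthcare_access"]
X, y = df[features], df["life_expectancy"]

# Linear regression: one coefficient per input variable that can be read and explained
linear = LinearRegression().fit(X, y)
for name, coef in zip(features, linear.coef_):
    print(f"{name}: {coef:+.3f} years per unit change")

# Neural network: often more flexible, but its thousands of internal weights
# have no direct, human-readable meaning
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X, y)
print("Number of weight matrices:", len(net.coefs_))  # not interpretable coefficients
```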
Qualitative data analysis refers to the process of examining and interpreting non-numerical or non-quantifiable data to gain insights, identify patterns and generate meaningful interpretations. Qualitative data can include various types of information, such as interview transcripts, survey responses, field notes, observation records, open-ended questionnaire responses, audio or video recordings and textual data from documents or literature. Unlike quantitative data that can be analysed using statistical methods, qualitative data analysis involves a more interpretive and subjective approach.
Imagine you’ve done some interviews with health experts on factors that influence life expectancy. Here are some general steps of qualitative data analysis:
Step 1: Data Coding. Read through the interview transcripts and label segments of text with short descriptive codes that capture key ideas.
Step 2: Develop a Coding System. Organize the codes into a consistent scheme, merging overlapping codes and defining each one so it can be applied uniformly across interviews.
Step 3: Axial Coding. Relate the codes to one another and group them into broader categories, identifying connections between them.
Step 4: Thematic Analysis. Identify the overarching themes that emerge across the categories and relate them back to your analytical questions.
These steps are very generic and should give you a rough idea of how you could approach a qualitative data analysis. The steps for data collection and interpretation are left out as they’re covered in other parts of the navigator.
Natural Language Processing
Progress in artificial intelligence allows for the development of algorithms and models that enable computers to understand, interpret and generate human language in a way that is both meaningful and useful. Natural language processing is the keyword to search for if you want to find further information.
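As a hypothetical example of how such tooling could support qualitative analysis, the sketch below counts frequent terms across a few invented interview snippets as a starting point for spotting candidate codes and themes:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A few hypothetical interview snippets (placeholder text for this sketch)
transcripts = [
    "Access to preventive care is the biggest driver of longer lives in our region.",
    "Smoking and alcohol consumption remain the main risks we see in older patients.",
    "Education changes how people use the healthcare system and how early they seek help.",
]

# Count frequent terms (ignoring common English stop words) as a starting point
# for identifying candidate codes and themes
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(transcripts)
totals = counts.sum(axis=0).A1
for term, total in sorted(zip(vectorizer.get_feature_names_out(), totals), key=lambda x: -x[1])[:10]:
    print(term, total)
```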
In today's data-driven world, policymakers rely heavily on accurate and meaningful data analysis to make informed decisions that shape the course of society. However, misinterpreting data can have severe consequences, undermining the very essence of evidence-based policymaking. Erroneous conclusions drawn from misinterpreted data can yield policies that are ineffective, inefficient or even counterproductive. Such missteps squander resources, hinder progress and fail to address the needs of the communities that policymakers serve.
One of the gravest dangers of misinterpreting data lies in the perpetuation of biases and the reinforcement of existing inequalities. Data can inadvertently reflect societal biases due to various factors, such as biased data collection methods or skewed sample selection. Recognizing and challenging these biases is paramount to ensuring that fair and equitable policies serve all segments of society.
Correlation: Correlation refers to a statistical relationship between two variables. With life expectancy, we might examine the correlation between factors such as education level and life expectancy. For example, we might find a positive correlation between higher education levels and longer life expectancy. This means that, on average, individuals with higher education tend to live longer. However, correlation alone doesn’t imply causation: it indicates that there’s a relationship between the variables, but it doesn’t establish a cause-and-effect relationship between them.
Causation: Causation, on the other hand, suggests a cause-and-effect relationship between variables. In the case of life expectancy, identifying causation would involve determining whether a specific factor directly causes changes in life expectancy. For example, we might investigate whether smoking directly causes a decrease in life expectancy. Establishing causation requires rigorous scientific studies, such as randomized controlled trials or longitudinal studies that can demonstrate a direct causal relationship by controlling for other confounding factors. In general, causation is much more difficult to prove than correlation and should, therefore, only be claimed if you’re certain of it.
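A small sketch of what a correlation check might look like in practice (the dataset and column names are assumptions); note that the code can only quantify how strongly the variables move together, not whether one causes the other:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical dataset; file and column names are assumptions for this sketch
df = pd.read_csv("life_expectancy.csv")

# Pearson correlation between years of education and life expectancy
r, p_value = pearsonr(df["education_years"], df["life_expectancy"])
print(f"correlation r = {r:.2f}, p-value = {p_value:.3f}")

# A large positive r indicates that the two variables move together,
# but it says nothing about whether education causes longer lives:
# establishing causation would require an experimental or quasi-experimental design.
```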
Statistical Significance: Statistical significance is a measure that helps determine whether an observed result is likely to be due to a real effect or is simply due to chance. In the context of life expectancy data, statistical significance is used to assess whether a relationship or difference between two groups (e.g., smokers vs. non-smokers) is likely to be meaningful or whether it could have occurred by random chance.
Confidence Intervals: Confidence intervals provide a range of values within which a population parameter, such as the mean or median, is likely to fall. It provides a measure of the uncertainty associated with an estimate. In the context of life expectancy data, a confidence interval can be used to estimate the range within which the true difference in life expectancy between two groups lies.
For example, the data analysts might calculate a 95% confidence interval for the difference in life expectancy between smokers and non-smokers. Let's say they find that the confidence interval is two to five years. This means that they’re 95% confident that the true difference in life expectancy between smokers and non-smokers falls within this range.
The confidence interval provides a measure of the precision of the estimate. A narrower confidence interval indicates more precise estimates, whereas a wider interval indicates greater uncertainty or variability in the data.
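The sketch below shows one way such numbers could be produced: a two-sample t-test for statistical significance and an approximate 95% confidence interval for the difference in means. The dataset and column names are assumptions, and the interval uses a simple normal approximation:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical individual-level data; column names are assumptions for this sketch
df = pd.read_csv("life_expectancy.csv")
smokers = df.loc[df["smoker"] == 1, "life_expectancy"]
non_smokers = df.loc[df["smoker"] == 0, "life_expectancy"]

# Statistical significance: two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(non_smokers, smokers, equal_var=False)
print(f"p-value = {p_value:.4f}")  # small p-value -> unlikely to be due to chance alone

# Approximate 95% confidence interval for the difference in mean life expectancy
diff = non_smokers.mean() - smokers.mean()
se = np.sqrt(non_smokers.var(ddof=1) / len(non_smokers) + smokers.var(ddof=1) / len(smokers))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
print(f"difference = {diff:.1f} years, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```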
Overgeneralization refers to drawing broad conclusions based on limited or insufficient data. As policymakers, you must be very careful about which group or cohort the data is referring to and how it links to the general population.
For instance, someone might observe a correlation between exercise habits and longer life expectancy in a study and then overgeneralize by assuming that all individuals who exercise regularly will have extended life spans. However, this overlooks other crucial factors such as genetics, lifestyle choices and access to healthcare that also influence life expectancy.
Confirmation bias occurs when individuals seek or interpret data in a way that confirms their pre-existing beliefs or expectations while disregarding contradictory evidence. Therefore, it’s essential to question your own intentions related to the results of data analysis. In general, data analysis should follow sound and objective statistical methods with an open-ended result. All too often, however, statistics are used solely to confirm the position of whoever produced them.
For example, if someone holds the belief that genetics is the sole determinant of life expectancy, they may search for and interpret data that supports this notion while disregarding evidence that highlights the influence of other factors like behavior or environmental factors.
Neglecting context involves interpreting data without considering the broader context or the complexities surrounding the topic. Oversimplifying or only partially understanding the factors that influence your target variable, such as life expectancy, can lead to wrong conclusions.
For instance, let's say a study finds a correlation between income levels and life expectancy. Neglecting context would involve solely attributing differences in life expectancy to income while disregarding the potential influence of factors like access to healthcare, education, lifestyle choices or environmental factors that often interact with income to shape life expectancy outcomes.
Sampling bias occurs when the sample used in a study or analysis is not representative of the target population, leading to inaccurate conclusions. This can happen, for example, if a study sample disproportionately includes individuals from certain demographics or geographic areas.
For example, if a study on life expectancy only includes participants from a specific age group or a particular socioeconomic background, the findings may not be applicable to the entire population. The conclusions drawn from such a sample would be limited in their generalizability and could result in biased interpretations as mentioned above.
Selection bias arises when the selection of participants or data points is not random or representative. This typically happens when certain individuals or groups are systematically excluded or included in the study based on specific criteria.
For instance, if your data analysis on life expectancy only includes individuals who voluntarily participate or only includes those who have access to healthcare services, the findings may not accurately represent the entire population. This bias can lead to misleading interpretations of life expectancy patterns and factors.
Confounding variables are variables that are not the main focus of analysis but can influence the relationship between the variables being studied.
For example, if your data analysts find a correlation between higher life expectancy and the consumption of a particular food item, it may be tempting to conclude that the food item directly causes increased life expectancy. However, there might be confounding variables at play. People who consume the food item may also have higher incomes, better access to healthcare or engage in other health-conscious behaviors that contribute to longer life expectancy.
To address confounding variables, data analysts employ various techniques such as statistical adjustments, stratification or regression analysis to isolate the effects of the variables of interest. However, it can be challenging to fully eliminate the impact of all confounding variables, and their presence may still introduce bias and affect the interpretation of data.
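One common form of such an adjustment is to include the suspected confounders as additional variables in a regression model. The sketch below (with hypothetical file and column names) compares the naive association with the adjusted one:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset; file and column names are assumptions for this sketch
df = pd.read_csv("life_expectancy.csv")

# Naive model: association between consumption of the food item and life expectancy
naive = smf.ols("life_expectancy ~ food_item_consumption", data=df).fit()

# Adjusted model: the same association while holding income and healthcare access constant
adjusted = smf.ols(
    "life_expectancy ~ food_item_consumption + income + healthcare_access", data=df
).fit()

# If the coefficient shrinks substantially after adjustment, much of the naive
# association was driven by the confounders rather than by the food item itself
print("naive:   ", naive.params["food_item_consumption"])
print("adjusted:", adjusted.params["food_item_consumption"])
```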
Incomplete data refers to missing or unavailable information, which can hinder the accurate interpretation of data. In the context of life expectancy, incomplete data can arise due to several reasons, including variations in data collection methods across different regions or countries, underreporting of deaths, errors in recording birth or death dates or gaps in historical data. Incomplete data can lead to biased conclusions and inaccurate assessments. Often, this deeply affects already marginalized groups, as they encounter even greater barriers to meaningful participation in society, at times being completely excluded from the data collection efforts.
For instance, if data from people in rural areas are missing, it may result in an underestimation or overestimation of life expectancy for those groups, which might lead to inadequate policy decisions in these areas. In addition, if data collection methods change over time, it becomes challenging to compare life expectancy trends accurately.
To mitigate the impact of incomplete data, data analysts may use statistical techniques like imputation to estimate missing values or employ data validation methods to ensure the reliability of the available data. However, these methods may introduce their own limitations and biases (see article on “Pre-Process Data”).
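As a simple illustration, the sketch below fills missing numeric values with the column mean using scikit-learn. The file and column names are assumptions, and the comments flag the limitations such a crude imputation introduces:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with gaps; file and column names are assumptions for this sketch
df = pd.read_csv("life_expectancy.csv")
print("missing values per column:\n", df.isna().sum())

# Replace missing numeric values with the column mean; crude but common.
# Note: mean imputation understates variability and can bias estimates if the
# data are not missing at random (e.g. whole rural regions going unreported).
imputer = SimpleImputer(strategy="mean")
numeric_cols = ["life_expectancy", "income", "alcohol_consumption"]
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```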
Comparison is an effective method to interpret data and understand patterns within a dataset. When it comes to life expectancy data, the ability to compare and contrast across different populations, regions or time periods can provide valuable information about health disparities, socio-economic factors and the impact of various interventions or policies. Based on the example of life expectancy, you could, for instance, compare life expectancy across regions or countries, across demographic or socioeconomic groups, over time, and before and after a specific policy or intervention, as illustrated in the sketch below.
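A minimal sketch of such comparisons with pandas might look as follows; the file and column names (region, year, life_expectancy) are assumptions for this example:

```python
import pandas as pd

# Hypothetical dataset; file and column names are assumptions for this sketch
df = pd.read_csv("life_expectancy.csv")

# Compare across regions: average life expectancy per region in the latest year
latest = df[df["year"] == df["year"].max()]
print(latest.groupby("region")["life_expectancy"].mean().sort_values())

# Compare over time: change in average life expectancy per region between
# the earliest and latest year in the data
by_year = df.groupby(["region", "year"])["life_expectancy"].mean().unstack("year")
print(by_year[by_year.columns.max()] - by_year[by_year.columns.min()])
```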
As mentioned in the introduction to this article, analysing data is typically a task for data analysts and not policymakers. However, even if you’re missing the technical knowledge to conduct the actual analysis, you should still be able to understand the main considerations when analysing data, so you can request the right types of analysis.
The following questions might help you to ensure your data analysis is done correctly:
Which questions should the analysis answer, and are they clearly defined?
Is a descriptive analysis sufficient, or do you need inference or prediction?
Is the data quantitative, qualitative or both, and is the chosen method appropriate for it?
How will you balance the accuracy of the analysis against its interpretability?
After reading this article, you should be aware of the many pitfalls when interpreting data. As a policymaker, it is crucial to mitigate the risk of misinterpreting your data analysis as much as possible to ensure no wrong conclusions will be drawn.
The following questions might help you to produce accurate and objective interpretations:
Are you confusing correlation with causation?
Is the sample representative of the population you want to draw conclusions about?
Could confounding variables, sampling or selection bias, or incomplete data be driving the result?
Are you interpreting the result in its broader context, or merely confirming what you already believed?
After successfully working through the validation, analysis, and interpretation of your data, learn how to visualize your data for effective communication.