How to Handle Missing Data in Your Research (Without Ruining Your Results)


Missing data is one of those problems that every quantitative researcher encounters and too few handle well. Whether you are running a survey, analyzing secondary data, or conducting a longitudinal study, some participants will skip questions, drop out, or provide unusable responses. The question is never whether you will have missing data. The question is whether your approach to handling it introduces bias, reduces power, or quietly undermines the conclusions you draw.

The stakes are real. A poorly chosen missing data strategy can shrink your sample size by half, bias your estimates in unpredictable directions, and lead reviewers to question the validity of your entire analysis. A well-chosen strategy preserves statistical power, reduces bias, and demonstrates methodological rigor. This guide covers the types of missingness, the common approaches (good and bad), and the practical steps for handling missing data responsibly.

Why Missing Data Matters More Than You Think

Consider a straightforward example. You survey 500 healthcare workers about burnout and job satisfaction. Of those 500 respondents, 120 left the burnout scale incomplete. If you simply delete those 120 cases and analyze the remaining 380, you have lost nearly a quarter of your data. But the bigger problem is not the lost sample size. The bigger problem is why those 120 people did not complete the burnout items. If the most burned-out workers were the ones who skipped the burnout questions (perhaps because they were too exhausted or disengaged to finish the survey), then your remaining sample systematically underrepresents the very group you most need to understand. Your burnout estimates will be biased downward, and your conclusions will be wrong.

This is why understanding the mechanism behind your missing data is the essential first step. Before choosing a handling strategy, you need to understand the different types of missingness and what they imply for your analysis. Resources like the statistical decision tools at Stats for Scholars can help you think through how different data conditions affect your choice of analytical method.

The Three Types of Missingness

Statisticians classify missing data into three categories based on the relationship between the missingness and the data values. These classifications, formalized by Donald Rubin in the 1970s, determine which handling methods will produce valid results.

Missing Completely at Random (MCAR)

Data is Missing Completely at Random when the probability of a value being missing is unrelated to both the observed and unobserved data. In other words, the missingness is purely random, like data lost because a server crashed, a page of a paper survey got stuck together, or a lab instrument malfunctioned on random occasions.

MCAR is the most optimistic scenario because the incomplete cases are a random subset of the full sample. Deleting incomplete cases under MCAR does not introduce bias (though it does reduce power). However, true MCAR is rare in practice. Most missingness has some systematic component.

Plain-language example: A research assistant accidentally spills coffee on 15 paper surveys, making them unreadable. The spill has nothing to do with how those participants responded. That is MCAR.

You can partially test for MCAR using Little's MCAR test, which examines whether the pattern of missingness is related to observed values. A nonsignificant result is consistent with MCAR, though it does not definitively prove it.

Missing at Random (MAR)

Data is Missing at Random when the probability of missingness is related to observed data but not to the missing values themselves, after controlling for observed variables. Despite the confusing name, MAR data is not randomly missing. The missingness is systematic but can be predicted from other variables you have measured.

Plain-language example: In your burnout survey, younger workers are more likely to skip the burnout items than older workers, regardless of their actual burnout levels. Age predicts missingness, but once you account for age, the remaining missingness is random. Since you measured age, you can adjust for this pattern.

MAR is the assumption that underlies most modern missing data methods. It is also the most common real-world scenario, though it is not directly testable because you cannot observe the missing values to confirm they are unrelated to missingness after controlling for observed variables.

Missing Not at Random (MNAR)

Data is Missing Not at Random when the probability of missingness depends on the unobserved values themselves, even after controlling for all observed variables. This is the most problematic scenario because the missing data mechanism is directly tied to the thing you cannot see.

Plain-language example: Participants with the highest burnout levels skip the burnout questions specifically because they are too burned out to complete them. No amount of controlling for observed variables (age, gender, department) fully explains the pattern because the missingness depends on the burnout values themselves, which are the very values that are missing.

MNAR is the hardest to handle and often requires specialized models (pattern mixture models, selection models) or sensitivity analyses to explore how different assumptions about the missing values would change your conclusions.

Common Approaches That Often Make Things Worse

Listwise Deletion (Complete Case Analysis)

Listwise deletion removes any case with missing data on any variable in the analysis. If a participant is missing one value out of twenty variables, the entire case is dropped.

Problems with listwise deletion:

  • Dramatically reduces sample size, especially when missingness is spread across multiple variables
  • Wastes the observed data from incomplete cases
  • Produces biased estimates unless data is MCAR (which is rare)
  • Reduces statistical power, making it harder to detect real effects

Listwise deletion remains the default in most statistical software, which means researchers who do not actively choose a better method end up using it by accident. This is one of the most common methodological weaknesses reviewers identify.

Pairwise Deletion

Pairwise deletion uses all available data for each specific analysis, so a correlation between variables A and B uses all cases with both A and B, even if those cases are missing data on variable C. This preserves more data than listwise deletion but creates a new problem: each analysis is based on a different subset of cases. This can produce correlation matrices that are mathematically impossible (not positive definite), which causes downstream problems for multivariate analyses.

Mean Substitution

Mean substitution replaces missing values with the variable's sample mean. While it preserves the sample size and does not change the variable mean, it artificially reduces variability (the variance is underestimated because you have replaced genuinely variable data points with a constant). It also distorts relationships between variables by pulling correlations toward zero. Mean substitution is widely regarded as one of the worst approaches to missing data and should be avoided.
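The distortion is easy to demonstrate. The following base-R sketch uses simulated (hypothetical) data: it deletes a quarter of one variable at random, fills the gaps with the observed mean, and shows that the variance shrinks and the correlation is pulled toward zero.

```r
# Demonstration: mean substitution shrinks variance and attenuates
# correlations (simulated data, base R only)
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)
y <- 0.6 * x + rnorm(200, sd = 8)

y_missing <- y
y_missing[sample(200, 50)] <- NA   # remove 25% of y completely at random
y_mean_sub <- ifelse(is.na(y_missing),
                     mean(y_missing, na.rm = TRUE),  # fill with the mean
                     y_missing)

var(y)              # variance of the full data
var(y_mean_sub)     # noticeably smaller after mean substitution
cor(x, y)           # full-data correlation
cor(x, y_mean_sub)  # pulled toward zero
```

Every substituted value sits exactly at the mean, so it contributes nothing to the variance and nothing to the covariance with x, which is precisely why both quantities are understated.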

Better Approaches to Missing Data

Multiple Imputation

Multiple imputation (MI) is the gold standard for handling missing data under the MAR assumption. Rather than replacing each missing value with a single guess, MI creates multiple complete versions of the dataset (typically 5 to 50), each with slightly different plausible values for the missing data points. The analysis is then run on each imputed dataset, and the results are pooled using rules developed by Rubin.

The process involves three steps:

  1. Imputation: Generate m complete datasets where missing values are replaced with values drawn from a predictive distribution based on observed data. The imputation model should include all variables in your analysis model plus any auxiliary variables that predict missingness or the missing values.

  2. Analysis: Run your planned statistical analysis (regression, t-test, ANOVA, etc.) separately on each of the m imputed datasets.

  3. Pooling: Combine the m sets of results using Rubin's rules, which account for both within-imputation variability (the usual sampling error) and between-imputation variability (the uncertainty introduced by imputation). The pooled estimates, standard errors, and confidence intervals properly reflect the additional uncertainty due to missing data.
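The pooling arithmetic in step 3 is simple enough to do by hand. In base R, with hypothetical estimates and standard errors for one coefficient from m = 5 imputed datasets, Rubin's rules look like this:

```r
# Rubin's rules by hand, using hypothetical results for one
# coefficient across m = 5 imputed datasets
est <- c(0.42, 0.45, 0.40, 0.44, 0.43)  # estimate from each dataset
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)  # its standard error in each

m     <- length(est)
q_bar <- mean(est)               # pooled point estimate
w_bar <- mean(se^2)              # within-imputation variance
b     <- var(est)                # between-imputation variance
t_var <- w_bar + (1 + 1/m) * b   # total variance (Rubin's rules)
pooled_se <- sqrt(t_var)

q_bar
pooled_se   # larger than the average SE: imputation uncertainty is included
```

In practice, pooling functions such as mice's pool() do this (plus degrees-of-freedom adjustments) for you; the point of the sketch is that the total variance explicitly adds the between-imputation spread on top of the ordinary sampling error.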

Multiple imputation produces unbiased estimates under MAR, maintains appropriate standard errors (unlike single imputation methods), and preserves statistical power. It works with virtually any analysis type and is available in all major statistical packages.

When planning a study, anticipating attrition and missing data is part of responsible sample size planning. The Sample Size Calculator can help you determine how many participants to recruit, and you should inflate that number by your expected rate of missingness to ensure you retain adequate power after accounting for incomplete data.

Maximum Likelihood Estimation

Maximum likelihood (ML) estimation handles missing data by estimating model parameters that maximize the likelihood of the observed data, using all available information without deleting cases or imputing values. Full Information Maximum Likelihood (FIML) is the most common variant for structural equation modeling and related techniques.

FIML estimates a separate likelihood function for each pattern of observed data and combines them into an overall likelihood. Like multiple imputation, FIML produces unbiased estimates under MAR and uses all available data. It tends to be computationally simpler than MI (no need to create and combine multiple datasets) and produces a single set of results.

The main limitation is that FIML is built into specific modeling frameworks (primarily structural equation modeling software like Mplus, lavaan in R, and AMOS) and may not be available for all analysis types. For standard regression, ANOVA, or nonparametric tests, multiple imputation is typically more flexible.

Expectation-Maximization (EM) Algorithm

The EM algorithm iteratively estimates means, variances, and covariances by alternating between estimating expected values of missing data (E-step) and re-estimating parameters using the completed data (M-step). EM produces good point estimates but underestimates standard errors because it treats imputed values as known. For this reason, EM is best used as a diagnostic tool or preliminary step rather than a final analysis method.
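To make the E-step/M-step idea concrete, here is a minimal, illustrative EM for a toy case: a bivariate normal where x is fully observed and y has values missing completely at random. The simulated data and all names are hypothetical, and real analyses should use established implementations. Notice what the output contains: means and a covariance matrix, but no standard errors, which is exactly the limitation that makes EM a diagnostic rather than a final method.

```r
# Toy EM for a bivariate normal: x fully observed, y partially
# missing (simulated, hypothetical example; base R)
set.seed(1)
n <- 200
x <- rnorm(n, 50, 10)
y <- 2 + 0.5 * x + rnorm(n, 0, 3)
y[sample(n, 60)] <- NA   # 30% of y missing, completely at random

obs   <- !is.na(y)
mu    <- c(mean(x), mean(y[obs]))       # start from complete cases
sigma <- cov(cbind(x, y)[obs, ])

for (iter in 1:500) {
  beta      <- sigma[1, 2] / sigma[1, 1]
  resid_var <- sigma[2, 2] - beta^2 * sigma[1, 1]
  # E-step: expected y and y^2 for the missing cases
  y_hat  <- ifelse(obs, y, mu[2] + beta * (x - mu[1]))
  y2_hat <- ifelse(obs, y^2, y_hat^2 + resid_var)
  # M-step: update means, variances, covariance (ML versions)
  mu_new <- c(mean(x), mean(y_hat))
  s_xx <- mean(x^2) - mu_new[1]^2
  s_xy <- mean(x * y_hat) - mu_new[1] * mu_new[2]
  s_yy <- mean(y2_hat) - mu_new[2]^2
  sigma_new <- matrix(c(s_xx, s_xy, s_xy, s_yy), 2)
  done  <- max(abs(mu_new - mu)) < 1e-10
  mu    <- mu_new
  sigma <- sigma_new
  if (done) break
}

mu      # EM estimates of the means (point estimates only)
sigma   # EM estimate of the covariance matrix
```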

Practical Steps for Handling Missing Data

Step 1: Assess the Extent of Missingness

Before choosing a strategy, quantify the problem:

  • What percentage of cases have at least one missing value?
  • What percentage of values is missing for each variable?
  • Are certain variables or items much more prone to missingness than others?

As a rough guideline, less than 5% missing data on a variable is usually manageable with most methods. Between 5% and 20% requires careful attention to the method chosen. Above 20% raises serious concerns about the validity of any results, regardless of method.
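These quantities take only a couple of lines in base R. The data frame below is a small made-up example; substitute your own.

```r
# Quantifying missingness in a data frame (base R; 'df' is a
# small hypothetical example with NA values)
df <- data.frame(
  age     = c(34, 51, NA, 42, 29, NA),
  burnout = c(3.2, NA, 4.1, NA, 2.8, 3.9),
  satisf  = c(4, 5, 3, 4, NA, 2)
)

# Percentage of cases with at least one missing value
mean(!complete.cases(df)) * 100

# Percentage of values missing for each variable
colMeans(is.na(df)) * 100
```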

Step 2: Examine the Pattern of Missingness

Determine whether missingness is monotone (once a variable is missing, all subsequent variables are also missing, as in dropout) or arbitrary (no particular pattern). Create a missing data pattern matrix to visualize which combinations of variables tend to be missing together.
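The mice package's md.pattern() function produces such a matrix directly; a base-R equivalent (using the same hypothetical data frame as above) tabulates each observed pattern, coding observed values as 1 and missing values as 0.

```r
# Tabulating missing data patterns in base R ('df' is a small
# hypothetical example; 1 = observed, 0 = missing)
df <- data.frame(
  age     = c(34, 51, NA, 42, 29, NA),
  burnout = c(3.2, NA, 4.1, NA, 2.8, 3.9),
  satisf  = c(4, 5, 3, 4, NA, 2)
)
pattern <- ifelse(is.na(df), 0, 1)
# Count how many cases share each pattern of observed/missing values
table(apply(pattern, 1, paste, collapse = ""))
```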

Step 3: Test for Mechanisms

Run Little's MCAR test to evaluate whether data is consistent with MCAR. If it is significant, MCAR is unlikely. Compare cases with and without missing data on key demographic and outcome variables using t-tests or chi-square tests. If the groups differ systematically, the data is at least MAR and possibly MNAR.

You cannot definitively prove MAR versus MNAR from observed data alone. If you have strong theoretical reasons to believe that missingness depends on unobserved values, consider MNAR-appropriate methods or at minimum conduct sensitivity analyses.
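The group comparison in this step is an ordinary t-test on an indicator of missingness. The sketch below simulates a MAR scenario (hypothetical data and parameters: younger workers are more likely to skip the burnout items) and then asks whether incomplete cases differ in age.

```r
# Comparing cases with and without missing burnout scores on an
# observed variable (simulated MAR scenario; base R)
set.seed(7)
n       <- 300
age     <- rnorm(n, 40, 10)
burnout <- rnorm(n, 3, 1)

# Younger workers are more likely to skip the burnout items (MAR)
miss <- rbinom(n, 1, plogis(2 - 0.06 * age)) == 1
burnout[miss] <- NA

# Do cases with missing burnout differ in age from complete cases?
t.test(age ~ is.na(burnout))
```

A significant difference here rules out MCAR and tells you that age belongs in your imputation model; it cannot, however, distinguish MAR from MNAR.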

Step 4: Choose and Implement a Method

For most research scenarios:

  • If data is plausibly MCAR and missingness is less than 5%, listwise deletion is defensible (though MI is still better)
  • If data is plausibly MAR (the most common scenario), use multiple imputation or FIML
  • If data may be MNAR, use MI or FIML as your primary approach and conduct sensitivity analyses to assess how results change under different MNAR assumptions

When choosing between MI and FIML, consider your software, your analysis type, and your comfort level. Both produce comparable results under MAR. MI is more flexible across analysis types; FIML is more elegant within structural equation modeling.

Step 5: Report Transparently

Your methods section should include the amount and pattern of missing data, the assumed mechanism with justification, results of diagnostic tests, the handling method chosen and why, implementation details (number of imputations, software used), and any sensitivity analyses. Reviewers increasingly expect missing data to be addressed explicitly.

Software-Specific Guidance

SPSS

SPSS handles missing data through multiple avenues. The base package defaults to listwise deletion, which means you must actively choose a better option. The Missing Values add-on module provides Little's MCAR test, EM estimation, pattern analysis, and multiple imputation (Analyze > Multiple Imputation). Specify the number of imputations (20 is a reasonable default) and include all analysis variables plus relevant auxiliary variables; SPSS creates a stacked dataset with an imputation number variable. Then run your analysis on that stacked dataset: procedures that support pooling report combined estimates automatically. (Analyze > Multiple Imputation > Analyze Patterns is for diagnosing missingness, not for running pooled analyses.) For detailed guidance on navigating SPSS statistical procedures, the software guides at Stats for Scholars walk through common analyses step by step.

R

R offers the most comprehensive missing data tools. The mice package (Multivariate Imputation by Chained Equations) is the most widely used for multiple imputation:

library(mice)
# Impute: create 20 completed datasets via predictive mean matching
imp <- mice(data, m = 20, method = "pmm", seed = 12345)
# Analyze: fit the model in each imputed dataset
fit <- with(imp, lm(outcome ~ predictor1 + predictor2))
# Pool: combine results across datasets using Rubin's rules
pooled <- pool(fit)
summary(pooled)

The naniar and visdat packages provide excellent missing data visualization. For FIML in structural equation modeling, the lavaan package supports it natively with the missing = "fiml" argument. The Amelia package offers an alternative MI approach based on a multivariate normal model.

Excel

Excel lacks built-in tools for sophisticated missing data handling. You can identify missing values with COUNTBLANK functions and calculate missingness percentages, but for actual imputation, export your data to SPSS or R. Do not use Excel's fill-down or average functions as makeshift imputation; these are forms of single imputation that will distort your results.

The Qualitative Contrast

It is worth noting that missing data as discussed here is primarily a quantitative concern rooted in the structure of numerical datasets and statistical estimation. Qualitative research faces different data completeness challenges. An interview participant might decline to answer certain questions or a focus group might not cover all planned topics, but these situations are handled through methodological transparency and reflexive analysis rather than statistical techniques. Researchers working with qualitative or mixed methods data will find different frameworks for thinking about data completeness at Qualitative Researchers, which addresses the unique rigor criteria of non-numerical inquiry.

Common Mistakes to Avoid

Ignoring the problem entirely. The worst approach is to not mention missing data at all. Even if your dataset is remarkably complete, say so explicitly. Reviewers will look for it.

Using mean substitution because it is easy. Mean substitution is the statistical equivalent of covering a pothole with a piece of cardboard. It might look fine on the surface, but it weakens the foundation. It attenuates correlations, underestimates variances, and produces artificially narrow confidence intervals.

Running too few imputations. The old advice of 5 imputations is outdated. Current recommendations suggest the number of imputations should be at least equal to the percentage of incomplete cases. If 30% of cases have missing data, use at least 30 imputations. With modern computing, there is little cost to running 50 or more.

Excluding auxiliary variables from the imputation model. Variables that predict missingness or that correlate with the incomplete variables should be included in the imputation model even if they are not in your analysis model. Omitting them makes the MAR assumption less plausible and reduces the quality of imputations.

Assuming MCAR without testing. Do not assume your missing data is random just because you cannot think of a reason it would be systematic. Run Little's MCAR test and compare complete versus incomplete cases on key variables. The results may surprise you.

Reporting imputed results without the original results. Consider including complete-case results as a sensitivity analysis alongside your MI or FIML results. If results are similar across methods, this strengthens confidence; if they differ, discuss what the discrepancy suggests.

To ensure your measures are sound before worrying about missing data patterns, run a reliability analysis on your scales. The Reliability Calculator can help you assess internal consistency and identify problematic items that may be contributing to nonresponse.

Planning Ahead: Prevention Is Better Than Cure

The best missing data strategy is prevention. While you cannot eliminate it entirely, you can minimize it by keeping surveys short, pilot testing instruments to identify confusing items, planning for attrition in longitudinal studies by oversampling at baseline, building in retention strategies like reminders and incentives, and collecting auxiliary variables useful for imputation. If you anticipate 15% attrition, recruit 15% more than your power analysis suggests you need.
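The inflation arithmetic is worth doing explicitly. Dividing the required sample size by (1 - expected attrition) guarantees the target after dropout; the simpler rule of thumb of multiplying by (1 + rate) undershoots slightly. A base-R sketch with a hypothetical power-analysis result:

```r
# Inflating a target sample size for expected attrition
n_required <- 200    # from the power analysis (hypothetical figure)
attrition  <- 0.15   # expected dropout rate

ceiling(n_required / (1 - attrition))   # exact adjustment: recruit this many
ceiling(n_required * (1 + attrition))   # rule of thumb: slightly fewer
```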

Conclusion

Missing data does not have to ruin your research, but handling it carelessly very well might. The field has moved beyond listwise deletion and mean substitution. Modern approaches like multiple imputation and FIML are accessible in all major software packages, well-documented, and expected by reviewers and dissertation committees alike.

Start by understanding why your data is missing. Choose a method appropriate to the mechanism. Report everything transparently. And whenever possible, design your study to minimize the problem before it starts. Your results, and your reviewers, will thank you.