EDA Cheat Sheet
Why we EDA
Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some reasons you can give to convince them it's a good use of time for everyone involved.
Reasons for the analyst
- Identify patterns and develop hypotheses.
- Test technical assumptions.
- Inform model selection and feature engineering.
- Build an intuition for the data.
Reasons for the consumer of the analysis
- Ensures delivery of technically-sound results.
- Ensures the right question is being asked.
- Tests business assumptions.
- Provides context necessary for maximum applicability and value of results.
- Leads to insights that would otherwise not be found.
Things to keep in mind
- You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, and test other questions that come up because of it.
- Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder you're performing the analysis for.
- Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lens of the problem at hand, and you will likely have different areas of investigation.
The plan
Exploratory data analysis consists of the following major tasks, which we present linearly here because each task doesn't make much sense to do without the ones prior to it. However, in reality, you are going to constantly jump around from step to step. You may want to do all the steps for a subset of the variables first, or you might jump back because you learned something and need to have another look.
- Form hypotheses/develop investigation themes to explore
- Wrangle data
- Assess quality of data
- Profile data
- Explore each individual variable in the dataset
- Assess the relationship between each variable and the target
- Assess interactions between variables
- Explore data across many dimensions
Throughout the entire analysis you want to:
- Capture a list of hypotheses and questions that come up for further exploration.
- Record things to watch out for or be aware of in future analyses.
- Show intermediate results to colleagues to get a fresh perspective, feedback, and domain knowledge. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge.
- Position visuals and results together. EDA relies on your natural pattern recognition abilities, so maximize what you'll find by putting visualizations and results in close proximity.
Wrangling
Basic things to do
- Make your data tidy.
  - Each variable forms a column
  - Each observation forms a row
  - Each type of observational unit forms a table
- Transform data: sometimes you will need to transform your data to be able to extract information from it. This step will usually occur after some of the other steps of EDA unless domain knowledge can inform these choices beforehand.
  - Log: when data is highly skewed (rather than bell-shaped like a normal distribution), it sometimes follows a log-normal distribution, and taking the log of each data point will make it approximately normal.
  - Binning of continuous variables: binning continuous variables and then analyzing the resulting groups of observations can make patterns easier to identify, especially with non-linear relationships.
  - Simplifying of categories: you really don't want more than 8-10 categories within a single data field. Try to aggregate to higher-level categories when it makes sense.
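The three transformations above can be sketched in a few lines of pandas. This is only an illustration under assumptions: the data frame, the column names (`revenue`, `country`) and the choice of quartile bins and "top 2" categories are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a right-skewed numeric field and a categorical field.
df = pd.DataFrame({
    "revenue": [1.0, 3.0, 10.0, 30.0, 100.0, 300.0, 1000.0, 3000.0],
    "country": ["US", "US", "US", "DE", "DE", "FR", "BR", "JP"],
})

# Log transform: compresses the long right tail (values must be positive).
df["log_revenue"] = np.log10(df["revenue"])

# Binning: cut the continuous variable into quartile-based groups.
df["revenue_bin"] = pd.qcut(df["revenue"], q=4, labels=["q1", "q2", "q3", "q4"])

# Simplifying categories: keep the most frequent values, lump the rest into "other".
top = df["country"].value_counts().nlargest(2).index
df["country_simple"] = df["country"].where(df["country"].isin(top), "other")
```

How many categories to keep and where to put bin edges are judgment calls; domain knowledge usually gives better groupings than frequency alone.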
Helpful packages
Data quality assessment and profiling
Before trying to understand what information is in the data, make sure you understand what the data represents and what's missing.
Basic things to do
- Categorical: count, count distinct, assess unique values
- Numerical: count, min, max
- Spot-check random samples and samples that you are familiar with
- Slice and dice
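As a rough sketch, the basic checks above map onto a handful of pandas calls. The data frame and its column names (`device`, `sessions`) are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical example data; column names are illustrative only.
df = pd.DataFrame({
    "device": ["ios", "android", "ios", None, "web"],
    "sessions": [3, 12, 7, 1, None],
})

# Categorical: count (non-null), count distinct, and the values themselves.
cat_profile = {
    "count": df["device"].count(),
    "distinct": df["device"].nunique(),
    "values": df["device"].value_counts(dropna=False),
}

# Numerical: count, min, max.
num_profile = df["sessions"].agg(["count", "min", "max"])

# Spot-check a reproducible random sample of rows.
sample = df.sample(n=3, random_state=0)

# Slice and dice: the same numeric profile per category (missing category kept).
by_device = df.groupby("device", dropna=False)["sessions"].agg(["count", "min", "max"])
```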
Questions to consider
- What data isn't there?
- Are there systematic reasons for missing data?
- Are there fields that are always missing at the same time?
- Is there information in what data is missing?
- Is the data that is there right?
- Are there frequent values that are default values?
- Are there fields that represent the same information?
- What timestamp should you use?
- Are there numerical values reported as strings?
- Are there special values?
- Is the data being generated the way you think?
- Are there variables that are numerical but really should be categorical?
- Is data consistent across different operating systems, device type, platforms, countries?
- Are there any direct relationships between fields (e.g. a value of x always implies a specific value of y)?
- What are the units of measurement? Are they consistent?
- Is data consistent across the population and time? (time series)
- Are there obvious changes in reported data around the time of important events that affect data generation (e.g. version release)? (panel data)
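A few of these questions can be answered mechanically. The sketch below, on made-up data, looks at the share of missing values per field, whether two fields go missing together, whether a string field is really numeric, and whether one value is suspiciously frequent. The column names and the idea that `0` might be a default are assumptions for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "lat": [51.5, None, 40.7, None],
    "lon": [-0.1, None, -74.0, None],
    "age": ["34", "27", "unknown", "41"],   # numbers stored as strings
    "score": [0, 0, 0, 5],                  # 0 might be a default value
})

# Share of missing values per field.
missing_share = df.isna().mean()

# Fields that tend to be missing together: correlate the missingness indicators.
co_missing = df[["lat", "lon"]].isna().astype(int).corr()

# Numerical values reported as strings: what fraction parses as a number?
parsed_age = pd.to_numeric(df["age"], errors="coerce")
numeric_share = parsed_age.notna().mean()

# Candidate default values: share of the single most frequent value.
top_value_share = df["score"].value_counts(normalize=True).iloc[0]
```

A high share for one value does not prove it is a default; it only tells you where to ask how the data was generated.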
Helpful packages
Example backlog
- Assess the prevalence of missing data across all data fields, assess whether the missingness is random or systematic, and identify patterns in when such data is missing
- Identify any default values that imply missing data for a given field
- Determine sampling strategy for quality assessment and initial EDA
- For datetime data types, ensure consistent formatting and granularity of data, and perform sanity checks on all dates present in the data.
- In cases where multiple fields capture the same or similar information, understand the relationships between them and assess the most effective field to use
- Assess data type of each field
- For discrete value types, ensure data formats are consistent
- For discrete value types, assess the number of distinct values and percent unique, and perform a sanity check on the types of answers
- For continuous data types, assess descriptive statistics and perform a sanity check on values
- Understand relationships between timestamps and assess which to use in analysis
- Slice data by device type, operating system, software version and ensure consistency in data across slices
- For device or app data, identify version release dates and assess data for any changes in format or value around those dates
Exploration
After quality assessment and profiling, exploratory data analysis can be divided into 4 main types of tasks:
- Exploration of each individual variable
- Assessment of the relationship between each variable and the target variable
- Assessment of the interaction between variables
- Exploration of data across many dimensions
1. Exploring each individual variable
Basic things to do
Quantify:
- Location: mean, median, mode, interquartile mean
- Spread: standard deviation, variance, range, interquartile range
- Shape: skewness, kurtosis
For time series, plot summary statistics over time.
For panel data:
- Plot cross-sectional summary statistics over time
- Plot time-series statistics across the population
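The location, spread and shape measures listed above can be computed as follows. This is a minimal sketch on made-up numbers; the interquartile mean in particular is implemented by dropping the lowest and highest quarter of the sorted values, which is exact only when the sample size is divisible by 4. The rolling mean at the end is one way, among others, to get a summary statistic over time that you can then plot.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one large value in the right tail.
x = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 9.0, 30.0])
n = len(x)
q1, q3 = x.quantile([0.25, 0.75])

# Location
location = {
    "mean": x.mean(),
    "median": x.median(),
    "mode": x.mode().iloc[0],
    # Interquartile mean: mean of the middle 50% of the sorted values
    # (simple version, exact only when n is divisible by 4).
    "interquartile_mean": x.sort_values().iloc[n // 4 : n - n // 4].mean(),
}

# Spread
spread = {"std": x.std(), "variance": x.var(), "range": x.max() - x.min(), "iqr": q3 - q1}

# Shape
shape = {"skewness": x.skew(), "kurtosis": x.kurt()}

# Time series: a summary statistic computed over a moving window.
ts = pd.Series(np.arange(10, dtype=float),
               index=pd.date_range("2024-01-01", periods=10, freq="D"))
rolling_mean = ts.rolling(window=3).mean()
```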
Questions to consider
- What does each field in the data look like?
- Is the distribution skewed? Bimodal?
- Are there outliers? Are they feasible?
- Are there discontinuities?
- Are the typical assumptions seen in modeling valid?
  - Gaussian
  - Identically and independently distributed
  - Have one mode
  - Can be negative
  - Generating processes are stationary and isotropic (time series)
  - Independence between subjects (panel data)
2. Exploring the relationship between each variable and the target
How does each field interact with the target?
Assess each relationship's:
- Linearity
- Direction
- Rough size
- Strength
Methods:
- Bivariate visualizations
- Calculate correlation
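To make "calculate correlation" concrete, here is a small sketch on simulated data (the feature names and the data-generating process are invented). It compares Pearson correlation, which captures linear association, with Spearman rank correlation, which captures any monotone association, and uses a least-squares slope as a rough measure of size and direction.

```python
import numpy as np
import pandas as pd

# Simulated, hypothetical data: the target depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
df = pd.DataFrame({
    "linear_feature": x,
    "monotone_feature": np.exp(x),  # same ordering as x, but strongly non-linear
    "target": 2.0 * x + rng.normal(0, 1, size=200),
})

# Strength and linearity: Pearson (linear) versus Spearman (monotone).
pearson = df.corr(method="pearson")["target"].drop("target")
spearman = df.corr(method="spearman")["target"].drop("target")

# Direction and rough size: slope of a least-squares line.
slope, intercept = np.polyfit(df["linear_feature"], df["target"], deg=1)
```

A large gap between the Spearman and Pearson values for a feature is a hint that the relationship is monotone but not linear, which is worth confirming with a scatter plot.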
3. Assessing interactions between variables
How do the variables interact with each other?
Basic things to do
- Bivariate visualizations for all combinations
- Correlation matrices
- Compare summary statistics of variable x for different categories of y
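A minimal sketch of the last two items, on a made-up data frame with hypothetical columns `plan`, `sessions` and `minutes`:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "plan": ["free", "free", "paid", "paid", "paid", "free"],
    "sessions": [2, 3, 10, 12, 9, 1],
    "minutes": [5, 9, 40, 55, 38, 2],
})

# Correlation matrix over the numeric fields.
corr_matrix = df[["sessions", "minutes"]].corr()

# Summary statistics of x (sessions) for each category of y (plan).
by_plan = df.groupby("plan")["sessions"].describe()
```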
4. Exploring data across many dimensions
Are there patterns across many of the variables?
Basic things to do
- Categorical:
  - Parallel coordinates
- Continuous:
  - Principal component analysis
  - Clustering
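For continuous data, principal component analysis can be sketched directly with NumPy's SVD. This is a hand-rolled illustration on simulated data, not a replacement for a library implementation; the latent-factor setup is an assumption chosen so that one component dominates.

```python
import numpy as np

# Simulated, hypothetical data: 3 observed variables driven mostly by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = latent @ np.array([[1.0, 0.8, -0.6]]) + 0.1 * rng.normal(size=(300, 3))

# Standardize each column, then take the SVD of the standardized matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Observations projected onto the principal components.
scores = Z @ Vt.T
```

If the first one or two components explain most of the variance, plotting the corresponding columns of `scores` against each other is a cheap way to look for clusters or outliers across many variables at once.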
Helpful packages
- ipywidgets: making function variables interactive for visualizations and calculations
- mpld3: interactive visualizations
Example backlog
- Generate list of questions and hypotheses to be considered during EDA
- Create univariate plots for all fields
- Create bivariate plots for each combination of fields to assess correlation and other relationships
- Plot summary statistics over time for time series data
- Plot distribution of x for different categories y
- Plot mean/median/min/max/count/distinct count of x over time for different categories of y
- Capture list of hypotheses and questions that come up during EDA
- Record things to watch out for or be aware of in future analyses
- Distill and present findings
Visualization guide
Here are the types of visualizations and the Python packages we find most useful for data exploration.
Univariate
- Categorical:
- Continuous:
Bivariate
- Categorical x categorical
- Categorical x continuous
  - Box plots of continuous for each category
  - Violin plots of continuous distribution for each category
  - Overlaid histograms (if 3 or fewer categories)
- Continuous x continuous
Multivariate
Time series
- Line plots
- Any bivariate plot with time or time period as the x-axis.
Panel data
- Heat map with rows denoting observations and columns denoting time periods.
- Multiple line plots
- Strip plot where time is on the x-axis, each entity has a constant y-value and a point is plotted every time an event is observed for that entity.
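The heat map layout described above needs the data in wide form, with one row per entity and one column per time period. A minimal sketch of that reshaping step, assuming a long-format table with hypothetical columns `user`, `week` and `events`:

```python
import pandas as pd

# Hypothetical long-format panel: one row per (entity, period) observation.
long = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c"],
    "week": [1, 2, 1, 2, 2],
    "events": [3, 5, 0, 2, 7],
})

# Rows = entities, columns = time periods; unobserved combinations stay NaN.
wide = long.pivot(index="user", columns="week", values="events")
```

The resulting `wide` table can then be handed to a heat map function in the plotting package of your choice. Leaving unobserved cells as NaN rather than filling them with 0 keeps "no data" visually distinct from "zero events".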
Geospatial
- Choropleths: regions colored according to their data value.
Helpful packages
- matplotlib: basic plotting
- seaborn: prettier versions of some matplotlib figures
- mpld3: interactive plotting
- folium: geospatial plotting