What is EDA?
EDA stands for Exploratory Data Analysis. The overall goal of EDA (in my opinion) is to get a “feel” for the data. What is the data trying to tell us? Understanding the data helps us determine what answers we can get from it. These “answers” can lead us to other questions.
My strategy has two parts. First, exploring the data while cleaning. This is the most important part of my EDA process. I say this because you don’t want to be halfway through a project and realize you missed something at the beginning. Plus, we cannot analyze data until it has been cleaned, is free of errors, and any open questions have been addressed.
Exploring the data while cleaning
- Create an Excel spreadsheet or Word document to document your EDA process. Save it in the directory of your project or assignment. It will be helpful if you need to recreate your results in the future, and it serves as a reminder of what you did.
  - Document any changes made to the data.
  - Document the main points from the steps below.
- What is the shape?
  - Number of observations?
  - Format? Long or wide?
  - Is the data set large enough?
- Create a table describing the data set.
  - Variable (column name)
  - Data type (date, continuous, discrete, character, etc.)
  - Description of the variable
- Read any metadata or separate files on the data set.
- Are there any missing values (blanks, NA, periods, etc.)?
  - Ask yourself, why are they missing?
    - Did participants just fail to answer, or is there a bigger problem?
  - How are you going to address the missing information?
    - Impute it
    - Remove it
    - Leave it as is and try to work around it
  - Do you need to consult with anyone else before deciding?
- Look at any date variables.
  - Verify the date format is correct.
- Look at any variables (columns) that could be converted to factors.
  - Are the values consistent?
  - Use n_distinct() from dplyr. I’ve run into situations where a gender field contained M, F, Male, and Female.
- Based on the data types recorded in the table describing the data set, make any data type changes, if necessary.
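To make the cleaning checks above a little more concrete, here is a minimal sketch in base R. The survey data frame, its column names, and its values are all made up for illustration, and the choice of "%Y-%m-%d" as the expected date format is an assumption. Treat it as one possible way to run these checks, not the only one.

```r
# Hypothetical survey data; every column name and value here is made up.
survey <- data.frame(
  id     = 1:6,
  gender = c("M", "F", "Male", "Female", "F", NA),
  visit  = c("2021-01-05", "2021-02-10", "03/15/2021", "2021-03-20", "", "2021-04-01"),
  score  = c(7.5, 8.0, NA, 6.5, 9.0, 7.0),
  stringsAsFactors = FALSE
)

# Shape: number of observations (rows) and variables (columns).
obs_and_vars <- dim(survey)

# A starting point for the table describing the data set.
# The description column still has to be filled in by hand.
data_dictionary <- data.frame(
  variable = names(survey),
  type     = unname(vapply(survey, function(col) class(col)[1], character(1)))
)

# Missing values: count NA and blank strings in each column.
is_blank <- function(col) is.na(col) | trimws(as.character(col)) == ""
missing_counts <- vapply(survey, function(col) sum(is_blank(col)), integer(1))

# Dates: anything that is filled in but does not parse with the
# assumed format gets flagged for a closer look.
parsed_visit <- as.Date(survey$visit, format = "%Y-%m-%d")
bad_dates <- is.na(parsed_visit) & !is_blank(survey$visit)

# Factor candidates: list the distinct values to spot inconsistent coding.
gender_values <- unique(survey$gender[!is.na(survey$gender)])

# Recode the inconsistent values, then convert to a factor.
gender_map <- c(M = "Male", Male = "Male", F = "Female", Female = "Female")
survey$gender <- factor(unname(gender_map[survey$gender]),
                        levels = c("Female", "Male"))

print(obs_and_vars)
print(data_dictionary)
print(missing_counts)
print(survey$visit[bad_dates])
print(gender_values)
```

Whatever you decide about the flagged rows (impute, remove, or leave as is), write that decision down in your EDA document so you can retrace it later.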
Next, exploring the data with visualization and tools
- Summarize the data using functions such as summary() to get basic statistical information.
  - Do the results make sense?
  - Look at the mean and median. Is the data skewed?
  - Outliers? Why do we have outliers?
- Create a contingency table to further explore the data.
- Plot the data to determine the distribution or shape.
  - Referring to the table describing the data set, create a plot based on the data types being explored.
    - One variable
      - Histogram
      - Boxplot
    - Two (multiple) variables
      - Scatter plot
      - Bar chart
      - Heat map
      - Line chart
    - Categorical variables
      - Bar chart
  - What do you see? Trends? Relationships?
- Use correlation to confirm or refute any relationships.
  - Use cor(), ggpairs(), and/or ggcorr() (the latter two are from the GGally package).
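Here is a rough sketch of the summary, contingency table, and correlation steps in base R. It uses mtcars, a data set that ships with R, purely as a stand-in for your own data. The columns I picked and the boxplot rule for flagging outliers are my assumptions, not the only way to do it.

```r
# mtcars ships with R; it stands in for your own data here.
data(mtcars)

# Basic statistical information for one variable.
mpg_summary <- summary(mtcars$mpg)

# Mean vs. median: a mean above the median hints at right skew.
mean_minus_median <- mean(mtcars$mpg) - median(mtcars$mpg)

# Outliers: values beyond the boxplot whiskers (1.5 * IQR rule).
hp_outliers <- boxplot.stats(mtcars$hp)$out

# Contingency table of two discrete variables.
cyl_by_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)

# Pairwise correlations for a few continuous variables.
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])

print(mpg_summary)
print(mean_minus_median)
print(hp_outliers)
print(cyl_by_gear)
print(round(cor_matrix, 2))
```

If you have GGally installed, ggpairs() and ggcorr() give a visual version of the same correlation matrix.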
From the outline above, once all of part one is done, I think the most important thing is plotting the data and looking at the distribution. Visually, it’s easier to see patterns and trends in the data, which can help answer questions.
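As a small illustration of looking at distributions, here is a sketch using base R graphics. Again, mtcars (built into R) is only a stand-in for your own data, and the variables and labels are my own picks.

```r
# mtcars ships with R; it stands in for your own data here.
data(mtcars)

# One variable: histogram and boxplot show the distribution and its spread.
mpg_hist <- hist(mtcars$mpg, main = "Distribution of mpg",
                 xlab = "Miles per gallon")
boxplot(mtcars$mpg, main = "mpg", ylab = "Miles per gallon")

# Two variables: a scatter plot shows whether they move together.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Categorical variable: a bar chart of counts per category.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")
```

When you run this as a script rather than interactively, R writes the plots to a file (Rplots.pdf by default) instead of opening a window.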
As you are completing the EDA, you are trying to look for relationships and ask why those relationships exist in the first place.
Honestly, I think that is the only reason to complete EDA.
Thanks again for reading my blog post. I am working on my narrative and trying to find my voice.