Meeting the Data Visualization Criteria for Certification
The most common reason for failing any of our certification practical exams (Data Analyst or Scientist, Associate or Professional) is not meeting the Data Visualization criteria. So let's dive in, look at what we are looking for, and make sure you don't run into the same problem.
Let's start by reviewing the criteria. All of our certifications require that you show the same three things:
- Has created at least two different visualizations of single variables (e.g. histogram, bar chart, single boxplot)
- Has created at least one visualization including two or more variables (e.g. scatterplot, filled barchart, multiple boxplots)
- Has used visualizations that support the findings being presented
Our grading team are looking to see that you know how to create different types of visualizations and that you can pick the right data and situation for them.
Graphics for Single Variables
The first thing we are looking for is two graphics that each tell us about a single variable. You can pick any graphics you want: histogram, box plot, bar chart, density plot, dot plot, and so on. The only requirement is that each one includes just one variable.
Let's take a look at some examples.
I am going to use the Toyota used car data that is included in our Professional level sample case studies. You can see the columns that are included in the data below.
library(tidyverse)
# Read the Toyota used car data and show its columns
cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")
glimpse(cars)
Graphics that would meet the single variable criteria include a bar chart that shows counts of each category:
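Here is a minimal sketch of what such a bar chart could look like in ggplot2. I am assuming the data has a categorical column called `model`; that name is my assumption, so check it against your own column list.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Bar chart of a single variable: number of cars for each model
# (the column name `model` is an assumption)
p_bar <- ggplot(cars, aes(x = model)) +
  geom_bar() +
  coord_flip() +  # flip so the model names are easier to read
  labs(x = "Model", y = "Number of cars")

p_bar
```

Only `model` is mapped here; the counts are calculated by `geom_bar()` itself, so this is still a single variable graphic.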
A histogram of the distribution of a numeric variable:
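As a rough sketch, a histogram only needs one numeric column. I am assuming a column called `price` here, and the choice of 30 bins is arbitrary.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Histogram of a single numeric variable
# (the column name `price` is an assumption; 30 bins is an arbitrary choice)
p_hist <- ggplot(cars, aes(x = price)) +
  geom_histogram(bins = 30) +
  labs(x = "Price", y = "Number of cars")

p_hist
```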
I could create a boxplot, but just for one variable:
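A single boxplot could be sketched like this. The column name `mileage` is my assumption; any numeric column would work the same way.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# One boxplot for one numeric variable, with no grouping variable
# (the column name `mileage` is an assumption)
p_box <- ggplot(cars, aes(y = mileage)) +
  geom_boxplot() +
  labs(y = "Mileage")

p_box
```

Because nothing is mapped to the other axis, you get one box for the whole data set rather than one per group.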
Or an area chart, like this one that shows the number of cars manufactured in each year:
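One way to sketch this is to count the cars per year first and then draw the counts as an area. I am assuming a numeric column called `year`; the column `n` is created by `count()`.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Count the cars for each year (the column name `year` is an assumption)
cars_per_year <- cars %>%
  count(year)

# Area chart of those counts; only `year` comes from the original data
p_area <- ggplot(cars_per_year, aes(x = year, y = n)) +
  geom_area() +
  labs(x = "Year", y = "Number of cars")

p_area
```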
Whatever you pick, the important thing to remember is that you are only using one single variable. If you add another variable to show differences between groups, that won't count.
Remember, just one variable for this criterion. And we want to see two different types of graphics, so they can't both be bar charts.
Graphics for Multiple Variables
The second criterion is the most fun. For this one we are looking to see that you can create graphics that include multiple variables - two, three, four, ten! As many as you want, but I would suggest you keep to two or three so we can also interpret your graphic!
This is where you can get out your filled/stacked bar charts, your scatter plots, your heatmaps, your panelled/faceted graphics. Let's take a look at a few using the Toyota data.
Maybe I want to talk about the relationship between price and the model of car...
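One possible sketch for this is a set of boxplots, one per model. The column names `model` and `price` are assumptions on my part.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Multiple boxplots: the distribution of price for each model
# (the column names `model` and `price` are assumptions)
p_price_model <- ggplot(cars, aes(x = model, y = price)) +
  geom_boxplot() +
  coord_flip() +  # flip so the model names are easier to read
  labs(x = "Model", y = "Price")

p_price_model
```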
Or the relationship between price and mileage:
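A scatter plot is the obvious choice for two numeric variables. This is a minimal sketch assuming columns called `mileage` and `price`; the transparency value is just a choice to reduce overplotting.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Scatter plot of two numeric variables
# (the column names `mileage` and `price` are assumptions)
p_scatter <- ggplot(cars, aes(x = mileage, y = price)) +
  geom_point(alpha = 0.3) +  # semi-transparent points to reduce overplotting
  labs(x = "Mileage", y = "Price")

p_scatter
```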
Now is the time to use bar charts that include additional variables, like this one that shows that cars with larger engines are less likely to be petrol.
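A filled bar chart along these lines could be sketched as follows. I am assuming columns called `engineSize` and `fuelType`, and I convert engine size to a factor so that each size gets its own bar.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Filled bar chart: the share of each fuel type within each engine size
# (the column names `engineSize` and `fuelType` are assumptions)
p_fill <- ggplot(cars, aes(x = factor(engineSize), fill = fuelType)) +
  geom_bar(position = "fill") +  # scale every bar to a proportion of 1
  labs(x = "Engine size", y = "Proportion of cars", fill = "Fuel type")

p_fill
```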
A slightly less obvious example would be this next one. For this bar chart I first calculated the average price for each model and then displayed those two variables in a bar chart. This might look like a plot of a single variable, but it is in fact showing two. So it counts for this second criterion.
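The two steps described here, summarise first and then plot, might look something like this sketch. The column names `model` and `price` are assumptions, and `avg_price` is a name I have made up for the summary.

```r
library(tidyverse)

cars <- read_csv("https://s3.amazonaws.com/talent-assets.datacamp.com/toyota.csv")

# Step 1: average price for each model
# (the column names `model` and `price` are assumptions)
avg_price <- cars %>%
  group_by(model) %>%
  summarise(avg_price = mean(price))

# Step 2: bar chart of the two variables, model and average price
p_avg <- ggplot(avg_price, aes(x = model, y = avg_price)) +
  geom_col() +
  coord_flip() +
  labs(x = "Model", y = "Average price")

p_avg
```

I use `geom_col()` rather than `geom_bar()` because the bar heights come from a column I calculated, not from counting rows.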
The last thing you need to do is to make sure that the graphics you include support the findings you are writing about or talking about.
If you are telling us about how price decreases as mileage increases, we want to see a graph that shows that - like the scatter plot I created above.
The easiest way to do this is to think about the types of graphics that you could include that relate to the questions we have given you. We might have asked you about how model and fuel type are related, but you could first tell us about the models, then the fuel types and then the combination. Add a short description for each and you will have met all of the criteria for visualizations!