How to pass the Data Validation Criteria in Certification

We previously looked in depth at the Data Visualization criteria for certification. Now it is the turn of Data Validation.

Why validate data?

The best way to understand why this is so important is to think about an example.

In our Data Analyst Associate Sample, we ask you to help the marketing team determine which types of coffee shop they should market their new product in. The data description asks that you remove any rows where the number of reviews is missing. Let's suppose we did not do that. We might go on to tell the marketing team that they should work with a particular type of shop. Months later, that approach is not working and the company is losing money. After a lot of looking back and analyzing why, it turns out it was because you did not remove missing values.

You might think this is an extreme, invented story to scare you, but these are the potential consequences of failing to validate data. While you are studying, the impact is low: you don't quite get the answer you expected. To a business, the impact could be huge. Decisions are made based on inaccurate information, just from not validating and cleaning your data.

What are we looking for in Certification?

Now you know why it is one of the most important parts of any data project, but what do we want to see in your practical exam? Directly from the rubric, we want to see that you have:

validated all variables against provided criteria and where necessary has performed cleaning tasks to result in analysis-ready data

The most common mistake that we see is not telling us about every column.

What sort of validation and cleaning will you need to do?

We have included a couple of problems that you need to fix in every project data set. We could make it really easy for you and tell you exactly where to find them, but we want this to be as real as possible. We will give you the information on how each column should be structured, and you need to make sure that it meets these criteria, just like you will have to in your data jobs.

This could include:

Replacing values with another value
Removing rows that meet (or do not meet) some criteria
Correcting mistakes in the data (e.g. spellings)
Converting data to a different type (convert characters to numbers)
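
As a rough illustration, the four kinds of task above could look like this in Python with pandas. This is only a sketch: the toy data, the column names and the misspelling "Caffe" are all invented for the example and are not taken from any exam data set.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for illustration.
raw = pd.DataFrame({
    "place_type": ["Cafe", "Coffee Shop", "Caffe", "Others"],  # "Caffe" is a typo
    "rating": ["4.5", "4.8", "3.9", "5.0"],                    # numbers stored as text
    "reviews": [120.0, None, 45.0, 300.0],                     # one missing value
    "delivery": [True, None, True, None],                      # missing values present
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply one example of each cleaning task to a copy of df."""
    out = df.copy()
    # Correcting mistakes in the data (a spelling error)
    out["place_type"] = out["place_type"].replace({"Caffe": "Cafe"})
    # Converting data to a different type (characters to numbers)
    out["rating"] = pd.to_numeric(out["rating"])
    # Replacing values with another value (missing becomes False)
    out["delivery"] = out["delivery"].where(out["delivery"].notna(), False).astype(bool)
    # Removing rows that meet some criteria (number of reviews is missing)
    out = out.dropna(subset=["reviews"]).reset_index(drop=True)
    return out

cleaned = clean(raw)
print(cleaned)
```

Working on a copy keeps the original data available, so you can still report how many rows you started with and how many remain.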

For those of you working on the associate level, the instructions we give will usually be more specific - for example, we will probably tell you exactly what to do with a missing value. But if you are working on the professional level, we may expect you to use your own judgement from time to time.

The important thing is that you tell us about every column in the data so we know that you found the problems and fixed them - don't worry, if you read the information we give and look at each column, you will find them. And if you are ever unsure what to do, tell us your thinking and the decision you made.

Making sure you pass - a tip from me!

Since launching certification at DataCamp, I have graded hundreds of submissions. This is one of the hardest criteria to grade because we sometimes have to look over the report several times to find the information.

My tip is to make it really easy for whoever will grade your work. Create a list with one point for each column. That way the grader will be absolutely certain you have looked at every single column, and won't be able to fail you. Not only are you making the grading easier, but it is also easier for you to see what you have done and be certain you have checked every column.
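
One way to get a starting point for such a list is to loop over the columns and collect a few basic facts about each one. The sketch below assumes pandas; the helper name `column_summary` and the sample data are hypothetical. The output is only raw material, and you still need to write what you checked and changed for each column in your own words.

```python
import pandas as pd

def column_summary(df: pd.DataFrame) -> list[str]:
    """Return one line per column with its missing and unique value counts."""
    lines = []
    for name in df.columns:
        col = df[name]
        n_missing = int(col.isna().sum())
        n_unique = int(col.nunique())  # missing values are not counted as a value
        lines.append(f"{name}: {n_missing} missing, {n_unique} unique values")
    return lines

# Hypothetical example data, invented for illustration.
example = pd.DataFrame({
    "region": ["A", "B", "A"],
    "reviews": [10.0, None, 25.0],
})

summary = column_summary(example)
for line in summary:
    print(line)
```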

So, looking at the example for the Data Analyst Associate Certification, here is my solution...

Example Solution

The original data is 200 rows and 9 columns. After validation, there were 198 rows remaining. The following describes what I did to each column:

Region: There were 10 unique regions, as expected
Place name: There were 185 unique place names, suggesting that some names are duplicated. This should be confirmed with the team providing the data
Place type: There are only 4 values for place type: Coffee Shop, Cafe, Espresso Bar and Others. This matches what is expected
Rating: Values range from 3.9 to 5.0, so all are within the range expected
Reviews: I removed rows where the Review value was missing. This was 2 rows, leaving 198 rows of data
Price: There are 3 price categories, as expected
Delivery option: There are 2 delivery options - True/False, as expected
Dine-in option: I converted missing values to False. There were originally no False values
Takeaway option: I converted missing values to False. There were also originally no False values
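
The checks behind a list like this can be sketched in a few lines of pandas. The helper functions, the sample data and the 0 to 5 rating range below are assumptions made for illustration. In the exam, use the criteria given in the data description.

```python
import pandas as pd

# Assumed criteria for illustration; take the real ones from the data description.
EXPECTED_PLACE_TYPES = {"Coffee Shop", "Cafe", "Espresso Bar", "Others"}
RATING_RANGE = (0.0, 5.0)

def check_categories(col: pd.Series, allowed: set) -> set:
    """Return the non-missing values in col that are not in the allowed set."""
    return set(col.dropna().unique()) - set(allowed)

def check_range(col: pd.Series, low: float, high: float) -> int:
    """Return how many non-missing values fall outside [low, high]."""
    values = col.dropna()
    return int(((values < low) | (values > high)).sum())

# Hypothetical sample with one misspelled category and one out-of-range rating.
sample = pd.DataFrame({
    "place_type": ["Cafe", "Espresso Bar", "Espresson Bar", None],
    "rating": [4.5, 5.0, 3.9, 5.6],
})

unexpected = check_categories(sample["place_type"], EXPECTED_PLACE_TYPES)
out_of_range = check_range(sample["rating"], *RATING_RANGE)
print("Unexpected place types:", unexpected)
print("Ratings out of range:", out_of_range)
```

An empty set and a count of zero would mean the column matches its description. Anything else points to a value that needs fixing or explaining in your write-up.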