Vivendo is a fast food chain in Brazil with over 200 outlets. As with many fast food establishments, customers make claims against the company. For example, they blame Vivendo for suspected food poisoning.
The legal team, who processes these claims, is currently split across four locations. The new head of the legal department wants to see if there are differences in the time it takes to close claims across the locations.
The legal team has given you a data set where each row is a claim made against the company. They would like you to answer the following questions:
- How does the number of claims differ across locations?
- What is the distribution of time to close claims?
- How does the average time to close claims differ by location?
The dataset contains one row for each claim.
The dataset needs to be validated based on the description below:
|Column Name|Criteria|
|---|---|
|Claim ID|Character, the unique identifier of the claim.|
|Time to Close|Numeric, number of days it took for the claim to be closed.|
|Claim Amount|Numeric, initial claim value in the currency of Brazil. For example, “R$50,000.00” should be converted into 50000.|
|Amount Paid|Numeric, total amount paid after the claim closed in the currency of Brazil.|
|Location|Character, location of the claim, one of “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”.|
|Individuals on Claim|Numeric, number of individuals on this claim.|
|Linked Cases|Binary, whether this claim is believed to be linked with other cases, either TRUE or FALSE.|
|Cause|Character, the cause of the food poisoning injuries, one of ‘vegetable’, ‘meat’, or ‘unknown’. Replace any empty rows with ‘unknown’.|
Describe the validation tasks you performed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.
The original data has 98 rows and 8 columns. To begin my data validation, I went through each column to find any missing or incorrect values based on the data description. I found 1 negative value in the Time to Close column and removed that row, leaving 97 rows of data. I also converted the Claim Amount column to numeric, matching the format of the Amount Paid column, by removing the "R$", the "," separators, and the trailing ".00". I then replaced all null values in the Cause column with 'unknown' and changed 'vegetables' to 'vegetable'.
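The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis. The dictionary keys mirror the column labels in the data description and are assumptions about the actual file, the helper names are hypothetical, and the sample rows are made up.

```python
import re

def parse_claim_amount(raw):
    """Convert a string such as 'R$50,000.00' to the number 50000.0."""
    cleaned = re.sub(r"[^\d.]", "", raw)  # drop 'R$', commas, and spaces
    return float(cleaned)

def clean_cause(raw):
    """Fill missing causes with 'unknown' and normalise 'vegetables'."""
    value = (raw or "").strip().lower()
    if value == "":
        return "unknown"
    if value == "vegetables":
        return "vegetable"
    return value

def clean_claims(rows):
    """Return cleaned copies of the rows, dropping negative close times."""
    cleaned = []
    for row in rows:
        if row["Time to Close"] < 0:
            continue  # a negative duration is treated as invalid
        new_row = dict(row)
        new_row["Claim Amount"] = parse_claim_amount(row["Claim Amount"])
        new_row["Cause"] = clean_cause(row["Cause"])
        cleaned.append(new_row)
    return cleaned

# Illustrative rows only, not taken from the real data set.
sample = [
    {"Time to Close": 120, "Claim Amount": "R$50,000.00", "Cause": None},
    {"Time to Close": -5, "Claim Amount": "R$1,200.00", "Cause": "meat"},
    {"Time to Close": 300, "Claim Amount": "R$7,500.00", "Cause": "vegetables"},
]
cleaned_sample = clean_claims(sample)
```

Working on copies of each row keeps the raw data intact, which makes it easier to report how many rows were removed and why.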
Looking at the remaining columns:
- All of the Claim IDs have 25 characters
- The Time to Close days are all within range
- The Claim Amount and Amount Paid are both as expected
- There were 4 unique Locations as expected
- There were 6 instances of Individuals on Claim having a value of 0, which suggests a data entry error; this should be confirmed with the team providing the data
- There are 2 options for Linked Cases - true/false, as expected
- There are 3 options for Cause: ‘vegetable’, ‘meat’, or ‘unknown’.
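The column-by-column checks listed above could be expressed as a small validation function. This is a sketch under assumptions: the keys follow the column labels in the data description, the allowed values come from that description, and `validate_row` is a hypothetical helper rather than part of any library.

```python
VALID_LOCATIONS = {"RECIFE", "SAO LUIS", "FORTALEZA", "NATAL"}
VALID_CAUSES = {"vegetable", "meat", "unknown"}

def validate_row(row):
    """Return a list of human-readable problems found in one claim row."""
    problems = []
    if row["Time to Close"] < 0:
        problems.append("Time to Close is negative")
    if row["Location"] not in VALID_LOCATIONS:
        problems.append("Location is not one of the four expected values")
    if row["Individuals on Claim"] < 1:
        problems.append("Individuals on Claim is below 1")
    if not isinstance(row["Linked Cases"], bool):
        problems.append("Linked Cases is not TRUE/FALSE")
    if row["Cause"] not in VALID_CAUSES:
        problems.append("Cause is not an expected category")
    return problems

# Illustrative rows only, not taken from the real data set.
good_row = {
    "Time to Close": 250, "Location": "NATAL",
    "Individuals on Claim": 2, "Linked Cases": False, "Cause": "meat",
}
suspect_row = {
    "Time to Close": 400, "Location": "RECIFE",
    "Individuals on Claim": 0, "Linked Cases": True, "Cause": "unknown",
}
good_problems = validate_row(good_row)
suspect_problems = validate_row(suspect_row)
```

Rows with zero individuals are flagged rather than dropped here, which matches the decision to confirm them with the team providing the data.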
Data Discovery and Visualization
Describe what you found in the analysis and how the visualizations answer the customer questions in the project brief. In your description you should:
- Include at least two different data visualizations to demonstrate the characteristics of single variables
- Include at least one data visualization to demonstrate the relationship between two or more variables
- Describe how your analysis has answered the business questions in the project brief
How does the number of claims differ across locations?
Of the 4 locations, Sao Luis has the most claims at 29 and Natal has the fewest at 21. There is not a large discrepancy in the number of claims per location, so this should only marginally affect the Time to Close per location.
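A count of claims per location can be produced with a simple tally. The sketch below uses Python's `collections.Counter`; the `Location` key is an assumption based on the data description, and the rows are invented for illustration, so the counts do not match the figures reported above.

```python
from collections import Counter

def claims_per_location(rows):
    """Tally how many claims were made at each location."""
    return Counter(row["Location"] for row in rows)

# Illustrative rows only, not taken from the real data set.
sample = [
    {"Location": "SAO LUIS"},
    {"Location": "NATAL"},
    {"Location": "SAO LUIS"},
    {"Location": "RECIFE"},
    {"Location": "SAO LUIS"},
    {"Location": "FORTALEZA"},
]
counts = claims_per_location(sample)

# Print locations from most to fewest claims.
for location, n in counts.most_common():
    print(f"{location}: {n}")
```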
What is the distribution of time to close claims?
We can see that the majority of claims are closed within 1500 days, while some outliers take more than 2000 days.
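The shape of such a distribution can be inspected by grouping the close times into fixed-width bins, which is what a histogram does. The following is a standard-library sketch with a hypothetical helper `histogram_bins`, an assumed bin width of 500 days, and made-up values.

```python
from collections import Counter

def histogram_bins(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))

# Illustrative close times in days, not taken from the real data set.
sample_days = [40, 310, 480, 495, 950, 1200, 2300]
bin_width = 500
bins = histogram_bins(sample_days, bin_width)

# Text rendering: one '#' per claim in the bin.
for lower, count in bins.items():
    print(f"{lower:>5}-{lower + bin_width - 1:<5} {'#' * count}")
```

Empty bins are simply absent from the result, so a gap between populated bins points to isolated long-running claims.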
How does the average time to close claims differ by location?
We will need to combine the previous two pieces of information to see how the Time to Close varies by location. So far it is hard to conclude if there is a difference in the time it takes to close claims across the locations.
We can see Natal is the most consistent, with the smallest range for time to close claims. Fortaleza and Recife are comparable, with similar interquartile ranges. Sao Luis, however, has the largest interquartile range and the longest whiskers. We can see that many of the outliers from the previous graph belong to Sao Luis, which also has a higher median than the rest.
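The quantities a box plot displays (quartiles, median, interquartile range) can also be computed directly for each location, alongside the mean. This is a sketch using Python's `statistics` module with the "inclusive" quantile method, which may differ slightly from the method a plotting library uses. The keys and the helper name `summarise_by_location` are assumptions, and the rows are invented.

```python
from collections import defaultdict
from statistics import mean, quantiles

def summarise_by_location(rows):
    """Compute box-plot style statistics of Time to Close per location."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Location"]].append(row["Time to Close"])

    summary = {}
    for location, days in grouped.items():
        q1, median, q3 = quantiles(days, n=4, method="inclusive")
        summary[location] = {
            "min": min(days),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(days),
            "mean": mean(days),
            "iqr": q3 - q1,  # spread of the middle half of the claims
        }
    return summary

# Illustrative rows only, not taken from the real data set.
sample = (
    [{"Location": "NATAL", "Time to Close": d} for d in (10, 20, 30, 40, 50)]
    + [{"Location": "SAO LUIS", "Time to Close": d} for d in (100, 200, 300, 400, 1000)]
)
summary = summarise_by_location(sample)
```

Comparing `iqr` and `max - min` across locations gives a numeric counterpart to the visual comparison of box heights and whisker lengths.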
Based on this information, the new head of the legal department should look into the Sao Luis location to find out why its time to close ranges anywhere from 29 to 3591 days. They should look into why the Sao Luis time to close is so inconsistent and possibly try to model that location after Natal, the most consistent location. Further analysis should be done to see if the number of individuals on a claim or the claim amount affects the time to close. If Sao Luis is working with the most individuals on claims or with large claim amounts, this could be why they take longer.
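One possible starting point for that further analysis is a Pearson correlation between the number of individuals on a claim (or the claim amount) and the time to close. The sketch below implements the standard formula by hand; it was not part of the original analysis, `pearson_r` is a hypothetical helper, and the values are invented to show the mechanics only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Illustrative values only, not taken from the real data set.
individuals = [1, 2, 3, 4]
days_to_close = [100, 200, 300, 400]
r = pearson_r(individuals, days_to_close)
```

A value near 1 or -1 would suggest a strong linear relationship worth investigating, while a value near 0 would suggest the delay has other causes. Correlation alone would not establish that larger claims cause longer close times.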