Exploratory Data Analysis on KHOJ using Python
In this notebook, you will analyze a dataset named KHOJ (Know Your High Court Judges). It contains information about the judges of all High Courts in India from 1993 to 2021.
💾 The Data
The dataset has 1708 rows and 44 columns, but you don't need all of them. The table below describes the columns used in this analysis.
| Column | Description |
|---|---|
| Name of the Judge | Name of the judge |
| Gender | Gender of the judge |
| Date of Birth | The date on which the judge was born |
| Date of Appointment | The date on which the person was elevated as a Judge of any High Court (appointment as an Additional Judge is also considered here) |
| Date of Retirement | The date on which the person demits office as a Judge of a High Court or of the Supreme Court (if elevated to it) |
| If appointed Chief Justice in any High Court | Categorical column specifying whether the judge was appointed Chief Justice |
| If appointed to the Supreme Court | Categorical column specifying whether the judge was appointed to the Supreme Court |
| Foreign Degree in Law | Whether the judge has a foreign degree in Law |
| Post-Graduate in Law | Whether the judge has a PG degree in Law |
| Post-Graduate in another subject | Whether the judge has a PG degree in another subject |
| Graduation Specialization | The subject the judge specialized in during graduation |
| file Title | Name of the court |
For a description of every column in this data, check this link. If you want to analyze the data of a specific court, check this website; the dataset used here covers the judges of all courts.
What to do🤔?
You now know what the data contains. The next step is to ask yourself what you want to learn from it. Below are the questions I want this data to answer.
- What is the average age of judges when they are appointed to a High Court?
- What is the average retirement age of High Court judges?
- What is the average length of service as a judge?
- What is the ratio of male to female judges?
- What are the educational qualifications of the judges? This has four subparts.
  - How many of them have a postgraduate degree in Law?
  - How many of them have a foreign degree in Law?
  - Which subject did they choose as their graduation specialization?
  - How many of them have a postgraduate degree in a subject other than Law?
- What is the judge's designation? This has two subparts.
  - How many judges per state have been promoted to Chief Justice of any High Court?
  - How many judges per state have been promoted to the Supreme Court?
The questions you have in mind may not be on this list; feel free to add them. With the data and the questions settled, you can move on to the analysis.
🧹 Analyzing and Cleaning the Data
Importing the necessary libraries
As we are not doing any data visualization in Python here, we are not going to import a visualization library. For our task, pandas and NumPy are sufficient. Let's import them.
import pandas as pd
import numpy as np
Quick look at the data
Now, let's see what our data looks like. As the file is in CSV format, we load it into pandas using the read_csv() function.
judge_data = pd.read_csv("khoj-1.8.csv")
judge_data.head()
At first glance, we can notice that:
- There are three date columns: Date of Birth, Date of Appointment and Date of Retirement. Surprisingly, pandas reads them as object, the dtype it uses for strings and mixed values, so they are not yet treated as dates.
- Some columns contain the values Not Available and Not Applicable. Here Not Available denotes a missing value, and Not Applicable means that the column does not apply to that specific judge.
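To get a feel for how widespread these two placeholder strings are, you can count them per column. The sketch below runs on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the KHOJ file); the same comparison should work on `judge_data`.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Date of Birth": ["01/02/1956", "Not Available", "15/08/1960"],
    "Foreign Degree in Law": ["No", "Not Applicable", "Not Available"],
})

# Compare every cell against each placeholder and sum the matches per column.
placeholder_counts = pd.DataFrame({
    "Not Available": (toy == "Not Available").sum(),
    "Not Applicable": (toy == "Not Applicable").sum(),
})
print(placeholder_counts)
```

Keeping the two counts separate matters because only Not Available is a true missing value; Not Applicable carries information of its own.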
date_cols = ['Date of Birth', 'Date of Appointment', 'Date of Retirement']
judge_data[date_cols].info()
You don't need the full date from these three columns. As you are only interested in the age of the judge, the year alone is sufficient. But before extracting it, you have to replace the Not Available value with np.nan. Otherwise, pd.to_datetime() will raise an error when it meets that string.
for col in date_cols:
    # Replace the "Not Available" placeholder with np.nan
    judge_data[col] = judge_data[col].replace("Not Available", np.nan)
    # Convert the column to pandas datetime values
    judge_data[col] = pd.to_datetime(judge_data[col])
    # Keep only the year (missing dates stay NaN)
    judge_data[col] = judge_data[col].dt.year
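As an alternative to replacing the placeholder by hand, `pd.to_datetime` accepts `errors="coerce"`, which turns anything it cannot parse into `NaT`. The sketch below uses a made-up series (`raw` and its ISO-formatted dates are assumptions; the real file may use a different date format). It is not the approach used in the loop above, only a variant worth knowing.

```python
import pandas as pd

# Made-up values; the real columns may use another date format.
raw = pd.Series(["1956-03-12", "Not Available", "1961-11-30"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# Extracting the year gives floats because the missing entry becomes NaN.
years = parsed.dt.year

# The nullable integer dtype keeps whole-number years alongside missing values.
years_int = years.astype("Int64")
print(years_int)
```

The trade-off is that `errors="coerce"` also hides genuine typos in the dates, so the explicit replacement is the safer choice when you want unexpected values to surface.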
After this operation, these columns contain only the year, so it no longer makes sense to call them date columns. It's time to rename them.
# New names for the former date columns
date_rename = ['Year of Birth', 'Year of Appointment', 'Year of Retirement']
# Pair each old name with its new name and build a dictionary from the pairs
rename_dict = dict(zip(date_cols, date_rename))
# Rename the columns using that dictionary
judge_data.rename(columns=rename_dict, inplace=True)
Let's see some summary statistics of these year columns.
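A numeric summary of this kind is what `DataFrame.describe()` returns by default. Here is a minimal sketch on made-up year values (the `toy_years` frame is an assumption, not the KHOJ data):

```python
import numpy as np
import pandas as pd

# Toy year columns with a few missing values, invented for illustration.
toy_years = pd.DataFrame({
    "Year of Birth": [1956, 1958, np.nan, 1956],
    "Year of Appointment": [2004, 2006, 2016, 2016],
    "Year of Retirement": [2018, 2020, np.nan, 2018],
})

# For numeric columns, describe() reports count, mean, std, min, quartiles and max.
summary = toy_years.describe()
print(summary)
```

Statistics such as the mean or standard deviation of a calendar year are hard to interpret on their own, which is why the numeric view is of limited use here.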
The summary above doesn't tell you much, does it? What happens if we convert those columns to object and look at their summary statistics again?
# Select the non-object columns (the year columns) as an independent copy
date_columns = judge_data.select_dtypes(exclude='object').copy()
for col in date_columns.columns:
    # Convert the column to object so describe() reports count, unique, top and freq
    date_columns[col] = date_columns[col].astype('object')
# See the summary statistics
date_columns.describe(include='O')
The top row shows the most frequent value in each column: the most common year of birth is 1956, the most common year of appointment is 2016, and the most common year of retirement is 2018. That's all we get from these three columns.
Now it's time to remove the unnecessary columns. For this, we have to look at the summary statistics of the categorical columns.
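The `top` value reported by `describe()` for object columns is the most frequent value (the mode), and `freq` is how often it occurs. To see how dominant that value really is, `value_counts()` gives the full picture. A small sketch on made-up years (the `years` series is an assumption for illustration):

```python
import pandas as pd

# Invented birth years; 1956 appears three times out of six.
years = pd.Series([1956, 1956, 1957, 1960, 1956, 1957], name="Year of Birth")

# Frequency of each year, most frequent first.
counts = years.value_counts()

# The most frequent year and the share of rows it accounts for.
most_common_year = counts.idxmax()
share = counts.max() / len(years)
print(most_common_year, share)
```

A most common year can still cover only a small share of the judges, so it is worth checking the share before drawing conclusions from it.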