Analyzing Crime in LA
🌇🚔 Background
Los Angeles, California 😎. The City of Angels. Tinseltown. The Entertainment Capital of the World! Known for its warm weather, palm trees, sprawling coastline, and Hollywood, along with producing some of the most iconic films and songs!
However, as with any highly populated city, it isn't always glamorous, and there can be a large volume of crime. That's where you can help!
You have been asked to support the Los Angeles Police Department (LAPD) by analyzing their crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.
You are free to use any methodologies that you like in order to produce your insights.
The Data
They have provided you with a single dataset to use. A summary and preview are provided below.
The data is publicly available here.
👮♀️ crimes.csv
Column | Description |
---|---|
'DR_NO' | Division of Records Number: Official file number made up of a 2 digit year, area ID, and 5 digits. |
'Date Rptd' | Date reported - MM/DD/YYYY. |
'DATE OCC' | Date of occurrence - MM/DD/YYYY. |
'TIME OCC' | In 24 hour military time. |
'AREA' | The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21. |
'AREA NAME' | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for. For example, the 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles. |
'Rpt Dist No' | A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at http://geohub.lacity.org/datasets/c4f83909b81d4786aa8ba8a74ab |
'Crm Cd' | Crime code for the offence committed. |
'Crm Cd Desc' | Definition of the crime. |
'Vict Age' | Victim Age (years) |
'Vict Sex' | Victim's sex: F : Female, M : Male, X : Unknown. |
'Vict Descent' | Victim's descent. |
'Premis Cd' | Code for the type of structure, vehicle, or location where the crime took place. |
'Premis Desc' | Definition of the 'Premis Cd' . |
'Weapon Used Cd' | The type of weapon used in the crime. |
'Weapon Desc' | Description of the weapon used (if applicable). |
'Status Desc' | Crime status. |
'Crm Cd 1' | Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious. |
'Crm Cd 2' | May contain a code for an additional crime, less serious than Crime Code 1. |
'Crm Cd 3' | May contain a code for an additional crime, less serious than Crime Code 1. |
'Crm Cd 4' | May contain a code for an additional crime, less serious than Crime Code 1. |
'LOCATION' | Street address of the crime. |
'Cross Street' | Cross Street of rounded Address |
'LAT' | Latitude of the crime location. |
'LON' | Longitude of the crime location. |
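
The 'DR_NO' description above suggests the number can be split into its parts. A minimal sketch, assuming a 9-digit layout of 2-digit year, 2-digit area ID, and 5 trailing digits; that exact layout, the helper name `split_dr_no`, and the sample value are assumptions made for illustration and are not confirmed by the dataset documentation.

```python
def split_dr_no(dr_no):
    """Split a DR_NO into (year, area_id, sequence).

    Assumes a 9-digit layout YYAANNNNN; this layout is an assumption
    inferred from the column description, not a documented format.
    """
    text = str(int(dr_no)).zfill(9)
    if len(text) != 9:
        raise ValueError(f"Expected at most 9 digits, got {dr_no!r}")
    return int(text[:2]), int(text[2:4]), int(text[4:])

# Invented example value, for illustration only
print(split_dr_no(221412410))
```

If the assumption holds, the area ID recovered this way should match the 'AREA' column, which makes for a quick consistency check.
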
1. Data Preprocessing:
Start by importing the dataset and checking for missing values and data quality issues. Convert date and time columns into appropriate data types for analysis. Explore the dataset's basic statistics, such as crime distribution by type, area, and time.
import pandas as pd
import matplotlib.pyplot as plt
crimes = pd.read_csv("data/crimes.csv")
crime_data = crimes.copy()
crime_data.info()
# Check for missing values
missing_values = crime_data.isnull().sum()
print("Missing Values:\n", missing_values)
# Convert date columns to datetime data type
crime_data['Date Rptd'] = pd.to_datetime(crime_data['Date Rptd'])
crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
# Feature Engineering
crime_data['Year'] = crime_data['DATE OCC'].dt.year
crime_data['Month'] = crime_data['DATE OCC'].dt.month
crime_data['DayOfWeek'] = crime_data['DATE OCC'].dt.dayofweek # 0 = Monday, 6 = Sunday
crime_data
# Create a function to convert military time (e.g. 930) to an "HH:MM" string
def convert_military_to_standard(military_time):
    if pd.notna(military_time):  # Check for non-null values
        military_time = str(military_time).zfill(4)  # Ensure it's four digits long
        hours = military_time[:2]
        minutes = military_time[2:]
        return f"{hours}:{minutes}"
    return None
columns_to_fill = ['Vict Sex', 'Vict Descent', 'Premis Cd', 'Premis Desc', 'Weapon Used Cd', 'Weapon Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'Cross Street']
for column in columns_to_fill:
    if column in crime_data.columns:
        # Fill missing crime codes with 0 and other missing values with 'UNKNOWN'
        fill_value = 0 if column.startswith('Crm Cd') else 'UNKNOWN'
        crime_data[column] = crime_data[column].fillna(fill_value)
# Replace line breaks in addresses with spaces, then fill missing addresses
crime_data['LOCATION'] = crime_data['LOCATION'].str.replace('\n', ' ').str.replace('\r', ' ')
crime_data['LOCATION'] = crime_data['LOCATION'].fillna('Unknown')
# Loop through DataFrame columns to change data types and reduce memory use
for col in crime_data.columns:
    dtype = crime_data[col].dtype
    # Downcast integer columns to the smallest integer type that holds their values
    # (a fixed int16 would overflow large identifiers such as 'DR_NO')
    if pd.api.types.is_integer_dtype(dtype):
        crime_data[col] = pd.to_numeric(crime_data[col], downcast='integer')
    # Convert float columns to float32 (float16 would lose precision in 'LAT' and 'LON')
    elif pd.api.types.is_float_dtype(dtype):
        crime_data[col] = crime_data[col].astype('float32')
    # Leave datetime columns unchanged
    elif pd.api.types.is_datetime64_any_dtype(dtype):
        continue
    # Convert remaining columns to standard categories
    else:
        crime_data[col] = crime_data[col].astype('category')
# Apply the conversion function to the 'TIME OCC' column (produces "HH:MM" strings)
crime_data['TIME OCC'] = crime_data['TIME OCC'].apply(convert_military_to_standard)
# Parse the "HH:MM" strings so the hour and minute can be extracted
time_occ = pd.to_datetime(crime_data['TIME OCC'], format='%H:%M', errors='coerce')
# Extract hour and minute from 'TIME OCC'
crime_data['HourOfDay'] = time_occ.dt.hour
crime_data['MinuteOfHour'] = time_occ.dt.minute
# Store 'TIME OCC' as "HH:MM:SS" text
crime_data['TIME OCC'] = time_occ.dt.strftime('%H:%M:%S')
# Downcast any integer columns created since the first pass (e.g. 'HourOfDay')
for col in crime_data.columns:
    if pd.api.types.is_integer_dtype(crime_data[col].dtype):
        crime_data[col] = pd.to_numeric(crime_data[col], downcast='integer')
# Inspect the column types and preview the preprocessed data
crime_data.info()
crime_data
# Keep 'TIME OCC' as "HH:MM" text and reduce the date columns to plain dates
crime_data['TIME OCC'] = pd.to_datetime(crime_data['TIME OCC'], format='%H:%M:%S', errors='coerce').dt.strftime('%H:%M')
# The date columns were already parsed above, so no explicit format is needed here
crime_data['Date Rptd'] = pd.to_datetime(crime_data['Date Rptd'], errors='coerce').dt.date
crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'], errors='coerce').dt.date
crime_data
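As an alternative to the row-by-row string conversion used above, the hour and minute can be derived arithmetically. This is a minimal sketch, assuming 'TIME OCC' holds integers such as 5 or 2359; the helper name `split_military_time` is hypothetical and the sample values are invented for illustration.

```python
import pandas as pd

def split_military_time(times: pd.Series) -> pd.DataFrame:
    """Return hour and minute columns for integer HHMM values (hypothetical helper)."""
    # Non-numeric entries become NaN instead of raising an error
    numeric = pd.to_numeric(times, errors="coerce")
    hour = numeric // 100
    minute = numeric % 100
    # Treat out-of-range values (e.g. 2460) as missing
    valid = hour.between(0, 23) & minute.between(0, 59)
    return pd.DataFrame({
        "HourOfDay": hour.where(valid),
        "MinuteOfHour": minute.where(valid),
    })

sample = pd.Series([5, 1110, 2359, 2460])  # invented values
parts = split_military_time(sample)
print(parts)
```

Integer arithmetic avoids building a string per row, which may matter on a dataset of this size, and the range check surfaces malformed times as missing values.
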
''' # Basic statistics of the dataset
basic_stats = crime_data.describe()
print("Basic Statistics:\n", basic_stats)
# Crime distribution by type
crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()
print("Crime Distribution by Type:\n", crime_type_distribution)
# Crime distribution by area
crime_area_distribution = crime_data['AREA NAME'].value_counts()
print("Crime Distribution by Area:\n", crime_area_distribution)
# Convert 'DATE OCC' to datetime and extract year and month for time analysis
crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
crime_data['Year'] = crime_data['DATE OCC'].dt.year
crime_data['Month'] = crime_data['DATE OCC'].dt.month
# Crime distribution by year
crime_year_distribution = crime_data['Year'].value_counts().sort_index()
print("Crime Distribution by Year:\n", crime_year_distribution)
# Crime distribution by month
crime_month_distribution = crime_data['Month'].value_counts().sort_index()
print("Crime Distribution by Month:\n", crime_month_distribution)
# Visualize crime distribution by year
plt.figure(figsize=(10, 6))
plt.bar(crime_year_distribution.index, crime_year_distribution.values)
plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Year")
plt.show()
# Visualize crime distribution by month
plt.figure(figsize=(10, 6))
plt.plot(crime_month_distribution.index, crime_month_distribution.values, marker='o')
plt.xlabel("Month")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Month")
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()
'''
2. Exploratory Data Analysis (EDA):
Visualize the distribution of crimes on a map using latitude and longitude data. Examine the frequency of crimes by area, crime type, and victim characteristics (age, sex, descent). Create histograms, bar charts, and heatmaps to gain insights from the data.
import seaborn as sns
import folium
'''
# Create a Folium map
crime_map = folium.Map(location=[34.0522, -118.2437], zoom_start=10)
# Plot crime locations on the map
for index, row in crime_data.iterrows():
    folium.CircleMarker(
        location=[row['LAT'], row['LON']],
        radius=3,
        color='red',
        fill=True,
        fill_color='red'
    ).add_to(crime_map)
# Save the map to an HTML file
crime_map.save('crime_map.html')
# Crime distribution by area (Bar chart)
crime_area_distribution = crime_data['AREA NAME'].value_counts()
plt.figure(figsize=(10, 6))
sns.barplot(x=crime_area_distribution.index, y=crime_area_distribution.values, palette='viridis')
plt.xlabel("Area")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Area")
plt.xticks(rotation=90)
plt.show()
# Crime distribution by type (Bar chart)
crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()[:10] # Display the top 10 crime types
plt.figure(figsize=(10, 6))
sns.barplot(x=crime_type_distribution.values, y=crime_type_distribution.index, palette='viridis')
plt.xlabel("Number of Crimes")
plt.ylabel("Crime Type")
plt.title("Top 10 Crime Types")
plt.show()
# Crime distribution by victim age (Histogram)
plt.figure(figsize=(10, 6))
sns.histplot(crime_data['Vict Age'], kde=True, color='purple')
plt.xlabel("Victim Age")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Victim Age")
plt.show()
# Crime distribution by victim sex (Bar chart)
crime_sex_distribution = crime_data['Vict Sex'].value_counts()
plt.figure(figsize=(6, 4))
sns.barplot(x=crime_sex_distribution.index, y=crime_sex_distribution.values, palette='viridis')
plt.xlabel("Victim Sex")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Victim Sex")
plt.show()
# Crime distribution by victim descent (Bar chart)
crime_descent_distribution = crime_data['Vict Descent'].value_counts()
plt.figure(figsize=(10, 6))
sns.barplot(x=crime_descent_distribution.index, y=crime_descent_distribution.values, palette='viridis')
plt.xlabel("Victim Descent")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Victim Descent")
plt.xticks(rotation=90)
plt.show()
'''
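The EDA description mentions heatmaps, so here is one possible sketch: a table of record counts by day of week and hour, which can then be drawn with seaborn. It assumes the 'DayOfWeek' and 'HourOfDay' columns named in the preprocessing step; the function names and the toy records are hypothetical and invented for illustration.

```python
import pandas as pd

def crimes_by_day_and_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Count records per (day of week, hour); rows are days 0-6, columns are hours 0-23."""
    counts = pd.crosstab(df["DayOfWeek"], df["HourOfDay"])
    # Reindex so days or hours with no records still appear as zeros
    return counts.reindex(index=range(7), columns=range(24), fill_value=0)

def plot_heatmap(counts: pd.DataFrame) -> None:
    """Draw the count table as a heatmap (plotting libraries are imported only when called)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    plt.figure(figsize=(12, 4))
    sns.heatmap(counts, cmap="viridis")
    plt.xlabel("Hour of Day")
    plt.ylabel("Day of Week (0 = Monday)")
    plt.title("Crimes by Day of Week and Hour")
    plt.show()

# Invented records, for illustration only
toy = pd.DataFrame({
    "DayOfWeek": [0, 0, 4, 5, 5, 5],
    "HourOfDay": [12, 12, 23, 1, 1, 2],
})
heat = crimes_by_day_and_hour(toy)
print(heat.loc[5, 1])
```

Calling `plot_heatmap(crimes_by_day_and_hour(crime_data))` would draw the same view for the full dataset, provided both columns exist.
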
3. Data Interpretation:
Summarize your findings from EDA. For instance, you might discover that certain areas have a higher crime rate or that specific types of crimes are more prevalent. Observe any trends or patterns in the data, such as if certain crimes occur more frequently at specific times.
'''# Summarize overall statistics
crime_stats = crime_data.describe()
# Crime Distribution by Area
crime_area_distribution = crime_data['AREA NAME'].value_counts()
highest_crime_area = crime_area_distribution.idxmax()
# Most Prevalent Crime Types
crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()
top_crime_types = crime_type_distribution.head(5)
# Time Trends
crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
crime_data['Year'] = crime_data['DATE OCC'].dt.year
yearly_crime_trends = crime_data['Year'].value_counts().sort_index()
# Seasonal Patterns
monthly_crime_trends = crime_data['DATE OCC'].dt.month.value_counts().sort_index()
# Time of Day Analysis (count crimes per hour, using the 'HourOfDay' column from preprocessing)
hourly_crime_distribution = crime_data['HourOfDay'].value_counts().sort_index()
# Print and visualize findings
print("Overall Statistics:")
print(crime_stats)
print("\nArea with Highest Crime Rate:", highest_crime_area)
print("\nTop 5 Crime Types:")
print(top_crime_types)
print("\nYearly Crime Trends:")
print(yearly_crime_trends)
print("\nMonthly Crime Trends:")
print(monthly_crime_trends)
print("\nHourly Crime Distribution:")
print(hourly_crime_distribution)
# Plot visualizations
plt.figure(figsize=(12, 6))
sns.barplot(x=hourly_crime_distribution.index, y=hourly_crime_distribution.values, palette='viridis')
plt.xlabel("Hour of the Day")
plt.ylabel("Number of Crimes")
plt.title("Hourly Crime Distribution")
plt.show()
'''
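To check whether crimes cluster at specific times in specific areas, one option is to find the busiest hour for each area. A minimal sketch, assuming 'AREA NAME' and 'HourOfDay' columns as described earlier; the helper `peak_hour_by_area` is hypothetical and the records are invented for illustration.

```python
import pandas as pd

def peak_hour_by_area(df: pd.DataFrame) -> pd.DataFrame:
    """Return the hour with the most records for each area, with its count."""
    counts = (
        df.groupby(["AREA NAME", "HourOfDay"], observed=True)
        .size()
        .rename("Crimes")
        .reset_index()
    )
    # Sort so the busiest hour per area comes first (ties go to the earlier hour)
    counts = counts.sort_values(
        ["AREA NAME", "Crimes", "HourOfDay"], ascending=[True, False, True]
    )
    # Keep only the first (busiest) row for each area
    return counts.drop_duplicates("AREA NAME").set_index("AREA NAME")

# Invented records, for illustration only
toy = pd.DataFrame({
    "AREA NAME": ["Central", "Central", "Central", "Hollywood", "Hollywood"],
    "HourOfDay": [12, 12, 18, 22, 22],
})
peaks = peak_hour_by_area(toy)
print(peaks)
```

A table like this is one way to turn the hourly distribution into a per-area summary that could inform when resources are deployed.
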
4. Machine Learning Models:
You can build machine learning models to predict criminal activity. For example, time series analysis can predict when crimes may occur based on historical data, and classification can predict the type of crime or the area where a crime is likely to happen. Use features from the dataset such as time, location, victim characteristics, and premises code for modeling.
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
# Time Series Analysis: Predict When Crimes May Occur
# (hour and minute features already exist as 'HourOfDay' and 'MinuteOfHour' from preprocessing)
# Filter data for time series analysis
time_series_data = crime_data[['DATE OCC', 'Crm Cd']].copy()
time_series_data['DATE OCC'] = pd.to_datetime(time_series_data['DATE OCC'])  # Convert 'DATE OCC' column to datetime
time_series_data.set_index('DATE OCC', inplace=True)
# Resample to daily frequency, giving one crime count per day
time_series_data = time_series_data['Crm Cd'].resample('D').count()
# Fit an additive Holt-Winters model with a weekly (7-day) seasonal period
model = ExponentialSmoothing(time_series_data, seasonal='add', seasonal_periods=7, use_boxcox=False)
model_fit = model.fit(optimized=True)
forecast = model_fit.forecast(steps=30)  # Forecast for the next 30 days
# Print the forecast
print("Time Series Analysis - Next 30 Days Forecast:")
print(forecast)
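A forecast is easier to judge next to a simple baseline. The sketch below shows a seasonal naive baseline (repeat the last week) scored with mean absolute error on a held-out week; it is an illustration on invented daily counts, not part of the original analysis, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def seasonal_naive_forecast(history: pd.Series, steps: int, period: int = 7) -> np.ndarray:
    """Repeat the last `period` observations to forecast `steps` values ahead."""
    last_cycle = history.to_numpy()[-period:]
    repeats = int(np.ceil(steps / period))
    return np.tile(last_cycle, repeats)[:steps]

def mae(actual, predicted) -> float:
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

# Invented daily counts with a repeating weekly shape
dates = pd.date_range("2023-01-01", periods=28, freq="D")
counts = pd.Series([10, 12, 11, 13, 15, 20, 18] * 4, index=dates)

# Hold out the final week and forecast it from the earlier weeks
train, test = counts[:-7], counts[-7:]
baseline = seasonal_naive_forecast(train, steps=7)
baseline_error = mae(test, baseline)
print(baseline_error)
```

If a fitted model cannot beat this baseline on held-out days, its forecasts probably add little beyond the weekly pattern.
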
# Classification: Predict Crime Type
# Select features and target variable for classification
features = ['Year', 'Month', 'DayOfWeek', 'HourOfDay', 'MinuteOfHour', 'LAT', 'LON', 'Vict Age', 'Vict Sex', 'Vict Descent', 'Premis Cd']
target = 'Crm Cd Desc'
# Keep only the selected columns and drop rows with missing values
model_data = crime_data[features + [target]].dropna()
# Encode categorical variables
X = pd.get_dummies(model_data[features], columns=['Vict Sex', 'Vict Descent', 'Premis Cd'])
X = X.astype('float32')  # Ensure every feature column is numeric
y = model_data[target].astype(str)
# Adjust train-test split parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
# Check if there's enough data in the training set
if len(X_train) == 0:
    raise ValueError("Adjust train-test split parameters to ensure a non-empty training set.")
# Classification Model: Random Forest
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
# Model Evaluation
classification_accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)
# Print results
print("\nClassification Model - Accuracy:")
print(f"Accuracy: {classification_accuracy:.2f}")
print("\nClassification Report:")
print(classification_report_str)
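With many crime types of very different frequency, accuracy alone can look better than it is. One way to put the number in context is a majority-class baseline: the accuracy obtained by always predicting the most common training label. A minimal sketch on invented labels; the helper name `majority_class_accuracy` is hypothetical.

```python
import pandas as pd

def majority_class_accuracy(y_train: pd.Series, y_test: pd.Series) -> float:
    """Accuracy of always predicting the most frequent label seen in training."""
    most_common = y_train.value_counts().idxmax()
    return float((y_test == most_common).mean())

# Invented labels, for illustration only
toy_train = pd.Series(["THEFT", "THEFT", "THEFT", "BURGLARY", "ASSAULT"])
toy_test = pd.Series(["THEFT", "BURGLARY", "THEFT", "ASSAULT"])
baseline_accuracy = majority_class_accuracy(toy_train, toy_test)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Comparing the classifier's accuracy against `majority_class_accuracy(y_train, y_test)` indicates how much the features contribute beyond the class frequencies.
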
5. Recommendations and Insights:
Offer recommendations for public safety or law enforcement based on your analysis. For example, you could suggest increased police presence in areas with higher crime rates. Provide insights into the demographics of crime victims, which may inform prevention strategies.
6. Data Visualization:
Visualize your model's predictions to make them more understandable and actionable. Consider creating interactive dashboards or maps to make your results accessible.
7. Report and Documentation:
Prepare a formal report summarizing your analysis, methodologies, findings, and recommendations. Include visualizations, charts, and code snippets as necessary.
8. Further Analysis:
Depending on your interests, you can explore more advanced techniques, such as natural language processing (NLP) to analyze crime descriptions for additional insights.
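As a very simple first step toward text analysis of crime descriptions, word frequencies can be counted with the standard library. This sketch assumes short upper-case descriptions like those in 'Crm Cd Desc'; the sample strings are invented and the helper `top_terms` is hypothetical.

```python
import re
from collections import Counter

def top_terms(descriptions, n=3):
    """Return the n most frequent lowercase words across the descriptions."""
    counter = Counter()
    for text in descriptions:
        # Keep alphabetic tokens only, ignoring punctuation and digits
        counter.update(re.findall(r"[a-z]+", str(text).lower()))
    return counter.most_common(n)

# Invented descriptions, for illustration only
sample = ["VEHICLE - STOLEN", "BURGLARY FROM VEHICLE", "THEFT OF IDENTITY", "BURGLARY"]
terms = top_terms(sample, n=2)
print(terms)
```

Passing `crime_data['Crm Cd Desc']` to the same function would give a rough picture of recurring terms before trying heavier NLP tooling.
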