Competition - Analyzing Crimes in LA

    Analyzing Crime in LA

    🌇🚔 Background

    Los Angeles, California 😎. The City of Angels. Tinseltown. The Entertainment Capital of the World! Known for its warm weather, palm trees, sprawling coastline, and Hollywood, along with producing some of the most iconic films and songs!

    However, as with any highly populated city, it isn't always glamorous and there can be a large volume of crime. That's where you can help!

    You have been asked to support the Los Angeles Police Department (LAPD) by analyzing their crime data to identify patterns in criminal behavior. They plan to use your insights to allocate resources effectively to tackle various crimes in different areas.

    You are free to use any methodologies that you like in order to produce your insights.

    The Data

    They have provided you with a single dataset to use. A summary and preview are provided below.

    The data is publicly available here.

    👮‍♀️ crimes.csv

    Column - Description
    'DR_NO' - Division of Records Number: an official file number made up of a two-digit year, the area ID, and five digits.
    'Date Rptd' - Date reported (MM/DD/YYYY).
    'DATE OCC' - Date of occurrence (MM/DD/YYYY).
    'TIME OCC' - Time of occurrence in 24-hour military time.
    'AREA' - The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
    'AREA NAME' - The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community they are responsible for. For example, the 77th Street Division is located at the intersection of South Broadway and 77th Street, serving neighborhoods in South Los Angeles.
    'Rpt Dist No' - A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" in which they occurred for statistical comparisons. Find LAPD Reporting Districts on the LA City GeoHub at http://geohub.lacity.org/datasets/c4f83909b81d4786aa8ba8a74ab
    'Crm Cd' - Crime code for the offence committed.
    'Crm Cd Desc' - Definition of the crime.
    'Vict Age' - Victim's age in years.
    'Vict Sex' - Victim's sex: F = Female, M = Male, X = Unknown.
    'Vict Descent' - Victim's descent (a small decoding sketch follows this table):
    • A - Other Asian
    • B - Black
    • C - Chinese
    • D - Cambodian
    • F - Filipino
    • G - Guamanian
    • H - Hispanic/Latin/Mexican
    • I - American Indian/Alaskan Native
    • J - Japanese
    • K - Korean
    • L - Laotian
    • O - Other
    • P - Pacific Islander
    • S - Samoan
    • U - Hawaiian
    • V - Vietnamese
    • W - White
    • X - Unknown
    • Z - Asian Indian
    'Premis Cd' - Code for the type of structure, vehicle, or location where the crime took place.
    'Premis Desc' - Definition of the 'Premis Cd'.
    'Weapon Used Cd' - Code for the type of weapon used in the crime.
    'Weapon Desc' - Description of the weapon used (if applicable).
    'Status Desc' - Crime status.
    'Crm Cd 1' - Indicates the crime committed. Crime Code 1 is the primary and most serious offense; Crime Codes 2, 3, and 4 are respectively less serious offenses. Lower crime class numbers are more serious.
    'Crm Cd 2' - May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 3' - May contain a code for an additional crime, less serious than Crime Code 1.
    'Crm Cd 4' - May contain a code for an additional crime, less serious than Crime Code 1.
    'LOCATION' - Street address of the crime.
    'Cross Street' - Cross street of the rounded address.
    'LAT' - Latitude of the crime location.
    'LON' - Longitude of the crime location.
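
    Because the one-letter descent codes above are hard to read in plots and summaries, the short sketch below shows one way to decode them. It is a minimal illustration: the 'Vict Descent Label' column name is invented here, and the "data/crimes.csv" path simply mirrors the preprocessing cell further down.

    import pandas as pd

    # Human-readable labels for the one-letter 'Vict Descent' codes listed above
    descent_labels = {
        'A': 'Other Asian', 'B': 'Black', 'C': 'Chinese', 'D': 'Cambodian',
        'F': 'Filipino', 'G': 'Guamanian', 'H': 'Hispanic/Latin/Mexican',
        'I': 'American Indian/Alaskan Native', 'J': 'Japanese', 'K': 'Korean',
        'L': 'Laotian', 'O': 'Other', 'P': 'Pacific Islander', 'S': 'Samoan',
        'U': 'Hawaiian', 'V': 'Vietnamese', 'W': 'White', 'X': 'Unknown',
        'Z': 'Asian Indian'
    }

    crimes = pd.read_csv("data/crimes.csv")
    # Map the codes to a new, more readable column for later plots and summaries
    crimes['Vict Descent Label'] = crimes['Vict Descent'].map(descent_labels).fillna('Unknown')
    print(crimes['Vict Descent Label'].value_counts())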

    1. Data Preprocessing:

    Start by importing the dataset and checking for missing values and data quality issues. Convert date and time columns into appropriate data types for analysis. Explore the dataset's basic statistics, such as crime distribution by type, area, and time.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    crimes = pd.read_csv("data/crimes.csv")
    crime_data = crimes.copy()
    crime_data.info()
    # Check for missing values
    missing_values = crime_data.isnull().sum()
    print("Missing Values:\n", missing_values)
    
    
    # Convert date columns to datetime data type
    crime_data['Date Rptd'] = pd.to_datetime(crime_data['Date Rptd'])
    crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
    
    # Feature Engineering
    crime_data['Year'] = crime_data['DATE OCC'].dt.year
    crime_data['Month'] = crime_data['DATE OCC'].dt.month
    crime_data['DayOfWeek'] = crime_data['DATE OCC'].dt.dayofweek  # 0 = Monday, 6 = Sunday
    
    
    crime_data
    
    
    # Create a function to convert military time to standard time
    def convert_military_to_standard(military_time):
        if pd.notna(military_time):  # Check for non-null values
            military_time = str(military_time).zfill(4)  # Ensure it's four digits long
            hours = military_time[:2]
            minutes = military_time[2:]
            return f"{hours}:{minutes}"
        return None
    
    columns_to_fill = ['Vict Sex', 'Vict Descent', 'Premis Cd', 'Premis Desc', 'Weapon Used Cd', 'Weapon Desc', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'Cross Street']
    for column in columns_to_fill:
        if column in crime_data.columns:
            # Use 0 for the secondary crime-code columns and a sentinel label for the rest
            fill_value = 0 if column.startswith('Crm Cd') else 'UNKNOWN'
            crime_data[column] = crime_data[column].fillna(fill_value)

    # Normalize whitespace in the street address and fill missing values
    crime_data['LOCATION'] = crime_data['LOCATION'].str.replace('\n', ' ').str.replace('\r', ' ')
    crime_data['LOCATION'] = crime_data['LOCATION'].fillna('Unknown')
    
    # Loop through DataFrame columns to reduce memory usage
    for col in crime_data:
        # Downcast integer columns to int32 (int16 would overflow nine-digit values such as 'DR_NO')
        if pd.api.types.is_integer_dtype(crime_data[col]):
            crime_data[col] = crime_data[col].astype('int32')
        # Downcast float columns to float32 (float16 loses too much precision for 'LAT'/'LON')
        elif pd.api.types.is_float_dtype(crime_data[col]):
            crime_data[col] = crime_data[col].astype('float32')
        # Leave datetime columns unchanged
        elif pd.api.types.is_datetime64_any_dtype(crime_data[col]):
            continue
        # Convert the remaining object columns to categories
        else:
            crime_data[col] = crime_data[col].astype('category')
            
    # Apply the conversion function to the 'TIME OCC' column (e.g. 2130 -> "21:30")
    crime_data['TIME OCC'] = crime_data['TIME OCC'].apply(convert_military_to_standard)

    # Parse the "HH:MM" strings once, then extract the hour and minute
    time_occ_parsed = pd.to_datetime(crime_data['TIME OCC'], format='%H:%M', errors='coerce')
    crime_data['HourOfDay'] = time_occ_parsed.dt.hour
    crime_data['MinuteOfHour'] = time_occ_parsed.dt.minute
    
    # Downcast the newly created integer columns to save memory
    for col in ['Year', 'Month', 'DayOfWeek', 'HourOfDay', 'MinuteOfHour']:
        if pd.api.types.is_integer_dtype(crime_data[col]):
            crime_data[col] = crime_data[col].astype('int16')
    
    # Display the first few rows of the preprocessed data
    crime_data.info()
    crime_data
    # 'TIME OCC' is already stored as "HH:MM" strings; keep just the date part of the datetime columns
    crime_data['Date Rptd'] = pd.to_datetime(crime_data['Date Rptd']).dt.date
    crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC']).dt.date
    crime_data
    ''' # Basic statistics of the dataset
    basic_stats = crime_data.describe()
    print("Basic Statistics:\n", basic_stats)
    
    # Crime distribution by type
    crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()
    print("Crime Distribution by Type:\n", crime_type_distribution)
    
    # Crime distribution by area
    crime_area_distribution = crime_data['AREA NAME'].value_counts()
    print("Crime Distribution by Area:\n", crime_area_distribution)
    
    # Convert 'DATE OCC' to datetime and extract year and month for time analysis
    crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
    crime_data['Year'] = crime_data['DATE OCC'].dt.year
    crime_data['Month'] = crime_data['DATE OCC'].dt.month
    
    # Crime distribution by year
    crime_year_distribution = crime_data['Year'].value_counts().sort_index()
    print("Crime Distribution by Year:\n", crime_year_distribution)
    
    # Crime distribution by month
    crime_month_distribution = crime_data['Month'].value_counts().sort_index()
    print("Crime Distribution by Month:\n", crime_month_distribution)
    
    # Visualize crime distribution by year
    plt.figure(figsize=(10, 6))
    plt.bar(crime_year_distribution.index, crime_year_distribution.values)
    plt.xlabel("Year")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Year")
    plt.show()
    
    # Visualize crime distribution by month
    plt.figure(figsize=(10, 6))
    plt.plot(crime_month_distribution.index, crime_month_distribution.values, marker='o')
    plt.xlabel("Month")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Month")
    plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.show()
    
    '''

    2. Exploratory Data Analysis (EDA):

    Visualize the distribution of crimes on a map using latitude and longitude data. Examine the frequency of crimes by area, crime type, and victim characteristics (age, sex, descent). Create histograms, bar charts, and heatmaps to gain insights from the data.

    import seaborn as sns
    import folium
    '''
    # Create a Folium map
    crime_map = folium.Map(location=[34.0522, -118.2437], zoom_start=10)
    
    # Plot crime locations on the map
    for index, row in crime_data.iterrows():
        folium.CircleMarker(
            location=[row['LAT'], row['LON']],
            radius=3,
            color='red',
            fill=True,
            fill_color='red'
        ).add_to(crime_map)
    
    # Save the map to an HTML file
    crime_map.save('crime_map.html')
    
    # Crime distribution by area (Bar chart)
    crime_area_distribution = crime_data['AREA NAME'].value_counts()
    plt.figure(figsize=(10, 6))
    sns.barplot(x=crime_area_distribution.index, y=crime_area_distribution.values, palette='viridis')
    plt.xlabel("Area")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Area")
    plt.xticks(rotation=90)
    plt.show()
    
    # Crime distribution by type (Bar chart)
    crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()[:10]  # Display the top 10 crime types
    plt.figure(figsize=(10, 6))
    sns.barplot(x=crime_type_distribution.values, y=crime_type_distribution.index, palette='viridis')
    plt.xlabel("Number of Crimes")
    plt.ylabel("Crime Type")
    plt.title("Top 10 Crime Types")
    plt.show()
    
    # Crime distribution by victim age (Histogram)
    plt.figure(figsize=(10, 6))
    sns.histplot(crime_data['Vict Age'], kde=True, color='purple')
    plt.xlabel("Victim Age")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Victim Age")
    plt.show()
    
    # Crime distribution by victim sex (Bar chart)
    crime_sex_distribution = crime_data['Vict Sex'].value_counts()
    plt.figure(figsize=(6, 4))
    sns.barplot(x=crime_sex_distribution.index, y=crime_sex_distribution.values, palette='viridis')
    plt.xlabel("Victim Sex")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Victim Sex")
    plt.show()
    
    # Crime distribution by victim descent (Bar chart)
    crime_descent_distribution = crime_data['Vict Descent'].value_counts()
    plt.figure(figsize=(10, 6))
    sns.barplot(x=crime_descent_distribution.index, y=crime_descent_distribution.values, palette='viridis')
    plt.xlabel("Victim Descent")
    plt.ylabel("Number of Crimes")
    plt.title("Crime Distribution by Victim Descent")
    plt.xticks(rotation=90)
    plt.show()
    
    '''

    3. Data Interpretation:

    Summarize your findings from EDA. For instance, you might discover that certain areas have a higher crime rate or that specific types of crimes are more prevalent. Observe any trends or patterns in the data, such as if certain crimes occur more frequently at specific times.

    '''# Summarize overall statistics
    crime_stats = crime_data.describe()
    
    # Crime Distribution by Area
    crime_area_distribution = crime_data['AREA NAME'].value_counts()
    highest_crime_area = crime_area_distribution.idxmax()
    
    # Most Prevalent Crime Types
    crime_type_distribution = crime_data['Crm Cd Desc'].value_counts()
    top_crime_types = crime_type_distribution.head(5)
    
    # Time Trends
    crime_data['DATE OCC'] = pd.to_datetime(crime_data['DATE OCC'])
    crime_data['Year'] = crime_data['DATE OCC'].dt.year
    yearly_crime_trends = crime_data['Year'].value_counts().sort_index()
    
    # Seasonal Patterns
    monthly_crime_trends = crime_data['DATE OCC'].dt.month.value_counts().sort_index()
    
    # Time of Day Analysis
    hourly_crime_distribution = crime_data['HourOfDay'].value_counts().sort_index()
    
    # Print and visualize findings
    print("Overall Statistics:")
    print(crime_stats)
    print("\nArea with Highest Crime Rate:", highest_crime_area)
    print("\nTop 5 Crime Types:")
    print(top_crime_types)
    print("\nYearly Crime Trends:")
    print(yearly_crime_trends)
    print("\nMonthly Crime Trends:")
    print(monthly_crime_trends)
    print("\nHourly Crime Distribution:")
    print(hourly_crime_distribution)
    
    # Plot visualizations
    plt.figure(figsize=(12, 6))
    sns.barplot(x=hourly_crime_distribution.index, y=hourly_crime_distribution.values, palette='viridis')
    plt.xlabel("Hour of the Day")
    plt.ylabel("Number of Crimes")
    plt.title("Hourly Crime Distribution")
    plt.show()
    
    '''

    4. Machine Learning Models:

    You can build machine learning models to predict criminal activities. For example: Time Series Analysis: Predict when crimes may occur based on historical data. Classification: Predict the type of crime or the area where a crime is likely to happen. Utilize features from the dataset such as time, location, victim characteristics, and premises code for modeling.

    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.ensemble import RandomForestClassifier
    
    # Hour and minute features ('HourOfDay', 'MinuteOfHour') were already created during
    # preprocessing, so they can be reused directly by the classifier further below.
    
    # Time Series Analysis: Predict When Crimes May Occur
    # Filter data for time series analysis
    time_series_data = crime_data[['DATE OCC', 'Crm Cd']].copy()
    time_series_data['DATE OCC'] = pd.to_datetime(time_series_data['DATE OCC'])  # Convert 'DATE OCC' column to datetime
    time_series_data.set_index('DATE OCC', inplace=True)
    time_series_data = time_series_data.resample('D').count()  # Resample to daily frequency
    
    
    # Fit a Holt-Winters Exponential Smoothing model with additive weekly (7-day) seasonality
    model = ExponentialSmoothing(time_series_data['Crm Cd'], seasonal='add', seasonal_periods=7, use_boxcox=False)
    model_fit = model.fit(optimized=True)
    forecast = model_fit.forecast(steps=30)  # Forecast for the next 30 days
    
    # Print the forecast
    print("Time Series Analysis - Next 30 Days Forecast:")
    print(forecast)
    
    
    # Select features and target variable for classification
    features = ['Year', 'Month', 'DayOfWeek', 'HourOfDay', 'MinuteOfHour', 'LAT', 'LON', 'Vict Age', 'Vict Sex', 'Vict Descent', 'Premis Cd']
    target = 'Crm Cd Desc'

    # Keep only the selected columns and one-hot encode the categorical features
    model_data = crime_data[features + [target]].copy()
    model_data = pd.get_dummies(model_data, columns=['Vict Sex', 'Vict Descent', 'Premis Cd'])

    X = model_data.drop(target, axis=1)
    y = model_data[target]
    
    # Adjust train-test split parameters
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    
    # Check if there's enough data in the training set
    if len(X_train) == 0:
        raise ValueError("Adjust train-test split parameters to ensure a non-empty training set.")
    
    # Classification Model: Random Forest
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    
    # Model Evaluation
    classification_accuracy = accuracy_score(y_test, y_pred)
    classification_report_str = classification_report(y_test, y_pred)
    
    # Print results
    print("\nClassification Model - Accuracy:")
    print(f"Accuracy: {classification_accuracy:.2f}")
    print("\nClassification Report:")
    print(classification_report_str)

    5. Recommendations and Insights:

    Offer recommendations for public safety or law enforcement based on your analysis. For example, you could suggest increased police presence in areas with higher crime rates. Provide insights into the demographics of crime victims, which may inform prevention strategies, as in the sketch below.
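
    A minimal sketch of how these recommendations might be backed by numbers, assuming the crime_data frame from the preprocessing step is still in memory; the focus on the five busiest areas, median victim age, and victim sex shares is purely illustrative.

    # Rank areas by crime volume to support resource-allocation recommendations
    area_counts = crime_data['AREA NAME'].value_counts()
    print("Five busiest areas:\n", area_counts.head(5))

    # Victim demographics in those areas, to inform prevention strategies
    top_areas = area_counts.head(5).index
    busiest = crime_data[crime_data['AREA NAME'].isin(top_areas)]
    print("Median victim age in the busiest areas:\n",
          busiest.groupby('AREA NAME', observed=True)['Vict Age'].median())
    print("Victim sex breakdown in the busiest areas:\n",
          busiest['Vict Sex'].value_counts(normalize=True).round(2))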

    6. Data Visualization:

    Visualize your model's predictions to make them more understandable and actionable. Consider creating interactive dashboards or maps to make your results accessible, as in the sketch below.
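
    A minimal sketch, assuming the time_series_data, forecast, and crime_data objects from the modelling cells above are still in memory; the 90-day window, the 20,000-point sample, and the 'crime_heatmap.html' file name are illustrative choices rather than requirements.

    import matplotlib.pyplot as plt
    import folium
    from folium.plugins import HeatMap

    # Plot the last 90 days of observed daily counts next to the 30-day Holt-Winters forecast
    recent = time_series_data.iloc[-90:]
    plt.figure(figsize=(12, 6))
    plt.plot(recent.index, recent['Crm Cd'], label='Observed daily crimes')
    plt.plot(forecast.index, forecast.values, linestyle='--', label='30-day forecast')
    plt.xlabel("Date")
    plt.ylabel("Number of Crimes")
    plt.title("Observed vs Forecast Daily Crime Counts")
    plt.legend()
    plt.show()

    # Interactive heat map of crime locations (drop missing or zero coordinates first)
    coords = crime_data[['LAT', 'LON']].dropna()
    coords = coords[(coords['LAT'] != 0) & (coords['LON'] != 0)]
    sample = coords.sample(min(len(coords), 20000), random_state=42)
    heat_map = folium.Map(location=[34.0522, -118.2437], zoom_start=10)
    HeatMap(sample.values.tolist(), radius=8).add_to(heat_map)
    heat_map.save('crime_heatmap.html')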

    7. Report and Documentation:

    Prepare a formal report summarizing your analysis, methodologies, findings, and recommendations. Include visualizations, charts, and code snippets as necessary.

    8. Further Analysis:

    Depending on your interests, you can explore more advanced techniques, such as natural language processing (NLP) to analyze crime descriptions for additional insights.
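
    For instance, a minimal sketch of the NLP idea using scikit-learn is shown below. It assumes crime_data from the preprocessing step is available, and the choices of TF-IDF features, bigrams, and eight clusters are arbitrary starting points rather than recommendations.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Vectorize the distinct free-text crime descriptions with TF-IDF (unigrams and bigrams)
    descriptions = crime_data['Crm Cd Desc'].astype(str).unique()
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(descriptions)

    # Cluster the descriptions into a handful of broad themes
    n_clusters = 8
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(tfidf_matrix)

    # Print the terms that best characterise each cluster
    terms = vectorizer.get_feature_names_out()
    order = kmeans.cluster_centers_.argsort()[:, ::-1]
    for cluster in range(n_clusters):
        top_terms = [terms[i] for i in order[cluster, :5]]
        print(f"Cluster {cluster}: {', '.join(top_terms)}")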