Workspace
Ilya Selivanov/

Competition - The best plants for bees

Which plants are better for bees: native or non-native?

📖 Background

You work for the local government environment agency and have taken on a project about creating pollinator bee-friendly spaces. You can use both native and non-native plants to create these spaces and therefore need to ensure that you use the correct plants to optimize the environment for these bees.

The team has collected data on native and non-native plants and their effects on pollinator bees. Your task will be to analyze this data and provide recommendations on which plants create an optimized environment for pollinator bees.

💪 Challenge

Provide your agency with a report that covers the following:

  • Which plants are preferred by native vs non-native bee species?
  • A visualization of the distribution of bee and plant species across one of the samples.
  • Select the top three plant species you would recommend to the agency to support native bees.

Preprocessing

In this step, we preprocess the data so that the later analysis can use it.

Imports

We will need the following modules in this project.

import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas as pd
import numpy as np

Loading the data

data = pd.read_csv("data/plants_and_bees.csv")
display(data)

Checking for missing values

prep = data.copy()
prep.isna().sum()

The specialized_on and status columns have too many missing values. We drop status, but keep specialized_on because we need it for the first challenge.

prep = prep.drop('status', axis=1)

For specialized_on, we replace every missing value with the string 'None', on the assumption that a missing entry means the bee does not specialize on any plant.

prep['specialized_on'] = prep['specialized_on'].fillna(value='None')
data['specialized_on'] = data['specialized_on'].fillna(value='None')

For the remaining columns, we drop the rows that contain missing values, since only a few rows are affected.

prep = prep.dropna(axis=0)
prep.isna().sum()
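
The three steps above (drop a mostly empty column, fill a column where missing has a meaning, drop the few remaining incomplete rows) can be sketched on a toy frame. This is only an illustration under stated assumptions: the column names mirror the ones mentioned in this notebook, but the values are made up and are not taken from plants_and_bees.csv.

```python
import pandas as pd

# Toy stand-in for the real data; all values here are invented for illustration.
toy = pd.DataFrame({
    "plant_species": ["A", "B", None, "D"],
    "specialized_on": [None, "Penstemon", None, None],
    "status": [None, None, None, "vulnerable"],
})

# Share of missing values per column (between 0.0 and 1.0).
missing_share = toy.isna().mean()
print(missing_share)

cleaned = (
    toy.drop(columns="status")               # mostly empty and not needed later
       .fillna({"specialized_on": "None"})   # missing is read as "no specialization"
       .dropna()                             # drop the few rows still incomplete
)
print(cleaned)
```

Looking at the share of missing values rather than the raw count makes it easier to judge whether dropping rows would throw away too much data.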

Turning categorical columns into numerical ones

Columns with only two unique values are turned into binary columns.

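to_binary = []
less_10 = []

# Iterate over prep (not data) so the dropped `status` column is not re-added.
for col in prep.columns:
    # nunique() ignores NaN, whereas set() may count NaN values separately.
    n_unique = data[col].nunique()

    if n_unique == 2:
        print(f'The `{col}` column has 2 values which are: {set(data[col].dropna())}')

        # Category codes for two values are 0 and 1 (assigned in sorted order).
        prep[col] = data[col].astype('category').cat.codes

        to_binary.append(col)

    elif n_unique <= 10:
        less_10.append(f'The `{col}` column has 10 or fewer values which are: {set(data[col].dropna())}\n\n')

print()

print('Before:')
print(data[to_binary].head())

print()

print('After:')
print(prep[to_binary].head())
print(*less_10, sep='')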

Looking at the columns above, we can see that they can be categorically encoded (except for date, which can be converted to a datetime object).

cols = ['site', 'nesting']

for col in cols:
    prep[col] = data[col].astype('category').cat.codes

# Pick a few random rows so the same rows are shown before and after encoding.
shuffle_rows = prep.sample(frac=1).head().index

print('Before:')
print(data.loc[shuffle_rows, cols])

print()

print('After:')
print(prep.loc[shuffle_rows, cols])
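
To make the encoding above easier to read back, here is a minimal sketch of how `.astype('category').cat.codes` assigns its integers. The labels are made up for illustration and are not the actual values of the nesting column. Categories are sorted, so string labels get codes in alphabetical order, and a missing value is encoded as -1.

```python
import pandas as pd

# Made-up labels for illustration; the real column holds other values.
nesting = pd.Series(["ground", "wood", None, "ground", "hive"])

as_category = nesting.astype("category")
codes = as_category.cat.codes

# Mapping from each integer code back to its original label.
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping a mapping like `code_to_label` (a hypothetical helper name) is one way to translate the codes back into labels when reading plots or tables later on.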

Checking types of columns

We are going to check whether the columns' dtypes are correct, because sometimes numerical columns are interpreted as non-numerical ones.
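
As a minimal sketch of such a check, the toy frame below stores numbers and dates as strings, which is how they can arrive from a CSV file. The column name `bees_num`, the values, and the date format are assumptions made for this illustration and may not match the real data.

```python
import pandas as pd

# Toy frame with invented values; both columns start out as strings.
toy = pd.DataFrame({
    "bees_num": ["3", "12", "7"],                        # hypothetical column name
    "date": ["04/18/2017", "05/02/2017", "06/11/2017"],  # assumed month/day/year format
})

print(toy.dtypes)  # neither column is numeric or datetime yet

# Convert each column to the dtype that matches its content.
toy["bees_num"] = pd.to_numeric(toy["bees_num"])
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

print(toy.dtypes)
```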