Buzzing Discoveries: Exploring the Plant Preference of Bees

    This notebook was created for a DataCamp Community Competition. The analysis was carried out in Python, and the visualisations were built with Plotly, which provides interactive graphs (including hover-over information and a zoom function for more detail). Code is hidden in the publication for readability, but can be accessed in the notebook. As a beginning beekeeper myself, I was very excited to work with this dataset!

    📖 Background

    We have taken on a project about creating pollinator bee-friendly spaces for a local government environment agency. Bee-friendly spaces can be created using both native and non-native plants and therefore we need to ensure that the correct plants are used to optimize the environment for these bees.

    The team has collected data on native and non-native plants and their effects on pollinator bees. Our task is to analyze this data and provide insights and recommendations on which plants create an optimized environment for pollinator bees.

    🔎 Analysis Objectives

    Exploratory Analysis

    We will perform an exploratory analysis of the available variables, providing some general insights and context for the study and for the research questions listed below.

    Research questions

    We will provide the environment agency with answers to the following questions:

    • Which plants are preferred by native vs non-native bee species?
    • Which top three plant species would we recommend to the agency to support native bees/non-native bees?
    • What does the distribution of bee and plant species look like across one of the samples? (Answered with a visualization.)

    💾 Data Validation

    Original Data

    The description of the data is shown below, accompanied by the original DataFrame.

    You have assembled information on the plants and bees research in a file called plants_and_bees.csv. Each row represents a sample that was taken from a patch of land where the plant species were being studied.

    Column          Description
    sample_id       The ID number of the sample taken.
    species_num     The number of different bee species in the sample.
    date            Date the sample was taken (format: MM-DD-YY).
    season          Season during sample collection ("early.season" or "late.season").
    site            Name of collection site.
    native_or_non   Whether the sample was from a native or non-native plant.
    sampling        The sampling method.
    plant_species   The name of the plant species the sample was taken from. None indicates the sample was taken from the air.
    time            The time the sample was taken.
    bee_species     The bee species in the sample.
    sex             The gender of the bee species.
    specialized_on  The plant genus the bee species preferred.
    parasitic       Whether or not the bee is parasitic (0:no, 1:yes).
    nesting         The bee's nesting method.
    status          The status of the bee species.
    nonnative_bee   Whether the bee species is native or not (0:no, 1:yes).

    Data is courtesy of Dryad - Source (data has been modified)

    Data Exploration

    We will import the raw data and appropriate packages for analysis.

    Hidden code
    Hidden code
    raw_data.select_dtypes(include=['int']).describe()
    Hidden output
    raw_data.select_dtypes(include='float').describe()
    Hidden output
    raw_data.select_dtypes(include='object').describe()
    Hidden output
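
    As a minimal sketch of the import step and the per-dtype summaries above, the snippet below reads a small in-memory CSV in place of plants_and_bees.csv. The three rows are invented purely for illustration (only the column names come from the description table), and splitting the summary into numeric and non-numeric columns is a simplification of the three describe() calls, not the notebook's hidden code.

```python
import io

import pandas as pd

# ASSUMPTION: a made-up three-row excerpt standing in for plants_and_bees.csv.
# Only the column names are taken from the description table; the values are illustrative.
sample_csv = io.StringIO(
    "sample_id,species_num,season,plant_species,time,parasitic\n"
    "17400,23,early.season,Daucus carota,935,0.0\n"
    "17400,23,early.season,Daucus carota,935,1.0\n"
    "17401,5,late.season,Trifolium repens,1030,\n"
)

# With the real file this would read: raw_data = pd.read_csv("plants_and_bees.csv")
raw_data = pd.read_csv(sample_csv)

# Summary statistics for numeric columns (count, mean, std, min, quartiles, max)
numeric_summary = raw_data.select_dtypes(include="number").describe()

# Summary statistics for text columns (count, unique, top, freq)
text_summary = raw_data.select_dtypes(exclude="number").describe()

print(raw_data.dtypes)
print(numeric_summary)
print(text_summary)
```

    The empty 'parasitic' entry in the last row is read as a missing value, which is why that column comes back as a float rather than an integer column.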

    Findings & Actions needed

    The following summaries discuss, for each column:

    • Whether the values match the description given in the provided table,
    • Whether there were missing values and how they were represented,
    • Whether there were other inconsistencies that need fixing,
    • Which actions are needed to make the values match the description provided.
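
    The checks listed above could be sketched roughly as follows. The helper name summarize_columns and the toy DataFrame are hypothetical and only illustrate the idea; they are not the notebook's hidden code.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, missing percentage and unique count per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
            "unique": df.nunique(),  # missing values are not counted as a unique value
        }
    )


# ASSUMPTION: a made-up four-row frame; the values are illustrative only.
toy = pd.DataFrame(
    {
        "sample_id": [1, 1, 2, 2],
        "season": ["early.season", "early.season", "late.season", "late.season"],
        "nesting": ["ground", None, "wood", "ground"],
    }
)

summary = summarize_columns(toy)

# Number of rows whose sample_id already appeared in an earlier row
duplicated_ids = int(toy["sample_id"].duplicated().sum())

print(summary)
print("rows with a repeated sample_id:", duplicated_ids)
```

    Comparing the 'dtype' and 'unique' columns of such a summary against the description table is a quick way to spot columns that need a type or notation change.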

    sample_id

    • Duplicates are present but will not be dropped: a sample can contain multiple identical observations, because entries are not uniquely identified by sample_id.

    species_num

    • This column does not have any missing values and does not need alterations.

    date

    • The column can be used in its current form and will not be altered.

    season

    • This column does not have any missing values,
    • The original notation style will be altered, removing '.season' from the entries.

    site

    • This column does not have any missing values and does not need any alterations.

    native_or_non

    • The name of the column is changed to 'native_plant' for clarity during analysis,
    • There are no missing values,
    • Original notation for the findings - values (0:no, 1:yes) - is changed to Yes and No.

    sampling

    • This column does not have any missing values and does not need alterations.

    plant_species

    • This column does not have any missing values and does not need alterations.

    time

    • This column has the wrong data type and notation; both will be adjusted.

    bee_species

    • This column does not have any missing values and does not need alterations.

    sex

    • This column does not have any missing values and does not need alterations.

    specialized_on

    • This column contains 1243 missing values, meaning about 99.44% of the values are missing.
    • This variable doesn't play a crucial role in answering the research question provided, and will not be taken into account during the analysis.

    parasitic

    • This column contains 63 missing values, which will be replaced by 'Not Specified' for clarity during analysis.

    nesting

    • This column contains 54 missing values, which will be replaced by 'Not Specified' for clarity during analysis.

    status

    • This column contains 1235 missing values, meaning about 98.8% of the values are missing.
    • This variable doesn't play a crucial role in answering the research question provided, and will not be taken into account during the analysis.

    nonnative_bee

    • The name of the column is changed to 'native_bee' for clarity during analysis,
    • There are 61 missing values; these are replaced by 'Not Specified',
    • Original notation for the findings - values (0:no, 1:yes) - is changed to 'non-native' and 'native'.
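
    The missing-value percentages quoted in the summaries above can be reproduced with simple arithmetic. The sketch below assumes the dataset has 1250 rows; that total is not stated in the text but is inferred from the reported figures, since 1243/1250 = 99.44% and 1235/1250 = 98.8%.

```python
# Missing-value counts as reported in the column summaries
missing_counts = {
    "specialized_on": 1243,
    "parasitic": 63,
    "nesting": 54,
    "status": 1235,
    "nonnative_bee": 61,
}

# ASSUMPTION: total row count inferred from the reported percentages, not stated in the text
total_rows = 1250

# Share of missing values per column, in percent, rounded to two decimals
missing_pct = {
    column: round(100 * count / total_rows, 2)
    for column, count in missing_counts.items()
}

for column, pct in missing_pct.items():
    print(f"{column}: {pct}% missing")
```

    Under that assumption, the three columns that are filled with 'Not Specified' each have roughly 4 to 5% missing values, which is a far smaller share than in the two dropped columns.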

    Result of Data Cleaning

    All actions described above will be executed to ensure analysis-ready data. In addition, several variables will be converted to the categorical data type, as they take only a limited set of values and the conversion reduces memory usage.

    from datetime import datetime  # needed for the time conversion; may already be imported in a hidden cell

    # select the data to be analyzed: discard columns 'specialized_on' and 'status'
    data = raw_data.drop(columns=['specialized_on', 'status'])

    # rename columns 'native_or_non' and 'nonnative_bee'
    data = data.rename(columns={'native_or_non': 'native_plant', 'nonnative_bee': 'native_bee'})

    # replace nulls with a description where advised
    # (assigning the result back avoids a chained inplace fillna, which newer pandas versions warn about)
    columns_to_fill = ['parasitic', 'nesting', 'native_bee']
    for column in columns_to_fill:
        data[column] = data[column].fillna("Not Specified")

    # values for 'parasitic': replace 0/1 with "No"/"Yes"
    data['parasitic'] = data['parasitic'].replace({0: "No", 1: "Yes"})

    # values for 'native_bee' (formerly 'nonnative_bee'): replace 0/1 with "non-native"/"native"
    # NOTE: this follows the column description (1 = native); the original column name
    # suggests the opposite coding, so the mapping is worth verifying against the source data
    data['native_bee'] = data['native_bee'].replace({0: "non-native", 1: "native"})

    # change season notation: remove '.season' by splitting at the dot and keeping only the first part
    data['season'] = data['season'].apply(lambda notation: notation.split('.')[0])

    # convert time to "HH:MM"; zero-padding lets values with fewer than four digits parse as HHMM
    data['time'] = data['time'].apply(lambda x: datetime.strptime(str(x).zfill(4), "%H%M").strftime("%H:%M"))
    Hidden code
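
    The categorical conversion mentioned above is not shown in the publication. As a hedged illustration of why it saves memory, the sketch below compares a plain text column with its categorical equivalent. The column name 'season' and the repeated toy values are assumptions chosen for the example; the notebook may convert a different set of columns.

```python
import pandas as pd

# ASSUMPTION: a made-up column with two repeated labels, standing in for a
# low-cardinality variable such as 'season' after cleaning.
seasons = pd.Series(["early", "late"] * 5000, name="season")

# A categorical column stores each distinct label once, plus a small integer code per row
as_category = seasons.astype("category")

text_bytes = seasons.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text column:        {text_bytes} bytes")
print(f"categorical column: {category_bytes} bytes")
print("categories:", list(as_category.cat.categories))
```

    The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for columns with few distinct values.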