Bruno Eduardo Buendía Pérez/

Price Analysis


Price Analysis

In this exercise, we will conduct an analysis of a database encompassing the prices of two assets. Our objective is to extract valuable and insightful information from this dataset.

# We import the CSV file
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

data= pd.read_csv("Data.csv", index_col= 0)

#Observing the initial records of the table

We will change columns names

# Getting the columns names
#Changing columns names

columns_names={'fecha':"Date", 'hora': 'Hour', 'pml1':'PML1', 'fecha.1':'Date_2', 'hora.1':'Hour_2', 'pml2':'PML2'}
data.rename(columns=columns_names, inplace=True)

We will conduct observations to identify anomalies or defects in the data.

# We utilize the 'describe' function to obtain key statistical metrics.


In the last table, it is observed that the maximum values of the hour column are 25, so I will discard the values with hour greater than 24.

data_not_25hrs = data.drop(data[(data['Hour'] > 24) | (data['Hour_2'] > 24)].index)

# It is observed that some values in the price table have negative values, I am unsure if this is a normal behavior, but I have decided to discard those data points since they are price values.

data_clean = data_not_25hrs.drop(data[(data['PML1'] < 0) | (data['PML2'] < 0)].index)


We can observe that, on average, PML1 has a higher price than PML2. However, PML2 exhibits the maximum value (38,160). It would be intriguing to investigate the cause of such a high value or determine if it is indicative of a data error.

Additionally, PML2 demonstrates a higher coefficient of variation (1.58) compared to PML1 (1.03), as it possesses a greater standard deviation and a smaller mean.

We will conduct an analysis to gain further insights into the distribution of asset prices.

# Plot a boxplot of "PML1" and "PML2" price
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 5)) 

ax1.set_title("Boxplot for PML1")

ax2.set_title("Boxplot for PML2")

# Set common labels and title
fig.suptitle("Price distribution", fontsize=15)

# Show the Figure

As observed in the preceding graph, there are several values for PML1 and PML2 that can be considered outliers. We will proceed to count how many there are and determine the percentage they represent.

# Calculate the interquartile range (IQR)
Q1_pml1 = data_clean['PML1'].quantile(0.25)
Q3_pml1 = data_clean['PML1'].quantile(0.75)
IQR_pml1 = Q3_pml1 - Q1_pml1

Q1_pml2 = data_clean['PML2'].quantile(0.25)
Q3_pml2 = data_clean['PML2'].quantile(0.75)
IQR_pml2 = Q3_pml2 - Q1_pml2

# Define thresholds to consider outliers
lower_threshold_pml1 = Q1_pml1 - 1.5 * IQR_pml1
upper_threshold_pml1 = Q3_pml1 + 1.5 * IQR_pml1

lower_threshold_pml2 = Q1_pml2 - 1.5 * IQR_pml2
upper_threshold_pml2 = Q3_pml2 + 1.5 * IQR_pml2

# Identify outliers
outliers_PML1 = data_clean[(data_clean['PML1'] < lower_threshold_pml1) | (data_clean['PML1'] > upper_threshold_pml1)]
outliers_PML2 = data_clean[(data_clean['PML2'] < lower_threshold_pml2) | (data_clean['PML2'] > upper_threshold_pml2)]

# Count outliers and calculate percentage
count_outliers_PML1 = len(outliers_PML1)
count_outliers_PML2 = len(outliers_PML2)

total_values = len(data_clean)
percentage_outliers_PML1 = (count_outliers_PML1 / total_values) * 100
percentage_outliers_PML2 = (count_outliers_PML2 / total_values) * 100
print(f"For PML1 the interquartile range is: {IQR_pml1}")
print(f"Number of outliers in PML1: {count_outliers_PML1}")
print(f"Percentage of outliers in PML1: {percentage_outliers_PML1:.2f}%")

print(f"For PML2 the interquartile range is: {IQR_pml2}")
print(f"Number of outliers in PML2: {count_outliers_PML2}")
print(f"Percentage of outliers in PML2: {percentage_outliers_PML2:.2f}%")

  • AI Chat
  • Code