Data Analysis Project - Thi Thu Ngan Nguyen

Data Analysis Project: EC:6062

Student Name: Thi Thu Ngan Nguyen

Student ID: 23213175

install.packages('mosaic')
install.packages('expss')
install.packages('lmtest')
install.packages('writexl')
install.packages('stargazer')

library(tidyverse)
library(magrittr)
library(stargazer)
library(dplyr)
library(writexl)
library(lmtest)
library(car)
library(readxl)
library(expss)
library(maditr)
library(broom)
library(mosaic)
library(RCurl)
# Source a helper script (robust_summary.R) for robust regression summaries
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir = .GlobalEnv)

list.files()

data <- read_excel("Firm-profits_Data.xlsx")
data = apply_labels(data, 
                    ID = "Unique Identifier",
                    log_profits = "Firms profits in 2024 in Natural Logs",
                    log_training = "Total investments in Training between 2022 and 2023 in Logs",
                    log_equipment = "Total investments in Capital Equipment between 2022 and 2023 in Logs",
                    Enterprise_Group = "Does the firm belong to an Enterprise Group, 1 = Yes",
                    Firm_Age = "Age of the firm in years, from registration to 2024",
                    Export_yes_no = "The firm is an exporting firm, 1 = Yes",
                    Small_Firm = "The firm has fewer than 50 employees",
                    Industrial_sector = "Industrial Sector codes",
                    innovation_yes = "The firm introduced new products or services to market during 2022 and 2023",
                    Employees_log = "Total number of employees in logs in 2024",
                    R_D_yes = "The firm invested in R&D during 2022 and 2023, 1 = Yes"
)

summary(data)

While the data generally look reasonable, a few extreme values could affect the analysis:

- The distribution of log profits is positively skewed: the mean is greater than the median. The maximum is approximately 19.481, substantially higher than both, which suggests outliers or a few extremely profitable firms.
- The mean of log training investment is approximately 0.239, so investment in training is modest on average. The maximum is approximately 13.256, far above the mean and median, which could indicate extreme values or outliers.
- Similarly, the maximum of log equipment investment is approximately 26.873, much higher than the mean and median, which suggests outliers in equipment investment.
- The maximum value recorded for the number of employees is 1 billion, which is implausibly high and most likely reflects a data-entry error rather than a genuine observation.
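One way to put a number on "notable outliers" is Tukey's 1.5 × IQR rule, which is close to the rule R's boxplot() uses by default to decide which points fall beyond the whiskers. The sketch below is only illustrative: the helper name `count_iqr_outliers` and the toy vector are made up here and are not part of the original analysis.

```r
# Count observations lying more than 1.5 * IQR below Q1 or above Q3.
count_iqr_outliers <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  q <- quantile(x, c(0.25, 0.75))         # first and third quartiles
  fence <- 1.5 * (q[2] - q[1])            # 1.5 times the interquartile range
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy vector (invented): nine ordinary values and one extreme value.
toy <- c(2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.7, 2.9, 1e9)
n_flagged <- count_iqr_outliers(toy)
print(n_flagged)

# A positive gap between mean and median is a quick sign of right skew.
skew_gap <- mean(toy) - median(toy)
print(skew_gap > 0)
```

Applied to the project data, the same helper could be called on each of log_profits, log_training, log_equipment and Employees_log in turn, for example `count_iqr_outliers(data$log_profits)`.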

par(mfrow=c(2,2))

# Boxplot for log_profits
boxplot(data$log_profits, main="Boxplot of Log Profits", ylab="Log Profits", col="lightblue")

# Boxplot for log_training
boxplot(data$log_training, main="Boxplot of Log Training Investments", ylab="Log Training", col="lightgreen")

# Boxplot for log_equipment
boxplot(data$log_equipment, main="Boxplot of Log Equipment Investments", ylab="Log Equipment", col="lightcoral")

# Boxplot for Employees_log
boxplot(data$Employees_log, main="Boxplot of Log Employees", ylab="Log Employees", col="lightyellow")

par(mfrow=c(1,1))

summary(data$Employees_log)
boxplot(data$Employees_log, main = "Boxplot of Employees_log")
hist(data$Employees_log, main = "Histogram of Employees_log", xlab = "Employees_log", ylab = "Frequency")
qqnorm(data$Employees_log)
qqline(data$Employees_log)

The observation recording a billion employees is highly unrealistic and is most likely a data-entry error. Given how implausible this figure is, I decided to remove this observation from the analysis.

I kept the other outliers, such as the extremely high profit and investment values, because they are not as obviously wrong. They may represent highly successful firms or large investments, which are informative for the analysis.

# Remove the outlier observation for number of employees
cleaned_data <- data %>%
  filter(Employees_log < 1e9)

summary(cleaned_data$Employees_log)
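Before and after a filter like this one, it can help to look at exactly which rows are dropped. The sketch below uses a small invented data frame (`toy`, not the project data) and base R's subset(), which behaves like dplyr's filter() in one respect worth knowing: rows where the condition evaluates to NA are dropped as well as rows that fail it.

```r
# Invented example: one implausible value and one missing value.
toy <- data.frame(ID = 1:5,
                  Employees_log = c(2.3, 3.1, 1e9, 4.0, NA))

# Rows that trip the threshold used in the analysis.
suspect <- subset(toy, Employees_log >= 1e9)
print(suspect)

# Rows that survive the filter; the NA row is removed too.
kept <- subset(toy, Employees_log < 1e9)

# Number of rows lost, split into flagged and missing.
n_dropped <- nrow(toy) - nrow(kept)
n_missing <- sum(is.na(toy$Employees_log))
print(c(dropped = n_dropped, flagged = nrow(suspect), missing = n_missing))
```

If the number of dropped rows exceeds the number of flagged rows, the difference comes from missing values in the filtered column.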

ggplot(cleaned_data, aes(x = log_training, y = log_profits)) +
  geom_point() +
  labs(x = "Log Training Investments", y = "Log Profits", title = "Scatterplot: Profits vs. Training Investments")

ggplot(cleaned_data, aes(x = log_equipment, y = log_profits)) +
  geom_point() +
  labs(x = "Log Equipment Investments", y = "Log Profits", title = "Scatterplot: Profits vs. Equipment Investments")

The scatterplots show how firms' investments in training and equipment relate to their profits, with all three variables measured in logs.

Scatterplot: Profits vs. Training Investments

- Despite the presence of many zero values in training investments, we observe a generally positive trend between training investments and profits.
- Firms with higher training investments tend to have higher profits, indicating a potential positive relationship between the two variables.
- However, there are clusters of points at zero on the x-axis (training investments), suggesting a substantial number of firms that did not invest in training.

Scatterplot: Profits vs. Equipment Investments
- The trend in the scatterplot for equipment investments appears clearer and sharper compared to the scatterplot for training investments.
- There is a noticeable positive relationship between equipment investments and profits, with firms that have higher equipment investments generally achieving higher profits.
- The scatterplot indicates a stronger and more consistent association between equipment investments and profits compared to training investments.
- Similar to training investments, there are clusters of points at zero on the x-axis (equipment investments), suggesting a substantial number of firms that did not invest in equipment.
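The visual impressions above (many zeros, a positive trend) can be checked numerically. The following is a minimal sketch on simulated data: the data frame `toy`, the 40% share of investing firms and the slope of 0.3 are all invented for illustration and say nothing about the actual dataset.

```r
set.seed(1)
n <- 200

# Simulated stand-in for cleaned_data (values are invented).
invests <- rbinom(n, 1, 0.4)                        # 1 = firm invests
toy <- data.frame(log_training = invests * runif(n, 1, 10))
toy$log_profits <- 8 + 0.3 * toy$log_training + rnorm(n)

# Share of firms clustered at zero on the x-axis.
zero_share <- mean(toy$log_training == 0)

# Correlation over all firms, then over investing firms only.
cor_all <- cor(toy$log_training, toy$log_profits)
investors <- subset(toy, log_training > 0)
cor_investors <- cor(investors$log_training, investors$log_profits)

print(round(c(zero_share = zero_share,
              cor_all = cor_all,
              cor_investors = cor_investors), 3))
```

Comparing the two correlations shows how much of the overall association is driven by the contrast between investing and non-investing firms rather than by the amount invested.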

# Point 4
# Simple regression analysis: Profits vs. Training Investments
model_training <- lm(log_profits ~ log_training, data = cleaned_data)
summary(model_training)

# Simple regression analysis: Profits vs. Equipment Investments
model_equipment <- lm(log_profits ~ log_equipment, data = cleaned_data)
summary(model_equipment)

There is a positive relationship between profits and investments in both training and equipment. Both models suggest that these investments are positively associated with profits, with equipment investments showing a slightly stronger association. However, the share of the variability in profits explained by each model (its R-squared) is relatively low, which indicates that other factors not included in the models may also influence profitability.
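To compare the explanatory power of two simple regressions side by side, the R-squared values can be pulled out of the model summaries directly. This is a sketch on simulated data: `toy`, `fit_training`, `fit_equipment` and the coefficients used to generate the data are assumptions made for the example, not results from the project.

```r
set.seed(42)
n <- 300

# Simulated stand-in for cleaned_data (values are invented).
toy <- data.frame(log_training  = runif(n, 0, 10),
                  log_equipment = runif(n, 0, 20))
toy$log_profits <- 5 + 0.2 * toy$log_training +
  0.1 * toy$log_equipment + rnorm(n, sd = 2)

# One simple regression per investment type.
fit_training  <- lm(log_profits ~ log_training,  data = toy)
fit_equipment <- lm(log_profits ~ log_equipment, data = toy)

# Collect slope and R-squared for each model in one small table.
comparison <- data.frame(
  model     = c("training", "equipment"),
  slope     = c(coef(fit_training)[["log_training"]],
                coef(fit_equipment)[["log_equipment"]]),
  r_squared = c(summary(fit_training)$r.squared,
                summary(fit_equipment)$r.squared)
)
print(comparison)
```

With the real models, `summary(model_training)$r.squared` and `summary(model_equipment)$r.squared` give the same quantities without reading them off the printed output.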

# Point 5: Y = Profits, X1 = Training and X2 = Equipment
model_multiple <- lm(log_profits ~ log_training + log_equipment, data = cleaned_data)
summary(model_multiple)

The multiple regression analysis shows that both training and equipment investments are positively associated with profits when the other is held constant. The coefficient on training investments (β1) is larger than that on equipment investments (β2), which suggests that training investments may have a relatively stronger association with profits than equipment investments.

Additionally, the model explains approximately 18.42% of the variability in profits, indicating that other factors not included in the model may also influence firm profitability.
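As a reading aid, the model can be written out using the notation from the code comment, with Y as log_profits, X1 as log_training and X2 as log_equipment. The error term and the R-squared formula below are the standard OLS definitions, added here as an assumption about how the reported 18.42% was computed; they are not taken from the regression output itself.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \\
R^2 &= 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2} \approx 0.1842
\end{align*}
```

If the variables are plain natural logs, β1 can be read approximately as the percentage change in profits associated with a 1% increase in training investment, holding equipment investment fixed, and β2 likewise for equipment. This reading is less clean for the firms recorded at zero investment.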

# Point 6: Do point 5 with all other (suitable) control variables
# Industrial_sector holds category codes, so it enters as a factor rather than as a number
model_full <- lm(log_profits ~ log_training + log_equipment + Enterprise_Group +
                   Firm_Age + Export_yes_no + Small_Firm + factor(Industrial_sector) +
                   innovation_yes + Employees_log + R_D_yes,
                 data = cleaned_data)
summary(model_full)
