Business & Data Research

Linear Regression Model - Inferential Statistics

Mahesh Gurumoorthi
September 21, 2024

Linear regression is a regression model that uses a straight line to describe the relationship between a dependent variable and one or more independent variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model, measured as the sum of squared residuals.
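To make "line of best fit" concrete, here is a minimal sketch that computes the slope and intercept of a one-predictor model by hand and compares them with lm(). The x and y values are toy numbers invented for illustration; they are not from the petrol dataset.

```r
# Toy data, invented for illustration only (not the petrol dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for a single predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() should arrive at the same two coefficients
fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print statements should show the same pair of numbers, since lm() minimizes the same sum of squared residuals.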

There are two main types of linear regression:

Simple linear regression uses only one independent variable.
Multiple linear regression uses two or more independent variables.
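In R, the difference between the two types shows up only in the model formula. The sketch below uses mtcars, a dataset that ships with base R, purely as a stand-in; the column names mpg, wt and hp belong to that dataset, not to the petrol data.

```r
# mtcars ships with base R; it is used here only as a stand-in dataset
simple_fit   <- lm(mpg ~ wt, data = mtcars)       # one independent variable
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)  # two independent variables

# A simple model has an intercept plus one slope,
# a multiple model has an intercept plus one slope per predictor
print(length(coef(simple_fit)))
print(length(coef(multiple_fit)))
```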

Step 1: Install the packages and load them into the R session so that we can use them for data modelling.

install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")

library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

Step 2: Load the required dataset and explore it to understand the structure of the data:

petrol <- read.csv("/Users/Library/CloudStorage/OneDrive-Microsoft365/Sample Datasets Kaggle/petrol_consumption.csv")
head(petrol)
cor(petrol)
summary(petrol)
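Before modelling, it can also help to check column types and missing values, because cor() returns NA for any column that contains one. The sketch below assumes a small made-up data frame called petrol_demo that borrows two column names from the dataset; the values are invented.

```r
# Hypothetical stand-in for the loaded data; the values are made up
petrol_demo <- data.frame(
  Paved_Highways     = c(1976, 1250, 1586, NA, 2351),
  Petrol_Consumption = c(541, 524, 561, 414, 410)
)

# Column types and a preview of each column
str(petrol_demo)

# Number of missing values per column
na_counts <- colSums(is.na(petrol_demo))
print(na_counts)

# Restrict the correlation matrix to rows without missing values
print(cor(petrol_demo, use = "complete.obs"))
```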

Step 3: Important: make sure the data meets the OLS (Ordinary Least Squares) assumptions, which are as follows:

Linearity
Homoscedasticity
Zero mean of the errors
No endogeneity
No autocorrelation of errors
No multicollinearity

In this case, we will check the distribution of the dependent variable with a histogram. Strictly speaking, the normality assumption applies to the model's residuals, so this is only a preliminary look. We use one independent variable and one dependent variable, which makes this a simple linear regression model.

hist(petrol$Petrol_Consumption)
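A histogram can be complemented with a Q-Q plot and a Shapiro-Wilk test. The sketch below runs both on simulated values, since the CSV is not available here; the sample size, mean and standard deviation are arbitrary assumptions.

```r
# Simulated values standing in for petrol$Petrol_Consumption
# (size, mean and sd are arbitrary assumptions)
set.seed(42)
consumption_demo <- rnorm(48, mean = 575, sd = 110)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(consumption_demo)
print(sw$p.value)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(consumption_demo)
qqline(consumption_demo)
```

The same two checks can be applied to the residuals of a fitted model, which is where the normality assumption actually matters.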

Step 4: Check the linearity between petrol consumption and paved highways with a scatter plot:

plot(Petrol_Consumption ~ Paved_Highways, data = petrol)

Step 5: Homoscedasticity:
This means that the size of the prediction error does not change significantly across the range of the model's predictions. We can only test this assumption after fitting the linear model.
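One common way to inspect this once a model exists is a residuals-versus-fitted plot. The sketch below uses simulated data with a constant error spread by construction; x, y and fit are illustrative names, not objects from this post.

```r
# Simulated data whose error spread is constant by construction
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(100, sd = 1)
fit <- lm(y ~ x)

# Residuals vs fitted values: a roughly even band around zero is consistent
# with homoscedasticity; a funnel shape would suggest it is violated
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```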

cor(petrol$Petrol_Consumption, petrol$Paved_Highways)
[1] 0.01904194

We use the correlation function on the two variables: paved highways (the independent variable) and petrol consumption (the dependent variable). cor() returns a value between -1 and 1 that tells us how strongly the two variables are linearly related: values near -1 or 1 indicate a strong relationship, and values near 0 indicate a weak one. In this case we got 0.019, which is very close to zero. Squaring it gives about 0.0004, so paved highways accounts for well under 1% of the variance in petrol consumption. We still fit the linear model below to illustrate the workflow, but we should not expect a strong fit.
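The correlation printed above can be turned into a share of explained variance by squaring it, because in a simple linear regression R-squared equals the squared correlation. A short worked calculation:

```r
# Correlation reported by cor() in the output above
r <- 0.01904194

# Share of variance in petrol consumption that a straight-line fit
# on paved highways could explain
r_squared <- r^2
print(r_squared)

# The same share expressed as a percentage
print(100 * r_squared)
```

The result is a small fraction of one percent, which points to a very weak linear relationship.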

Step 6: Fit the linear regression model using the above dataset:

petrol_consumption_highway <- lm(Petrol_Consumption ~ Paved_Highways, data = petrol)
summary(petrol_consumption_highway)
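The pieces of summary() that usually matter most can also be pulled out individually. The sketch below does this on simulated data; demo and fit are illustrative names, and the numbers have nothing to do with the petrol dataset.

```r
# Simulated data for illustration only
set.seed(7)
demo <- data.frame(x = rnorm(50))
demo$y <- 1 + 0.5 * demo$x + rnorm(50)
fit <- lm(y ~ x, data = demo)
s <- summary(fit)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(s)
print(coef_table)

# R-squared: share of variance in y explained by the model
print(s$r.squared)

# 95% confidence intervals for the intercept and slope
print(confint(fit, level = 0.95))
```

A slope whose p-value is large, or whose confidence interval includes zero, gives little evidence of a linear relationship.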

Step 7: Understand the residuals before plotting the regression line

Residuals are the differences between the observed values and the predicted values in a regression analysis. They are calculated as:

Residual = Observed value − Predicted value

Residuals help assess how well a regression model fits the data. If the residuals are small and randomly distributed, it indicates a good fit.
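The definition can be checked directly in R. This minimal sketch uses toy numbers invented for illustration: it subtracts the fitted values from the observed ones and compares the result with resid().

```r
# Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

# Residual = observed value - predicted value
manual_resid <- y - fitted(fit)

# Should print TRUE: the manual calculation matches resid()
print(isTRUE(all.equal(unname(manual_resid), unname(resid(fit)))))

# With an intercept in the model, the residuals sum to (numerically) zero
print(sum(manual_resid))
```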

Step 8: Plot the petrol data to highlight the relationship between the two variables, paved highways vs. petrol consumption:

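# Scatter plot of the raw data with the fitted regression line overlaid;
# the predictor goes on the x-axis to match Petrol_Consumption ~ Paved_Highways
petrol_highway_final <- ggplot(petrol, aes(x = Paved_Highways,
                                           y = Petrol_Consumption)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(petrol_highway_final)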

Key takeaway: From the final chart above, we can see that petrol consumption shows no clear linear trend against paved highways, and there are some outliers that we ignore for now. Together with the correlation of 0.019, this suggests that paved highways alone is a weak predictor of petrol consumption, so any inference drawn from this model should be treated with caution.
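If the fitted model is used to estimate petrol consumption for new values of paved highways, predict() can return a prediction interval alongside the point estimate. The sketch below runs on simulated data that borrows the two column names; the values, and the new observations 2000 and 8000, are assumptions made for illustration.

```r
# Simulated data with no built-in relationship between the two columns
set.seed(3)
demo <- data.frame(Paved_Highways = runif(40, min = 500, max = 15000))
demo$Petrol_Consumption <- 570 + rnorm(40, sd = 100)
fit <- lm(Petrol_Consumption ~ Paved_Highways, data = demo)

# Hypothetical new observations to predict for
new_obs <- data.frame(Paved_Highways = c(2000, 8000))

# Point estimate (fit) with lower (lwr) and upper (upr) 95% prediction bounds
pred <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
print(pred)
```

A wide interval relative to the spread of the data is another sign that the predictor carries little information.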