Linear Regression Model using R Programming
Linear regression is a model that uses a straight line to describe the relationship between a dependent variable and one or more independent variables. It finds the line of best fit through the data by searching for the regression coefficient(s) that minimize the total error of the model.
There are two main types of linear regression:
Simple Linear Regression uses only one independent variable
Multiple Linear Regression uses two or more independent variables
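The difference between the two types is just the model formula. A minimal sketch, using a small simulated data frame (not the insurance data used later):

```r
# Simple vs. multiple linear regression formulas in R.
set.seed(42)
df <- data.frame(
  y  = rnorm(100),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

simple_fit   <- lm(y ~ x1, data = df)       # one predictor
multiple_fit <- lm(y ~ x1 + x2, data = df)  # two predictors

length(coef(simple_fit))    # 2: intercept + one slope
length(coef(multiple_fit))  # 3: intercept + two slopes
```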
Step 1: Install the packages and load them into the global environment so they can be used for data modelling:
install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
Step 2: Load the required dataset and explore it to understand the structure of the data:
insurance_data <- read.csv("/Users/maheshg/Dropbox/Sample Datasets Kaggle/insurance.csv")
head(insurance_data)
summary(insurance_data)
Step 3: Important: make sure the data meets the OLS (Ordinary Least Squares) assumptions, which are as follows:
Linearity
Homoscedasticity (constant error variance)
Zero mean of errors
No endogeneity (predictors uncorrelated with the errors)
No autocorrelation of errors
No multicollinearity
In this case, we will check the distribution of the BMI variable and use one independent and one dependent variable, which is called simple linear regression.
hist(insurance_data$bmi)
Step 4: Check the linearity of the relationship between the BMI and age variables:
plot(bmi ~ age, data = insurance_data)
Step 5: Homoscedasticity:
This means that the prediction error does not change significantly over the range of the model's predictions, so we can test this assumption later, after fitting the linear model.
Next, use the cor() function to measure the correlation between the two variables, bmi and age. In this case we get about 0.109, a weak positive linear relationship: squaring it gives an r² of roughly 0.012, so age explains only about 1% of the variation in BMI. Even so, we proceed to fit a simple linear model to quantify the relationship.
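A common way to test this after fitting is a residuals-vs-fitted plot. A minimal sketch on simulated data (standing in for the insurance file):

```r
# Checking homoscedasticity: plot residuals against fitted values.
set.seed(1)
sim <- data.frame(age = sample(18:64, 200, replace = TRUE))
sim$bmi <- 25 + 0.05 * sim$age + rnorm(200, sd = 4)

fit <- lm(bmi ~ age, data = sim)

# A roughly constant vertical spread across the fitted values
# suggests the constant-variance assumption holds.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```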
cor(insurance_data$bmi, insurance_data$age)
[1] 0.1092719
Step 6: Fit the linear regression model to the dataset:
insurance_age_bmi <- lm(bmi ~ age, data = insurance_data)
summary(insurance_age_bmi)
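The summary() output contains the fitted coefficients and the R-squared, and the fitted model can be used for prediction. A sketch of pulling these out, again on simulated data in place of the insurance file:

```r
# Extracting key quantities from a fitted lm() model.
set.seed(7)
sim <- data.frame(age = sample(18:64, 300, replace = TRUE))
sim$bmi <- 26 + 0.1 * sim$age + rnorm(300, sd = 3)

fit <- lm(bmi ~ age, data = sim)

coef(fit)               # intercept and slope for age
summary(fit)$r.squared  # proportion of variance explained

# Predict BMI for new ages using the fitted line.
predict(fit, newdata = data.frame(age = c(25, 50)))
```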
Step 7: Plot the regression diagnostics (plot() on an lm object produces four diagnostic plots, including residuals vs. fitted):
par(mfrow = c(2,2))
plot(insurance_age_bmi)
Residuals are the differences between the observed values and the values predicted by the regression model. They are calculated as:
Residual = Observed value − Predicted value
Residuals help assess how well a regression model fits the data: if they are small and randomly distributed, the model fits well.
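In R, resid() returns exactly this difference, as the following sketch on simulated data shows:

```r
# Residuals are observed minus fitted values.
set.seed(3)
sim <- data.frame(age = sample(18:64, 50, replace = TRUE))
sim$bmi <- 27 + 0.08 * sim$age + rnorm(50, sd = 2)

fit <- lm(bmi ~ age, data = sim)
manual <- sim$bmi - fitted(fit)  # observed - predicted

all.equal(unname(manual), unname(resid(fit)))  # TRUE
```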
Step 8: Plot the insurance data and highlight the relationship between the two variables; this shows BMI vs. age.
insurance_graph <- ggplot(insurance_data, aes(x = age, y = bmi)) +
  geom_point()
print(insurance_graph)
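To overlay the fitted regression line on the scatter plot, geom_smooth(method = "lm") can be added. A sketch on simulated data standing in for insurance_data:

```r
# Scatter plot with the fitted regression line overlaid.
library(ggplot2)

set.seed(5)
sim <- data.frame(age = sample(18:64, 150, replace = TRUE))
sim$bmi <- 26 + 0.1 * sim$age + rnorm(150, sd = 3)

p <- ggplot(sim, aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # add the OLS line
print(p)
```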