Linear Regression Model using R Programming

Linear regression is a model that uses a straight line to describe the relationship between a dependent variable and one or more independent variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.

There are two main types of linear regression:

  • Simple Linear Regression uses only one independent variable

  • Multiple Linear Regression uses two or more independent variables
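The difference between the two types is only in the model formula passed to lm(). A minimal sketch using simulated data (the variable names y, x1, x2 are hypothetical, not from the insurance dataset):

```r
# Simulated data standing in for any real dataset
set.seed(42)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 + 3 * df$x1 - 1 * df$x2 + rnorm(100, sd = 0.5)

# Simple linear regression: one independent variable
simple_fit <- lm(y ~ x1, data = df)

# Multiple linear regression: two or more independent variables
multiple_fit <- lm(y ~ x1 + x2, data = df)

length(coef(simple_fit))    # 2 coefficients: intercept + one slope
length(coef(multiple_fit))  # 3 coefficients: intercept + two slopes
```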

Step 1 : Install the required packages and load them into the global environment so they are available for data modelling:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")

library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

Step 2 : Load the required dataset and explore it to understand the structure of the data:

insurance_data <- read.csv("/Users/maheshg/Dropbox/Sample Datasets Kaggle/insurance.csv")
head(insurance_data)
summary(insurance_data)

Step 3 : Important : Make sure the data meets the OLS (Ordinary Least Squares) assumptions, which are as follows :

  • Linearity

  • Homoscedasticity (constant error variance)

  • Zero mean of errors

  • No endogeneity (predictors are uncorrelated with the error term)

  • No autocorrelation of errors

  • No multicollinearity among predictors
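A minimal sketch of how some of these assumptions can be checked in base R, shown here on simulated data (the variable names are hypothetical; in practice the fitted insurance model would be used):

```r
# Simulated data that satisfies the OLS assumptions by construction
set.seed(1)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

res <- residuals(fit)

# Zero mean of errors: OLS residuals always average to ~0 by construction
mean(res)

# Homoscedasticity: plot residuals against fitted values and look for
# constant spread (no funnel shape)
plot(fitted(fit), res)

# No autocorrelation of errors: autocorrelation plot of the residuals
acf(res)
```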

In this case, we will check the normality of the bmi variable and use one independent variable (age) and one dependent variable (bmi), which makes this a simple linear regression model.

hist(insurance_data$bmi)
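Beyond eyeballing a histogram, normality can be checked formally with shapiro.test(). The sketch below uses simulated data as a stand-in, since the check depends only on a numeric vector (in practice insurance_data$bmi would be passed in):

```r
# Simulated stand-in for insurance_data$bmi
set.seed(7)
bmi_like <- rnorm(500, mean = 30, sd = 6)

hist(bmi_like)

# Shapiro-Wilk test of normality; a large p-value means the data show
# no strong evidence against normality
shapiro.test(bmi_like)$p.value
```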

Step 4 : Linearity of the dataset between BMI variable and Age variable:

plot(bmi ~ age, data = insurance_data)

Step 5 : Homoscedasticity:
This means that the prediction error does not change significantly over the range of the model's predictions; we can test this assumption after fitting the linear model.

Use the correlation function between the two variables of interest, bmi (dependent) and age (independent). Running cor() tells us how strongly the two variables are linearly related: a value near ±1 indicates a strong linear relationship, while a value near 0 indicates a weak one. In this case we got 0.109, a weak positive correlation (r² ≈ 0.012, so age explains only about 1% of the variance in bmi). A weak correlation does not prevent us from fitting a linear model, but it does suggest the fit will explain little of the variation; we proceed with the linear model with that in mind.

cor(insurance_data$bmi, insurance_data$age)
[1] 0.1092719
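cor() gives only the point estimate; cor.test() also reports whether the correlation differs significantly from zero. A sketch with simulated data (in practice insurance_data$bmi and insurance_data$age would be passed in):

```r
# Simulated age and bmi with a deliberately weak positive relationship
set.seed(3)
age <- sample(18:64, 300, replace = TRUE)
bmi <- 25 + 0.05 * age + rnorm(300, sd = 5)

r  <- cor(bmi, age)   # Pearson correlation point estimate
ct <- cor.test(bmi, age)

r
ct$p.value            # p-value for H0: true correlation equals zero
```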

Step 6 : Fit the linear regression model using the dataset:

insurance_age_bmi <- lm(bmi ~ age, data = insurance_data)
summary(insurance_age_bmi)
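The key numbers in the summary() output can also be extracted programmatically. A sketch with simulated data (the real call would use insurance_age_bmi in place of fit):

```r
# Simulated stand-in for the insurance data
set.seed(9)
age <- sample(18:64, 400, replace = TRUE)
bmi <- 28 + 0.1 * age + rnorm(400, sd = 5)
fit <- lm(bmi ~ age)

coefs <- coef(summary(fit))
coefs["age", "Estimate"]    # slope: expected change in bmi per year of age
coefs["age", "Pr(>|t|)"]    # p-value for the slope
summary(fit)$r.squared      # proportion of variance in bmi explained by age
```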

Step 7 : Plot the regression diagnostics for the fitted model


par(mfrow = c(2,2))
plot(insurance_age_bmi)

Residuals are the differences between the observed values and the predicted values in a regression analysis. They are calculated as:

Residual = Observed value − Predicted value

Residuals help assess how well a regression model fits the data. If the residuals are small and randomly distributed, it indicates a good fit.
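The definition above can be verified directly in R: residuals(model) equals the observed response minus fitted(model). A sketch with simulated data:

```r
# Simulated data; any fitted lm object behaves the same way
set.seed(5)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

manual_res <- y - fitted(fit)   # Observed value - Predicted value
all.equal(as.numeric(manual_res), as.numeric(residuals(fit)))  # TRUE
```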

Step 8 : Plot the insurance data to show the relationship between the two variables, bmi vs age.


insurance_graph <- ggplot(insurance_data, aes(x = age, y = bmi)) +
  geom_point()
print(insurance_graph)
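To overlay the fitted regression line on the scatterplot, geom_smooth(method = "lm") can be added. A sketch using simulated data as a stand-in for insurance_data:

```r
library(ggplot2)

# Simulated stand-in for the insurance dataset
set.seed(11)
fake_insurance <- data.frame(age = sample(18:64, 300, replace = TRUE))
fake_insurance$bmi <- 28 + 0.1 * fake_insurance$age + rnorm(300, sd = 5)

p <- ggplot(fake_insurance, aes(x = age, y = bmi)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # fitted line, no confidence band
print(p)
```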