Predictive Modelling - Random Forest and Linear Regression (Comparison)

Random Forest and Linear Regression Model using Python

Mahesh Gurumoorthi
September 17, 2025

Random Forest Model :

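A Random Forest is a powerful and versatile machine learning model used for both classification and regression tasks. It is part of the ensemble learning family: it trains many decision trees, each on a random bootstrap sample of the data, and combines their outputs (the average for regression, a majority vote for classification). The combined prediction usually generalizes better than any single tree on its own.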
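To make the "combining multiple models" idea concrete, here is a minimal sketch on synthetic data (the data and the names X_demo, y_demo and forest are assumptions for illustration, not part of the Airbnb analysis). It checks that a scikit-learn random forest regressor's prediction is the average of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, assumed purely for illustration (not the Airbnb listings)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Every fitted decision tree is available in forest.estimators_
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Average the 20 tree predictions by hand and compare with the forest's output
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)
print(np.allclose(manual_average, forest_prediction))
```

Individual trees tend to overfit their own bootstrap sample; averaging them smooths out those individual errors.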

Linear Regression Model :

A Linear Regression model is one of the simplest and most widely used techniques in statistics and machine learning for predicting a continuous outcome based on one or more input features.
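As a small illustration, the sketch below fits a linear regression to toy data that follows y = 2x + 1 exactly (the values and the names x_toy, y_toy and line are hypothetical). The fitted slope and intercept should recover that relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 exactly (hypothetical values)
x_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = 2.0 * x_toy.ravel() + 1.0

line = LinearRegression()
line.fit(x_toy, y_toy)

# coef_ holds one weight per input feature, intercept_ the constant term
print("slope:", line.coef_[0])
print("intercept:", line.intercept_)

# Predict a continuous outcome for an unseen input
prediction_at_6 = line.predict(np.array([[6.0]]))[0]
print("prediction for x=6:", prediction_at_6)
```

With several input features, the model learns one coefficient per feature, which is how it is used on the Airbnb data later in this post.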

About the dataset :

This dataset contains publicly available data from the Inside Airbnb project for Los Angeles, California. The data was scraped on June 17, 2025, and provides a detailed snapshot of Airbnb listings in the city at that time.

Inside Airbnb is an independent, non-commercial project that provides data and advocacy about Airbnb's impact on communities around the world.

This dataset includes :

listings.csv: A detailed file containing 79 columns for over 45,000 listings (active and inactive). This file is ideal for deep analysis, including descriptions, amenities, pricing, availability, and host information.

This rich dataset is suitable for a wide range of data science tasks, from simple exploratory data analysis to complex predictive modeling, geospatial analysis, and multimodal projects combining text, numerical, and image data.

Step 1 : Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # pyplot provides plt.figure, plt.plot, plt.show
import seaborn as sns

Step 2 : Reading the dataset using Pandas:

airbnb_data = pd.read_csv('/Users/Sample Datasets Kaggle/listings.csv')

airbnb_data.head()

Step 3 : Validating the dataset and checking whether it has any NA values

airbnb_data.isna().sum()
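With 79 columns, the raw output of isna().sum() is long. The sketch below shows, on a tiny made-up frame (the toy DataFrame and its values are assumptions, not real listings), how the counts can be sorted and turned into a share of missing rows per column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the listings data
toy = pd.DataFrame({
    "price": ["$100.00", None, "$250.00", "$80.00"],
    "bedrooms": [1.0, 2.0, np.nan, np.nan],
    "beds": [1, 2, 3, 1],
})

# Number of missing values per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Fraction of rows missing per column (mean of the True/False mask)
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to airbnb_data, the same two lines would show which columns are too sparse to use as features.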

Step 4 : Describing the Airbnb dataset

airbnb_data.describe()

Step 5 : Checking the shape of the dataset

airbnb_data.shape
(45421, 79)

Step 6: Drawing a random sample of 10,000 listings instead of using the entire dataset

airbnb_data = airbnb_data.sample(10000, random_state=42)
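The random_state argument is what makes the sample reproducible. A minimal sketch on a made-up frame (population and its listing_id column are hypothetical) shows that the same seed returns the same rows on every run.

```python
import pandas as pd

# Made-up population of 1,000 rows standing in for the full listings file
population = pd.DataFrame({"listing_id": range(1000)})

# Same seed, same rows; sampling is without replacement by default
sample_a = population.sample(100, random_state=42)
sample_b = population.sample(100, random_state=42)

print(sample_a.equals(sample_b))
print(len(sample_a), sample_a.index.is_unique)
```

Without a fixed seed, each run would draw different listings, and the reported MSE and R^2 would shift slightly from run to run.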

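Step 7: Importing the random forest library and training the model

from sklearn.ensemble import RandomForestRegressor

# Train a Random Forest model
# Note: X_train and y_train come from the data cleaning and train/test split
# shown in Step 11, so that part must be run before this cell
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)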

Step 8 : Predict on the test set

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

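Step 9: Evaluating the model on the test set

# Evaluate the model
from sklearn.metrics import mean_squared_error, r2_score

mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Test MSE: {mse_rf:.2f}")
print(f"Random Forest Test R^2: {r2_rf:.2f}")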

Random Forest Test MSE: 159157.78
Random Forest Test R^2: 0.58
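To show what these two numbers measure, here is a small worked sketch with four made-up prices (y_true and y_hat are hypothetical values, not taken from the dataset). It computes MSE and R^2 by hand and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Four made-up actual and predicted prices
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 280.0])

# MSE: average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE puts the error back into the same units as the price
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual)
```

Because MSE is in squared price units, its square root is easier to read: the reported MSE of 159157.78 corresponds to an RMSE of roughly 399 in price units.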

Step 10: Plotting actual vs predicted prices for the random forest model

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred_rf, alpha=0.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price (Random Forest)')
plt.title('Actual vs Predicted Airbnb Prices (Random Forest)')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # Diagonal line
plt.show()

Step 11: Cleaning the data, splitting it into train and test sets, and repeating the modelling steps (7 to 10) for the linear regression model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Example: Predicting 'price' using a simple regression model

# 1. Data Cleaning: Remove rows with missing price and select numeric features
airbnb_data_clean = airbnb_data.copy()
airbnb_data_clean = airbnb_data_clean[airbnb_data_clean['price'].notna()]

# Convert price to numeric (remove $ and ,)
airbnb_data_clean['price'] = airbnb_data_clean['price'].replace(r'[\$,]', '', regex=True).astype(float)  # raw string avoids an invalid-escape warning

# Select features for prediction (you can expand this list)
features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews']
airbnb_data_clean = airbnb_data_clean.dropna(subset=features)

X = airbnb_data_clean[features]
y = airbnb_data_clean['price']

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a regression model
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse:.2f}")
print(f"Test R^2: {r2:.2f}")

plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Airbnb Prices')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # Diagonal line
plt.show()

Conclusion: Judging by the test R^2 values, roughly 0.40 for linear regression and 0.58 for random forest, both models underfit the data: they explain only about 40% to 58% of the variance in price. To improve on this, we can try other algorithms such as XGBoost, increase the sample size, or add more features to the model.