Descriptive Statistics + Decision Tree Algorithm - Used Car Price Prediction

Prediction Model - Decision Tree using Python

About the dataset:

One widely used dataset for predictive modelling is the Extensive Used Car Price dataset on Kaggle. It includes:

  • Car Name: Brand and model

  • Year of Manufacture: Used to calculate age

  • Selling Price: Target variable for prediction

  • Present Price: Original price when new

  • Kilometres Driven: Indicator of wear and tear

  • Fuel Type: Petrol, Diesel, or CNG

  • Transmission Type: Manual or Automatic

  • Seller Type: Dealer or Individual

  • Number of Owners: Prior ownership count

This dataset is ideal for regression tasks and supports both linear and tree-based models.

Step 1: Importing Required Libraries and Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Reading the dataset with Pandas and exploring the data

used_car_df = pd.read_csv('/Users/Sample Datasets Kaggle/cardekho_data.csv')
used_car_df.describe()      # parentheses are needed to get the summary statistics
used_car_df.head()
used_car_df.isna()          # boolean mask of missing values
used_car_df.isna().sum()    # missing values per column (isnull() is an alias of isna())
used_car_df.size
2709
used_car_df.ndim
2
used_car_df.shape
(301, 9)
row_count = len(used_car_df)
print(row_count)
301
print (f"There are {row_count} used cars in our dataset")
There are 301 used cars in our dataset
used_car_df.info()
used_car_df.dtypes

Step 3: Exploratory Data Analysis in detail

used_car_df['Owner'].value_counts()                 # distribution of previous owners
used_car_df[used_car_df.duplicated(keep=False)]     # inspect fully duplicated rows
used_car_df = used_car_df.drop_duplicates()
used_car_df = used_car_df.reset_index(drop=True)
used_car_df.iloc[10:20]                             # spot-check a slice of rows
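
A quick check confirms the clean-up worked (a minimal sketch reusing the calls above; duplicated().sum() should now be 0):

# After drop_duplicates() and reset_index(), no duplicate rows should remain
print(used_car_df.duplicated().sum())
print(f"Rows remaining after removing duplicates: {len(used_car_df)}")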

Step 4: Plot the Descriptive Statistics of used car models

plt.figure(figsize=(12, 6))
top_used_cars = used_car_df['Car_Name'].value_counts().nlargest(15)
labels = top_used_cars.index
plt.pie(top_used_cars, labels=labels)
plt.title("Top Used Cars", fontsize=20)
plt.show()
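
The pie chart only covers how often each model appears. Since this section is about descriptive statistics, a distribution plot of the target column also gives a feel for the price range the model will later predict; this is a small sketch using the seaborn import from Step 1 and the Selling_Price column listed in the dataset description:

# Distribution of the target variable Selling_Price
plt.figure(figsize=(10, 5))
sns.histplot(used_car_df['Selling_Price'], bins=30, kde=True)
plt.title("Selling Price Distribution", fontsize=16)
plt.xlabel("Selling Price")
plt.ylabel("Count")
plt.show()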

Step 5: Import the libraries required for the decision tree model

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import LabelEncoder
used_car_df.columns

Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'],
      dtype='object')
label_encoder = LabelEncoder()
categorical_col = ["Fuel_Type", "Seller_Type", "Transmission"]

Step 6: Describe the used car details and encode the categorical columns

# describe() with parentheses returns the summary statistics as a DataFrame
used_car_df.describe()
for col in categorical_col:
    used_car_df[col] = label_encoder.fit_transform(used_car_df[col])
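
If you also want to record which integer code each category receives (useful when interpreting the encoded features later), a small variation of the loop above keeps the fitted mappings. This sketch is meant as a drop-in replacement for that loop rather than something to run after it, because the columns are already numeric once the loop has run; since LabelEncoder sorts classes alphabetically, the fuel types from the dataset description would come out as CNG = 0, Diesel = 1, Petrol = 2.

from sklearn.preprocessing import LabelEncoder

encoders = {}
for col in ["Fuel_Type", "Seller_Type", "Transmission"]:
    enc = LabelEncoder()
    used_car_df[col] = enc.fit_transform(used_car_df[col])
    # classes_ are sorted alphabetically; their position is the assigned integer code
    encoders[col] = {cls: code for code, cls in enumerate(enc.classes_.tolist())}

print(encoders)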

Step 7: Split the data into training and test sets

X = used_car_df.drop(['Selling_Price', 'Car_Name'], axis=1)   # features
Y = used_car_df['Selling_Price']                               # target
X
Y
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=29)
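
A quick sanity check on the split (a minimal sketch; with test_size=0.25, roughly a quarter of the ~300 rows should land in the test set):

# Confirm the 75/25 split and the number of feature columns
print("Training rows:", X_train.shape[0])
print("Test rows:", X_test.shape[0])
print("Feature columns:", X_train.shape[1])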

Step 8: Fit the decision tree model

tree_model = DecisionTreeRegressor()
tree_model.fit(X_train, Y_train)
y_pred = tree_model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(Y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(Y_test, y_pred))
print("R2 Score:", r2_score(Y_test, y_pred))

Mean Squared Error: 1.1092279999999999
Mean Absolute Error: 0.6467999999999999
R2 Score: 0.9315864294744105
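
GridSearchCV was imported in Step 5 but not used above. As a sketch of how it could be applied here, the snippet below cross-validates a few tree depths and leaf sizes and re-scores the best tree on the test set; the grid values are illustrative assumptions, not tuned settings.

# Illustrative hyperparameter search for the decision tree
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=29),
    param_grid,
    cv=5,
    scoring="r2",
)
grid_search.fit(X_train, Y_train)

print("Best parameters:", grid_search.best_params_)
print("Test R2 with best tree:", r2_score(Y_test, grid_search.best_estimator_.predict(X_test)))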

Conclusion: A Strong Supporting Model for Decision-Making

  • The R² score of roughly 0.93 on the held-out test set is high, indicating that the model captures most of the variation in selling price.

  • This level of accuracy is typically sufficient for decision-making in business contexts like pricing, forecasting, or risk scoring.
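
As an illustration of how such a model could support a pricing decision, the sketch below scores one hypothetical car. The feature values, and the integer codes for the encoded columns, are made-up assumptions for demonstration only; the columns mirror the feature matrix X built in Step 7.

# Hypothetical car; the encoded integers depend on the LabelEncoder mappings from Step 6
new_car = pd.DataFrame([{
    "Year": 2015,
    "Present_Price": 7.5,   # price when the car was new
    "Kms_Driven": 45000,
    "Fuel_Type": 2,         # e.g. Petrol, per the encoder mapping
    "Seller_Type": 0,       # e.g. Dealer
    "Transmission": 1,      # e.g. Manual
    "Owner": 0,
}])

predicted_price = tree_model.predict(new_car)[0]
print(f"Suggested listing price: {predicted_price:.2f}")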