Descriptive Statistics + Decision Tree Algorithm - Used Car Predictions
Prediction Model - Decision Tree using Python

About the dataset:
One widely used dataset for predictive modelling is the Extensive Used Car Price dataset on Kaggle. It includes:
Car Name: Brand and model
Year of Manufacture: Used to calculate age
Selling Price: Target variable for prediction
Present Price: Original price when new
Kilometres Driven: Indicator of wear and tear
Fuel Type: Petrol, Diesel, or CNG
Transmission Type: Manual or Automatic
Seller Type: Dealer or Individual
Number of Owners: Prior ownership count
This dataset is ideal for regression tasks and supports both linear and tree-based models.
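The year of manufacture is usually turned into an age feature before modelling. The post itself passes Year to the model unchanged, so the following is only a minimal sketch of that conversion. It assumes the column is named Year, as in the column listing later in the post. The reference year and the Car_Age column name are hypothetical.

```python
import pandas as pd

# Hypothetical reference year; in practice use the year the data was collected.
REFERENCE_YEAR = 2020


def add_car_age(df: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Return a copy of df with a Car_Age column derived from Year."""
    out = df.copy()
    out["Car_Age"] = reference_year - out["Year"]
    return out


# Toy rows for illustration only, not taken from the real dataset.
toy = pd.DataFrame({"Year": [2014, 2017, 2011]})
print(add_car_age(toy))
```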
Step 1: Import the required libraries and packages
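The post does not show the import cell. Below is a minimal sketch of what the following steps presumably need, assuming pandas and matplotlib. The file name car_data.csv and the load_used_cars helper are hypothetical, since the actual path is not given.

```python
import pandas as pd
import matplotlib.pyplot as plt  # used for the plot in Step 4

# Hypothetical path; replace with the location of the downloaded Kaggle CSV.
CSV_PATH = "car_data.csv"


def load_used_cars(path: str = CSV_PATH) -> pd.DataFrame:
    """Read the used car CSV into a DataFrame."""
    return pd.read_csv(path)


# Uncomment once the CSV is available locally:
# used_car_df = load_used_cars()
```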
Step 2: Read the dataset with pandas and explore it
used_car_df.describe()

used_car_df.head()

used_car_df.isna()

used_car_df.isnull().sum()

used_car_df.isna().sum()

used_car_df.size
2709
used_car_df.ndim
2
used_car_df.shape
(301, 9)
row_count = len(used_car_df)
print(row_count)
301
print(f"There are {row_count} used cars in our dataset")
There are 301 used cars in our dataset
used_car_df.info()

used_car_df.dtypes
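The size, ndim, and shape values shown above are related: for a two-dimensional DataFrame, size is the number of rows times the number of columns, which is why 301 × 9 gives 2709. A small sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame for illustration: 4 rows, 3 columns (values are made up).
toy = pd.DataFrame({
    "Year": [2014, 2013, 2017, 2011],
    "Kms_Driven": [27000, 43000, 6900, 5200],
    "Owner": [0, 0, 0, 0],
})

rows, cols = toy.shape
print(toy.ndim)   # number of axes
print(toy.shape)  # (rows, columns)
print(toy.size)   # total number of cells

# size equals rows * columns, and len() returns the row count.
assert toy.size == rows * cols
assert len(toy) == rows
```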

Step 3: Exploratory data analysis in detail
used_car_df['Owner'].value_counts()

used_car_df[used_car_df.duplicated(keep=False)]

used_car_df = used_car_df.drop_duplicates()
used_car_df = used_car_df.reset_index(drop=True)
used_car_df.iloc[10:20]
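To make the duplicate handling above concrete, here is a small sketch on made-up rows. duplicated(keep=False) flags every member of a duplicate group, while drop_duplicates() keeps the first occurrence by default.

```python
import pandas as pd

# Made-up rows; the first and third are identical.
toy = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ritz", "ciaz"],
    "Year": [2014, 2013, 2014, 2017],
    "Selling_Price": [3.35, 4.75, 3.35, 7.25],
})

# keep=False marks all rows that have an identical twin, not just the repeats.
all_dupes = toy[toy.duplicated(keep=False)]
print(all_dupes)

# Drop the repeats (the first occurrence is kept) and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```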

Step 4: Plot descriptive statistics for the used car models
plt.figure(figsize=(12,6))
top_used_cars = used_car_df['Car_Name'].value_counts().nlargest(15)
labels = top_used_cars.index
plt.pie(top_used_cars,labels=labels)
plt.title("Top Used Cars",fontsize = 20)
plt.show()
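A pie chart with fifteen slices can be hard to read. As a possible alternative, a horizontal bar chart of the same counts is sketched below. This is a suggestion rather than part of the original analysis, and the toy Car_Name values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up listings for illustration.
toy = pd.DataFrame({"Car_Name": ["city", "city", "city", "verna", "verna", "ritz"]})

# Same counting logic as the pie chart: the most frequent names, up to 15.
top_cars = toy["Car_Name"].value_counts().nlargest(15)

fig, ax = plt.subplots(figsize=(12, 6))
# Reverse the order so the most common model is drawn at the top.
ax.barh(top_cars.index[::-1], top_cars.values[::-1])
ax.set_xlabel("Number of listings")
ax.set_title("Top Used Cars", fontsize=20)
fig.tight_layout()
# Call plt.show() in a notebook or a script with a display to render the figure.
```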

Step 5: Import the libraries required for the decision tree model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
used_car_df.columns
Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'],
dtype='object')
labels = LabelEncoder()
categorical_col = ["Fuel_Type", "Seller_Type", "Transmission"]
Step 6: Describe the used car data and encode the categorical columns
# Call describe() with parentheses to display the summary statistics
used_car_df.describe()

# Replace each categorical column with integer codes
for col in categorical_col:
    used_car_df[col] = labels.fit_transform(used_car_df[col])
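Because a single LabelEncoder is refitted on each column in the loop above, only the mapping for the last column remains available afterwards. One way to keep all the mappings is sketched below, assuming a separate encoder per column stored in a hypothetical encoders dict. The toy values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using the category names listed in the dataset description.
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})
categorical_col = ["Fuel_Type", "Seller_Type", "Transmission"]

# Hypothetical helper dict: one fitted encoder per column.
encoders = {}
for col in categorical_col:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integer codes in sorted order of the class labels.
for col, enc in encoders.items():
    mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
    print(col, mapping)
```

The integer codes impose an ordering (for example CNG < Diesel < Petrol) that has no real meaning. Tree models usually tolerate this, but one-hot encoding with pd.get_dummies is a common alternative.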
Step 7: Split the data into training and test sets
X = used_car_df.drop(['Selling_Price', 'Car_Name'], axis=1)
Y = used_car_df["Selling_Price"]
X

Y

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=29)
Step 8: Fit the decision tree model
tree_model = DecisionTreeRegressor()
tree_model.fit(X_train,Y_train)
y_pred = tree_model.predict(X_test)
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
print("Squared Error",mean_squared_error(Y_test,y_pred))
print("Absolute Error",mean_absolute_error(Y_test,y_pred))
print("R2 square",r2_score(Y_test,y_pred))
Squared Error 1.1092279999999999
Absolute Error 0.6467999999999999
R2 square 0.9315864294744105
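For reference, the three metrics printed above can be computed by hand. The sketch below does so with NumPy on made-up values (not the post's test set) and checks the results against scikit-learn's functions. It also derives the root mean squared error, which the post does not report.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up values for illustration, not the post's test set.
y_true = np.array([3.0, 5.0, 7.5, 1.0])
y_hat = np.array([2.5, 5.5, 7.0, 2.0])

errors = y_true - y_hat
mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
rmse = np.sqrt(mse)              # back in the units of the target

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

# The hand-computed values agree with scikit-learn's functions.
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
print(mse, mae, rmse, r2)
```

Applied to the output above, the square root of the squared error 1.109 is roughly 1.05, expressed in the same units as Selling_Price.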
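Conclusion: a strong supporting model for decision-making
The R² score of about 0.93 on the held-out test set is high, indicating that the model explains most of the variance in selling price for these cars.
Together with a mean absolute error of about 0.65 (in the units of Selling_Price), this level of accuracy is often sufficient to support decisions in business contexts like pricing, forecasting, or risk scoring, although it comes from a single train/test split and should be confirmed on new data before it is relied on.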
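Step 5 imports GridSearchCV, but the post never uses it. One way to check whether a score holds up across different splits, and to limit overfitting by an unpruned tree, is a cross-validated search over tree depth. The sketch below runs on synthetic data from make_regression, so the parameter grid, the random seeds, and the resulting numbers are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the used car features and prices.
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=29
)

# Illustrative grid; the values are assumptions, not tuned for the real data.
param_grid = {"max_depth": [3, 5, 8, None], "min_samples_leaf": [1, 3, 5]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),  # fixed seed for repeatable results
    param_grid,
    cv=5,           # 5-fold cross-validation on the training set
    scoring="r2",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated R2:", search.best_score_)
print("Held-out R2:", search.score(X_test, y_test))
```

Passing random_state to DecisionTreeRegressor also makes a single fit repeatable, since ties between equally good splits are otherwise broken at random.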