Content-based recommendation system using the fruit dataset, GDP per capita

Recommendation Engine using sklearn

Mahesh Gurumoorthi
September 28, 2025

Recommendation systems play an integral role in shaping our digital interactions. Whether it's online retailers suggesting products aligned with our browsing habits or platforms like Netflix and Spotify curating movies and music tailored to our tastes, these engines harness behavioural data to surface content that resonates with individual preferences. By doing so, they drive user engagement, boost conversions, and deliver a more personalized experience.

While various recommendation approaches exist, this tutorial centers on content-based recommendation systems. Unlike collaborative filtering, which depends on user activity and ratings, content-based methods focus on the attributes of the items themselves. For instance, if you’re exploring resources on “machine learning,” a content-based recommender would suggest other materials—such as articles or books—that delve into similar concepts, methodologies, or case studies, independent of other users’ interactions.

Step 1: Import the required libraries and packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

Step 2: Import the scikit-learn TF-IDF vectorizer and linear kernel for data preprocessing

Apply TF-IDF vectorization to convert the textual data into a matrix in which each description is represented numerically. The TF-IDF transformation emphasizes words that are distinctive to each description, which makes the similarity calculation more meaningful.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

Step 3: Read the dataset using pandas

fruit_dataset = pd.read_csv('/Users/Sample Datasets Kaggle/frtot.csv')

Step 4: Describe the dataset using pandas

fruit_dataset.describe()

Step 5: Review the first rows of the fruit dataset

fruit_dataset.head(10)

Step 6: Perform the exploratory data analysis

fruit_dataset.size
33890

fruit_dataset.shape
(6778, 5)

fruit_dataset.ndim
2
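
The numbers above are consistent with each other: size is the row count times the column count (6778 × 5 = 33890). Before vectorizing text, it can also help to check the text column for missing values and repeats. The sketch below does this on a small invented frame; only the 'id' and 'Commodity' column names come from this post, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the fruit dataset (values invented)
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Commodity": ["Apples-fresh", "Apples-fresh", None, "Peaches-canned"],
})

rows, cols = sample.shape
# size is always rows * columns
print(sample.size == rows * cols)

# Missing text cannot be vectorized, and repeated text will later
# produce similarity scores of exactly 1.0
missing = sample["Commodity"].isna().sum()
duplicates = sample["Commodity"].dropna().duplicated().sum()
print(missing, duplicates)
```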

Step 7: Calculate the cosine similarities

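Use cosine similarity to score how alike two commodities are. It measures the angle between their TF-IDF vectors: a score of 1 means the vectors point in the same direction, and a score of 0 means the two descriptions share no terms. Because TfidfVectorizer L2-normalizes each row by default, the dot product computed by linear_kernel equals the cosine similarity. Here tfidfruit_data_matrix is the TF-IDF matrix obtained by fitting TfidfVectorizer on the dataset's text column; that fitting step is not shown in this post.

# tfidfruit_data_matrix must already exist: the output of TfidfVectorizer.fit_transform on the text column
cosine_similarity = linear_kernel(tfidfruit_data_matrix, tfidfruit_data_matrix)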
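
As a minimal sketch of how the TF-IDF matrix could be built and compared, the example below fits a vectorizer on a toy frame and checks that linear_kernel and scikit-learn's cosine_similarity agree. Using 'Commodity' as the text column is an assumption, and the rows are invented.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical rows; assuming the text to vectorize lives in 'Commodity'
toy_fruit = pd.DataFrame({
    "id": [1, 2, 3],
    "Commodity": ["Apples-fresh", "Apples-juice", "Peaches-canned"],
})

tfidf = TfidfVectorizer()
toy_matrix = tfidf.fit_transform(toy_fruit["Commodity"])

# Rows are L2-normalized by default, so dot products are cosines
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = sk_cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(dot_products, cosines))
print(dot_products.round(2))
```

The first two rows share the term "apples" and so score above zero against each other, while the third row shares no terms with the first and scores zero.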

Step 8: Store the results in a dictionary

results = {}

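# Assumes fruit_dataset has the default 0..n-1 index, so idx matches the matrix row
for idx, row in fruit_dataset.iterrows():
    # Rank every item by similarity to this one, most similar first
    ranked_indices = cosine_similarity[idx].argsort()[::-1]
    # Keep the top 100, explicitly skipping the item itself
    similar_indices = [i for i in ranked_indices if i != idx][:100]
    similar_items = [(cosine_similarity[idx][i], fruit_dataset['id'].iloc[i]) for i in similar_indices]
    results[row['id']] = similar_items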
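
The ordering matters here: argsort returns indices in ascending order of score, so the ranking has to be reversed for the closest match to come first. The sketch below shows the idea on a small invented similarity matrix; top_similar is a hypothetical helper, not part of the original code.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix; row i holds item i's scores
toy_similarity = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.4],
    [0.9, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def top_similar(similarity, idx, n):
    """Return (score, index) pairs for the n items most similar to idx."""
    ranked = similarity[idx].argsort()[::-1]   # highest score first
    ranked = [i for i in ranked if i != idx]   # drop the item itself
    return [(float(similarity[idx][i]), int(i)) for i in ranked[:n]]

print(top_similar(toy_similarity, 0, 2))  # [(0.9, 2), (0.5, 3)]
```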

Step 9: Create the item description

def item(item_id):
    # Return the commodity name for the given id, keeping only the text before the first '-'
    return fruit_dataset.loc[fruit_dataset['id'] == item_id, 'Commodity'].tolist()[0].split('-')[0]

Step 10: Create the recommendation function

def recommendation(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-----")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score: " + str(rec[0]) + ")")

Step 11: Generate the result

recommendation(item_id=20,num=40)

Conclusion: The items printed as "Recommended: Avocados..." and "Peaches and nectarines..." are the ones the system considers similar to the reference item ("Fruit"), based on the TF-IDF representation of their text.
A score of 1.0 is the maximum possible value of the cosine similarity used here (computed with linear_kernel on the TF-IDF matrix). It means the two items have identical TF-IDF vectors, which typically happens when their descriptions contain the same terms.
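
To illustrate why scores of exactly 1.0 can appear, the sketch below uses invented rows in which two different ids carry identical text. Whether the fruit dataset contains such repeats is an assumption; the drop_duplicates step is one possible way to handle them, not something the original code does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical rows: ids 1 and 2 carry identical text
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Commodity": ["Avocados-fresh", "Avocados-fresh", "Peaches and nectarines-fresh"],
})

matrix = TfidfVectorizer().fit_transform(toy["Commodity"])
scores = linear_kernel(matrix, matrix)

print(round(float(scores[0, 1]), 6))  # identical text
print(round(float(scores[0, 2]), 6))  # partial overlap through "fresh"

# One possible remedy: keep a single row per distinct text before vectorizing
unique_items = toy.drop_duplicates(subset="Commodity")
print(len(unique_items))
```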