Natural Language Processing using real and fake datasets

NLP, Real Fake Dataset

Mahesh Gurumoorthi
June 06, 2026

About the dataset :

With the rapid increase in digital information consumption, the spread of misinformation and "fake news" has become a significant societal challenge. Machine learning and Natural Language Processing (NLP) offer powerful tools to automate the detection of such deceptive texts.

This dataset was compiled to provide a large-scale, balanced, and deduplicated benchmark for binary fake news classification. It merges four prominent data sources in the field, filtered to ensure high text quality and to remove exact duplicates.

Step 1 : Importing Required Libraries and packages

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
import sys

Step 2 : Reading the dataset using Pandas:

dataset = pd.read_csv('/Sample Datasets Kaggle/news.csv')

dataset.head()

dataset.size
91514

dataset.shape
(45757, 2)
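
Before modeling, it can help to confirm the properties the dataset description claims (balanced classes, no exact duplicates) and to check for missing text. The following is a minimal sketch on a hypothetical toy frame; the 'REAL'/'FAKE' label values are an assumption, since the post does not show how labels are encoded in news.csv.

```python
import pandas as pd

# Hypothetical toy frame standing in for news.csv; only the column names
# 'text' and 'label' come from the post, the values are made up.
toy = pd.DataFrame({
    "text": ["a b c", "d e f", "a b c", None, "g h i"],
    "label": ["REAL", "FAKE", "REAL", "FAKE", "FAKE"],
})

# Missing values per column
missing_per_column = toy.isna().sum()

# Rows whose text repeats an earlier row
n_duplicates = int(toy.duplicated(subset="text").sum())

# Fraction of rows in each class
class_share = toy["label"].value_counts(normalize=True)

print(missing_per_column)
print("duplicate texts:", n_duplicates)
print(class_share)
```

Running the same three checks on the real `dataset` would show whether any rows need to be dropped before vectorizing, since TfidfVectorizer cannot handle missing text.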

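Step 4 : Importing additional libraries for preprocessing, modeling and evaluation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Metrics used in Steps 8 and 9
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix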

Step 5 : Splitting the training and testing dataset

# split dataset into training and test sets
X = dataset['text']
y = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
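
To illustrate what stratify=y does in the split above, here is a small sketch on hypothetical, deliberately imbalanced data. The label values and the 80/20 class ratio are assumptions chosen only to make the effect visible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "REAL" rows and 20 "FAKE" rows
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": ["REAL"] * 80 + ["FAKE"] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["label"],
    test_size=0.2, random_state=42, stratify=toy["label"],
)

# With stratification, both parts keep the original class proportions
train_fake_share = (y_tr == "FAKE").mean()
test_fake_share = (y_te == "FAKE").mean()
print(train_fake_share, test_fake_share)
```

Without stratification the test share of each class can drift by chance, which matters less for a balanced dataset but costs nothing to guard against.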

Step 6 : Converting raw text into TF-IDF features

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Step 7 : Train logistic regression classifier

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_tfidf, y_train)
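
As an optional alternative to handling the vectorizer and the model separately, scikit-learn's make_pipeline can bundle both into one estimator that accepts raw text. This is only a sketch on made-up sentences with assumed 'REAL'/'FAKE' labels, not the approach used in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training sentences with assumed label values
texts = [
    "senate passed budget",
    "minister said statement",
    "watch shocking video",
    "featured image leaked",
]
labels = ["REAL", "REAL", "FAKE", "FAKE"]

# Same settings as in the post, wrapped into a single estimator
clf = make_pipeline(
    TfidfVectorizer(stop_words="english", max_df=0.8, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(texts, labels)

# The pipeline vectorizes new raw text before predicting
pred = clf.predict(["senate said budget", "shocking image"])
print(list(pred))
```

A pipeline makes it harder to accidentally fit the vectorizer on test data, and it can be passed directly to tools such as cross_val_score.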

Step 8 : Evaluate the model

y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
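
To make the numbers in the report easier to read, here is a small worked example on hand-written predictions. The label values are assumptions; the point is how accuracy, precision and recall are derived.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hand-written toy labels and predictions
y_true = ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE"]
y_hat = ["REAL", "REAL", "FAKE", "FAKE", "FAKE", "REAL"]

# 4 of the 6 predictions match the true label
acc = accuracy_score(y_true, y_hat)

# output_dict=True returns the report as a nested dict instead of text
report = classification_report(y_true, y_hat, output_dict=True)

# Precision for FAKE: of 3 rows predicted FAKE, 2 are truly FAKE
# Recall for FAKE: of 3 truly FAKE rows, 2 were predicted FAKE
print(acc)
print(report["FAKE"]["precision"], report["FAKE"]["recall"])
```

Precision and recall per class are worth reading alongside accuracy, because in fake news detection a missed fake article and a wrongly flagged real article are different kinds of error.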

Step 9 : Display the confusion matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
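
The heatmap above shows only index positions on its axes, so it helps to know how confusion_matrix orders its cells: rows are actual classes and columns are predicted classes, both in sorted label order unless labels is passed. A toy sketch with assumed label values:

```python
from sklearn.metrics import confusion_matrix

# Hand-written toy labels and predictions
y_true = ["REAL", "REAL", "REAL", "FAKE", "FAKE"]
y_hat = ["REAL", "FAKE", "REAL", "FAKE", "FAKE"]

# Fix the row/column order explicitly so the axes are unambiguous
label_order = ["FAKE", "REAL"]
cm_toy = confusion_matrix(y_true, y_hat, labels=label_order)

# Row 0: actual FAKE -> 2 predicted FAKE, 0 predicted REAL
# Row 1: actual REAL -> 1 predicted FAKE, 2 predicted REAL
print(cm_toy)
```

The same list can be passed to sns.heatmap through its xticklabels and yticklabels parameters so that the plot shows class names rather than 0 and 1.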

Step 10: Inspect Top Features

feature_names = vectorizer.get_feature_names_out()
top_positive = sorted(zip(model.coef_[0], feature_names), reverse=True)[:20]
top_negative = sorted(zip(model.coef_[0], feature_names))[:20]
print("Top positive features:", [feature for _, feature in top_positive])
print("Top negative features:", [feature for _, feature in top_negative])

Top positive features: ['said', 'reuters', 'washington reuters', 'washington', 'wednesday', 'president donald', 'tuesday', 'thursday', 'monday', 'republican', 'friday', 'reuters president', 'minister', 'year', 'presidential', 'said statement', 'spokesman', 'told', 'statement', 'democratic']
Top negative features: ['video', 'just', 'hillary', 'featured image', 'featured', 'gop', 'com', 'watch', 'image', 'america', 'like', 'obama', 'fact', 'isis', 'president trump', 'getty', 'read', 'president obama', 'getty images', 'american']