Natural Language Processing using real and fake datasets
NLP, Real Fake Dataset

About the dataset :
With the rapid increase in digital information consumption, the spread of misinformation and "fake news" has become a significant societal challenge. Machine learning and Natural Language Processing (NLP) offer powerful tools to automate the detection of such deceptive texts.
This dataset was compiled to provide a large-scale, balanced, and deduplicated benchmark for binary fake news classification. It merges four prominent data sources in the field, filtered to ensure high text quality and to remove exact duplicates.
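Step 1 : Importing Required Libraries and packages
# pandas loads the data; matplotlib and seaborn draw the confusion matrix in Step 9
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix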
Step 2 : Reading the dataset using Pandas:
dataset = pd.read_csv('/Sample Datasets Kaggle/news.csv')
dataset.head()
dataset.size
91514
dataset.shape
(45757, 2)
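The dataset description promises a balanced, deduplicated table, and that is cheap to verify right after loading. The sketch below is one way to do it. `summarize` is a hypothetical helper (not part of the original post), the toy rows and the `real`/`fake` label values are invented for illustration, and the `text` and `label` column names are taken from the split step further down.

```python
import pandas as pd


def summarize(df: pd.DataFrame, text_col: str = "text", label_col: str = "label") -> dict:
    """Return basic sanity-check figures for a text classification table."""
    return {
        "rows": len(df),
        # Rows with no text at all would break the vectorizer later on
        "missing_text": int(df[text_col].isna().sum()),
        # Counts every repeat of an already-seen text (first occurrence not counted)
        "duplicate_texts": int(df[text_col].duplicated().sum()),
        # Share of each label; values near 0.5 indicate a balanced binary dataset
        "label_share": df[label_col].value_counts(normalize=True).round(3).to_dict(),
    }


# Toy stand-in for the real file; these rows are invented for illustration.
toy = pd.DataFrame({
    "text": ["markets rose today", "aliens endorse senator", "markets rose today", None],
    "label": ["real", "fake", "real", "fake"],
})

summary = summarize(toy)
print(summary)
```

On the real data the call would simply be `summarize(dataset)`. If the description holds, the duplicate count should be zero and the label shares close to one half each.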
Step 4 : Importing Additional libraries to preprocess the data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
Step 5 : Splitting the training and testing dataset
# split dataset into training and test sets
X = dataset['text']
y = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Step 6 : Converting Raw data into TF-IDF Features
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Step 7 : Train logistic regression classifier
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_tfidf, y_train)
Step 8 : Evaluate the model
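The code for this step did not survive in the post, so what follows is only a minimal sketch of how a fitted classifier is commonly evaluated with scikit-learn, not the author's original listing. `evaluate` is a hypothetical helper, and the tiny corpus with its 0/1 labels is invented so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


def evaluate(model, X_features, y_true):
    """Predict on held-out features; return (predictions, accuracy, text report)."""
    y_pred = model.predict(X_features)
    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 avoids warnings when a class is never predicted
    report = classification_report(y_true, y_pred, zero_division=0)
    return y_pred, acc, report


# Toy corpus, invented for illustration (1 and 0 are arbitrary class labels).
train_texts = [
    "officials said in a statement on tuesday",
    "the minister told reporters on monday",
    "watch this shocking video right now",
    "you will not believe this featured image",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a spokesman said on friday", "watch the video"]
test_labels = [1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(vec.fit_transform(train_texts), train_labels)

y_pred, acc, report = evaluate(clf, vec.transform(test_texts), test_labels)
print(f"Accuracy: {acc:.3f}")
print(report)
```

With the variables from the earlier steps, the equivalent call would be `evaluate(model, X_test_tfidf, y_test)`. Because the split in Step 5 is stratified and the dataset is described as balanced, accuracy is a reasonable headline number here, with the per-class precision and recall in the report as a cross-check.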

Step 9 : Display the confusion matrix
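# Predict on the held-out test set, then tabulate actual vs. predicted labels
y_pred = model.predict(X_test_tfidf)
cm = confusion_matrix(y_test, y_pred)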
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
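The heatmap is easier to read once the four cells are tied to numbers. scikit-learn's `confusion_matrix` puts actual labels on the rows and predicted labels on the columns, which matches the axis labels in the plot. The sketch below uses invented label arrays and treats class 1 as the "positive" class, which is an assumption since the post does not show how the labels are encoded.

```python
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 and 1 stand in for the two classes.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# For a binary problem, ravel() flattens the 2x2 matrix row by row:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of everything predicted 1, how much was really 1
recall = tp / (tp + fn)      # of everything really 1, how much was found
accuracy = (tp + tn) / cm.sum()

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The same unpacking works on the `cm` computed from `y_test` and `y_pred` above, as long as the problem stays binary.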
Step 10: Inspect Top Features
feature_names = vectorizer.get_feature_names_out()
top_positive = sorted(zip(model.coef_[0], feature_names), reverse=True)[:20]
top_negative = sorted(zip(model.coef_[0], feature_names))[:20]
print("Top positive features:", [feature for _, feature in top_positive])
print("Top negative features:", [feature for _, feature in top_negative])
Top positive features: ['said', 'reuters', 'washington reuters', 'washington', 'wednesday', 'president donald', 'tuesday', 'thursday', 'monday', 'republican', 'friday', 'reuters president', 'minister', 'year', 'presidential', 'said statement', 'spokesman', 'told', 'statement', 'democratic']
Top negative features: ['video', 'just', 'hillary', 'featured image', 'featured', 'gop', 'com', 'watch', 'image', 'america', 'like', 'obama', 'fact', 'isis', 'president trump', 'getty', 'read', 'president obama', 'getty images', 'american']