7.3. Text Classification Application: Fake News Detection#

  • Author: Johannes Maucher and Marcel Heisler

  • Last update: 13.12.2022

In this notebook conventional Machine Learning algorithms are applied to learn a discriminative model that distinguishes fake news from non-fake news.

What you will learn:

  • Access text from .csv file

  • Preprocess text for classification

  • Calculate BoW matrix

  • Apply conventional machine learning algorithms for fake news detection

  • Evaluation of classifiers

7.3.1. Access Data#

In this notebook a fake-news corpus from Kaggle is used for training and testing Machine Learning algorithms. Download the train file, save it in a directory and load it into a pandas dataframe.

Then inspect the data.

import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv('../Data/fake-news/train.csv',index_col=0)
print(data.shape)
data.head()
(20800, 4)
title author text label
id
0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn Ever get the feeling your life circles the rou... 0
2 Why the Truth Might Get You Fired Consortiumnews.com Why the Truth Might Get You Fired October 29, ... 1
3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss Videos 15 Civilians Killed In Single US Airstr... 1
4 Iranian woman jailed for fictional unpublished... Howard Portnoy Print \nAn Iranian woman has been sentenced to... 1

Above we see that we have 20800 documents (rows in the dataframe) and that for each document we (should) have a title, the author, the text and a label.

Below we can see that the dataset seems to be pretty balanced with about 10400 documents belonging to each category.

bins, counts = np.unique(data.label, return_counts=True)
print(bins, counts)
[0 1] [10387 10413]

If we check for missing values, we find them in each of the columns except for the label.

data.isnull().sum(axis=0)
title      558
author    1957
text        39
label        0
dtype: int64

7.3.2. Data Preparation#

In the following code cells, we rearrange the input texts. First we replace missing values with spaces (' '). Then we concatenate title, author and text into a single column of text, called total. These operations can be applied to the complete dataframe at once; the split into training- and test-data is performed later, before the Bag-of-Word matrices are calculated.

from sklearn.model_selection import train_test_split
data = data.fillna(' ')
data['total'] = data['title'] + ' ' + data['author'] + ' ' + data['text']
data = data[['total', 'label']]
data.head()
total label
id
0 House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 FLYNN: Hillary Clinton, Big Woman on Campus - ... 0
2 Why the Truth Might Get You Fired Consortiumne... 1
3 15 Civilians Killed In Single US Airstrike Hav... 1
4 Iranian woman jailed for fictional unpublished... 1

7.3.3. Preprocessing#

The input texts in column total shall be preprocessed as follows:

  • stopwords shall be removed

  • all characters that are neither alphanumeric nor whitespace shall be removed

  • all characters shall be converted to lower-case

  • all words shall be replaced by their lemma (base form)

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
# The NLTK resources 'punkt', 'stopwords' and 'wordnet' must be available, e.g. via
# nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
for index in data.index:
    sentence = data.loc[index,'total']
    # Remove all characters that are neither alphanumeric nor whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopword removal, lemmatization and lower-casing. Note that the stopword check
    # is applied before lower-casing, so capitalized stopwords such as 'We' survive
    # (see the first cleaned text below).
    words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
    filter_sentence = " ".join(words)
    data.loc[index, 'total'] = filter_sentence

First 5 cleaned texts in the training-dataframe:

data.head()
total label
id
0 house dem aide we didnt even see comeys letter... 1
1 flynn hillary clinton big woman campus breitba... 0
2 why truth might get you fired consortiumnewsco... 1
3 15 civilians killed in single us airstrike hav... 1
4 iranian woman jailed fictional unpublished sto... 1

The official test data would have to be cleaned in exactly the same way as the training-dataframe above. Since in this notebook the test set is obtained by splitting the cleaned training data (see the note at the end of this section), this step is not executed here; a reusable cleaning function is sketched below.
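A minimal sketch of such a function, wrapping the cleaning steps from above; the name clean_total_column and the usage of the Kaggle file test.csv are illustrative assumptions:

def clean_total_column(df):
    """Apply the same cleaning as above to the 'total' column of a dataframe."""
    for index in df.index:
        sentence = df.loc[index, 'total']
        sentence = re.sub(r'[^\w\s]', '', sentence)  # strip punctuation
        words = nltk.word_tokenize(sentence)
        words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
        df.loc[index, 'total'] = " ".join(words)
    return df

# Illustrative usage on the official test file (path is an assumption):
# test_df = pd.read_csv('../Data/fake-news/test.csv', index_col=0).fillna(' ')
# test_df['total'] = test_df['title'] + ' ' + test_df['author'] + ' ' + test_df['text']
# test_df = clean_total_column(test_df)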

7.3.4. Determine Bag-of-Word Matrix for Training- and Test-Data#

In the code-cells below two different types of Bag-of-Word matrices are calculated. The first type contains the term-frequencies, i.e. the entry in row \(i\), column \(j\) is the frequency of word \(j\) in document \(i\). In the second type, the matrix-entries are not the term-frequencies, but the tf-idf-values.
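For reference, with its default setting smooth_idf=True the TfidfTransformer used below computes the inverse document frequency as \(\mathrm{idf}(j) = \ln\frac{1+n}{1+\mathrm{df}(j)} + 1\), where \(n\) is the number of training documents and \(\mathrm{df}(j)\) is the number of documents containing word \(j\). The entry in row \(i\), column \(j\) of the tf-idf matrix is then \(\mathrm{tf}(i,j) \cdot \mathrm{idf}(j)\), and each row is normalized to unit length (l2-norm).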

We also need to split the data now into train and test. Then for each type of matrix (term-frequency or tf-idf) a separate one must be calculated for training and testing. Since we always pretend that only the training-data is known in advance, the matrix-structure, i.e. the columns (= words), depends only on the training-data. This matrix structure is calculated in the lines:

count_vectorizer.fit(X_train)

and

tfidf.fit(freq_term_matrix_train),

respectively. An important parameter of the CountVectorizer-class is min_df. It defines the minimum document frequency of a word: a word is regarded in the BoW-matrix only if it occurs in at least min_df documents. Words which appear in fewer documents are disregarded, as the toy example below illustrates.
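With min_df=2, for instance, only words occurring in at least two of the three toy documents below are kept (the corpus here is made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the cat sat", "the cat ran", "a dog barked"]
cv = CountVectorizer(min_df=2)      # keep only words occurring in >= 2 documents
toy_matrix = cv.fit_transform(toy_corpus)
print(cv.get_feature_names_out())   # ['cat' 'the'] - 'sat', 'ran', 'dog', 'barked' are disregarded
print(toy_matrix.toarray())         # [[1 1] [1 1] [0 0]]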

The training data is then mapped to this structure by

count_vectorizer.transform(X_train)

and

tfidf.transform(X_train),

respectively.

For the test-data, however, no new matrix-structure is calculated. Instead, the test-data is transformed to the structure of the matrix defined by the training data.

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
train_data, test_data, train_labels, test_labels = train_test_split(data['total'], data['label'], train_size=0.8, random_state=1) 
print('Training data size: {}'.format(len(train_data)))
print('Test data size: {}'.format(len(test_data)))
train_data = train_data.values
test_data = test_data.values
train_labels = train_labels.values
test_labels = test_labels.values
Training data size: 16640
Test data size: 4160

Train BoW-models and transform training-data to BoW-matrix:

count_vectorizer = CountVectorizer(min_df=4)
count_vectorizer.fit(train_data)
freq_term_matrix_train = count_vectorizer.transform(train_data)
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix_train)
tf_idf_matrix_train = tfidf.transform(freq_term_matrix_train)
# .shape can be read directly from the sparse matrices; no need for .toarray()
print(freq_term_matrix_train.shape)
print(tf_idf_matrix_train.shape)
(16640, 49086)
(16640, 49086)
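The columns of both matrices correspond to the vocabulary that the CountVectorizer learned from the training data. It can be inspected as follows (get_feature_names_out() requires scikit-learn >= 1.0; older versions provide get_feature_names() instead):

vocab = count_vectorizer.get_feature_names_out()
print(len(vocab))    # 49086, matching the number of columns above
print(vocab[:10])    # the first words in alphabetical order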

Transform test-data to BoW-matrix:

freq_term_matrix_test = count_vectorizer.transform(test_data)
tf_idf_matrix_test = tfidf.transform(freq_term_matrix_test)

7.3.5. Train a Naive Bayes classifier#

We train on term-frequencies first, make predictions on the test set with this classifier and calculate the accuracy. Then we compare this to a classifier trained on tf-idf-values.

nb_tf_classifier = MultinomialNB()
nb_tf_classifier.fit(freq_term_matrix_train, train_labels)
nb_tf_classifier.get_params()
{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'force_alpha': 'warn'}
tf_predictions = nb_tf_classifier.predict(freq_term_matrix_test)
accuracy_score(test_labels, tf_predictions)
0.9276442307692307
nb_tfidf_classifier = MultinomialNB()
nb_tfidf_classifier.fit(tf_idf_matrix_train, train_labels)
tfidf_predictions = nb_tfidf_classifier.predict(tf_idf_matrix_test)
accuracy_score(test_labels, tfidf_predictions)
0.9055288461538461
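The get_params() output above shows that the smoothing parameter alpha is left at its default value of 1.0. Whether another value performs better could be checked with a grid search on the training data; a minimal sketch with an arbitrary choice of candidate values:

from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}   # candidate smoothing values (assumption)
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
grid.fit(freq_term_matrix_train, train_labels)
print(grid.best_params_, grid.best_score_)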

7.3.6. Further evaluation#

print(classification_report(test_labels, tfidf_predictions))
              precision    recall  f1-score   support

           0       0.85      0.99      0.91      2038
           1       0.99      0.83      0.90      2122

    accuracy                           0.91      4160
   macro avg       0.92      0.91      0.91      4160
weighted avg       0.92      0.91      0.91      4160
fig, ax = plt.subplots(figsize=(5, 5))
disp = ConfusionMatrixDisplay.from_estimator(nb_tfidf_classifier, tf_idf_matrix_test, test_labels, display_labels=['reliable', 'fake'], xticks_rotation='vertical', ax=ax)
#disp = ConfusionMatrixDisplay.from_estimator(nb_classifier, tf_idf_matrix_test, test_labels, normalize='true', display_labels=['reliable', 'fake'], xticks_rotation='vertical', ax=ax)
(Figure: confusion matrix of the tf-idf based Naive Bayes classifier on the test data, with classes 'reliable' and 'fake')

7.3.7. Train a linear classifier#

Below a Logistic Regression model is trained. This is just a linear classifier with a sigmoid- (binary case) or softmax- (multi-class case) activation-function.
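In the binary case considered here, the model estimates \(P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)\) with the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\), where \(\mathbf{x}\) is the tf-idf vector of a document and the weights \(\mathbf{w}\) and bias \(b\) are learned from the training data. A document is assigned to class 1 if the estimated probability exceeds 0.5.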

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(tf_idf_matrix_train, train_labels)

logreg_predictions = logreg.predict(tf_idf_matrix_test)

print(classification_report(test_labels, logreg_predictions))

fig, ax = plt.subplots(figsize=(5, 5))
disp = ConfusionMatrixDisplay.from_estimator(logreg, tf_idf_matrix_test, test_labels, display_labels=['reliable', 'fake'], xticks_rotation='vertical', ax=ax)
              precision    recall  f1-score   support

           0       0.97      0.96      0.97      2038
           1       0.96      0.97      0.97      2122

    accuracy                           0.97      4160
   macro avg       0.97      0.97      0.97      4160
weighted avg       0.97      0.97      0.97      4160
(Figure: confusion matrix of the Logistic Regression classifier on the test data, with classes 'reliable' and 'fake')

Note: there is an actual test set provided by the Kaggle challenge. However, the distribution of this test-data seems to differ significantly from the distribution of the training-data: similar preprocessing and classifiers resulted in much worse evaluation scores (about 0.65 accuracy). This is why we only used the official train data and split it to get meaningful results.

Results of a logistic regression model on the actual test.csv:

              precision    recall  f1-score   support

           0       0.59      0.65      0.62      2339
           1       0.69      0.63      0.66      2861

    accuracy                           0.64      5200
   macro avg       0.64      0.64      0.64      5200
weighted avg       0.64      0.64      0.64      5200
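For completeness, the evaluation on the official test set could look like the following sketch. The file paths and the name of the label file are assumptions about the local Kaggle data layout, since the true labels of the test documents are distributed separately from test.csv:

# Load and clean the official test data in the same way as the training data
test_df = pd.read_csv('../Data/fake-news/test.csv', index_col=0).fillna(' ')
test_df['total'] = test_df['title'] + ' ' + test_df['author'] + ' ' + test_df['text']
test_df = clean_total_column(test_df)   # cleaning function sketched in the preprocessing section
# Hypothetical label file; the true labels are not contained in test.csv
kaggle_labels = pd.read_csv('../Data/fake-news/labels.csv', index_col=0)
# Transform with the vectorizers fitted on the training data, then evaluate
freq_term_matrix_kaggle = count_vectorizer.transform(test_df['total'])
tf_idf_matrix_kaggle = tfidf.transform(freq_term_matrix_kaggle)
print(classification_report(kaggle_labels['label'], logreg.predict(tf_idf_matrix_kaggle)))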