7.4. Text Classification Application: Fake News Detection
Author: Johannes Maucher
Last update: 24.11.2020
In this notebook conventional Machine Learning algorithms are applied to learn a discriminator-model for distinguishing fake- and non-fake news.
What you will learn:
Access text from .csv file
Preprocess text for classification
Calculate BoW matrix
Apply conventional machine learning algorithms for fake news detection
Evaluation of classifiers
7.4.1. Access Data
In this notebook a fake-news corpus from Kaggle is applied for training and testing Machine Learning algorithms. Download the 3 files and save them in a directory. The path of this directory shall be assigned to the variable pfad in the following code-cell:
import pandas as pd
pfad="/Users/johannes/DataSets/fake-news/"
train = pd.read_csv(pfad+'train.csv',index_col=0)
test = pd.read_csv(pfad+'test.csv',index_col=0)
test_labels=pd.read_csv(pfad+'submit.csv',index_col=0)
Data in dataframe train is applied for training. The dataframe test contains the texts for testing the model, and the dataframe test_labels contains the true labels of the test-texts.
print("Number of texts in train-dataframe: \t",train.shape[0])
print("Number of columns in train-dataframe: \t",train.shape[1])
train.head()
Number of texts in train-dataframe: 20800
Number of columns in train-dataframe: 4
| id | title | author | text | label |
|---|---|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |
Append the labels, which are contained in dataframe test_labels, to the test-dataframe:
test["label"]=test_labels["label"]
print("Number of texts in test-dataframe: \t",test.shape[0])
print("Number of columns in test-dataframe: \t",test.shape[1])
test.head()
Number of texts in test-dataframe: 5200
Number of columns in test-dataframe: 4
| id | title | author | text | label |
|---|---|---|---|---|
| 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... | 0 |
| 20801 | Russian warships ready to strike terrorists ne... | NaN | Russian warships ready to strike terrorists ne... | 1 |
| 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... | 0 |
| 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... | 1 |
| 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... | 1 |
7.4.2. Data Selection
In the following code cells, first the number of missing-data fields is determined. Then the information in the columns author, title and text is concatenated to a single string, which is saved in the new column total. After this process, only the columns total and label are required; all other columns can be removed from the train- and the test-dataframe.
train.isnull().sum(axis=0)
title 558
author 1957
text 39
label 0
dtype: int64
test.isnull().sum(axis=0)
title 122
author 503
text 7
label 0
dtype: int64
train = train.fillna(' ')
train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']
train = train[['total', 'label']]
train.head()
| id | total | label |
|---|---|---|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |
| 2 | Why the Truth Might Get You Fired Consortiumne... | 1 |
| 3 | 15 Civilians Killed In Single US Airstrike Hav... | 1 |
| 4 | Iranian woman jailed for fictional unpublished... | 1 |
test = test.fillna(' ')
test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']
test = test[['total', 'label']]
7.4.3. Preprocessing
The input texts in column total shall be preprocessed as follows (a small single-sentence illustration is given after the imports below):
stopwords shall be removed
all characters which are neither alphanumeric nor whitespace shall be removed
all characters shall be represented in lower-case
for all words, the lemma (base form) shall be applied
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# The required NLTK resources must be downloaded once, e.g.:
# nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
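As a quick illustration, these preprocessing steps applied to a single (hypothetical) example sentence, reusing stop_words and lemmatizer from above:
sentence = "Russian warships, ready to strike!"       # hypothetical example sentence
sentence = re.sub(r'[^\w\s]', '', sentence)           # remove punctuation
words = nltk.word_tokenize(sentence)                  # ['Russian', 'warships', 'ready', 'to', 'strike']
words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
print(" ".join(words))                                # russian warship ready strike
The same steps are now applied to all texts in the training-dataframe: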
for index in train.index:
    sentence = train.loc[index, 'total']
    # Remove all characters which are neither alphanumeric nor whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopword removal, lemmatization and lower-casing. Note that the stopword
    # check runs on the original casing, so capitalized stopwords (e.g. "We")
    # are kept, as visible in the output below.
    words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
    filter_sentence = " ".join(words)
    train.loc[index, 'total'] = filter_sentence
First 5 cleaned texts in the training-dataframe:
train.head()
| id | total | label |
|---|---|---|
| 0 | house dem aide we didnt even see comeys letter... | 1 |
| 1 | flynn hillary clinton big woman campus breitba... | 0 |
| 2 | why truth might get you fired consortiumnewsco... | 1 |
| 3 | 15 civilians killed in single us airstrike hav... | 1 |
| 4 | iranian woman jailed fictional unpublished sto... | 1 |
Clean data in the test-dataframe in the same way as done for the training-dataframe above:
for index in test.index:
    sentence = test.loc[index, 'total']
    # Remove all characters which are neither alphanumeric nor whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Stopword removal, lemmatization and lower-casing, as for the training texts
    words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
    filter_sentence = " ".join(words)
    test.loc[index, 'total'] = filter_sentence
First 5 cleaned texts in the test-dataframe:
test.head()
| id | total | label |
|---|---|---|
| 20800 | specter trump loosens tongues not purse string... | 0 |
| 20801 | russian warship ready strike terrorist near al... | 1 |
| 20802 | nodapl native american leaders vow stay all wi... | 0 |
| 20803 | tim tebow will attempt another comeback this t... | 1 |
| 20804 | keiser report meme wars e995 truth broadcast n... | 1 |
7.4.4. Determine Bag-of-Words Matrix for Training and Test Data
In the code-cells below two different types of Bag-of-Word matrices are calculated. The first type contains the term-frequencies, i.e. the entry in row \(i\), column \(j\) is the frequency of word \(j\) in document \(i\). In the second type, the matrix-entries are not the term-frequencies, but the tf-idf-values.
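With the default settings of the TfidfTransformer applied below (smooth_idf=True, norm="l2"), the tf-idf value of word \(j\) in document \(i\) is
\[
\text{tfidf}_{i,j} = tf_{i,j} \cdot \left( \ln\frac{1+N}{1+df_j} + 1 \right),
\]
where \(tf_{i,j}\) is the frequency of word \(j\) in document \(i\), \(N\) is the number of documents and \(df_j\) is the number of documents which contain word \(j\). Afterwards each row of the resulting matrix is normalized to unit L2-norm.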
Note that for a given type (term-frequency or tf-idf) a separate matrix must be calculated for training and testing. Since we always pretend that only the training data is known in advance, the matrix structure, i.e. the columns (= words), depends only on the training data. This structure is determined by
count_vectorizer.fit(X_train)
and
tfidf.fit(freq_term_matrix_train),
respectively. An important parameter of the CountVectorizer class is min_df. An integer value assigned to this parameter is the minimum number of documents a word must occur in, such that it is regarded in the BoW-matrix; words with a lower document frequency are disregarded.
The training data is then mapped to this structure by
count_vectorizer.transform(X_train)
and
tfidf.transform(freq_term_matrix_train),
respectively.
For the test-data, however, no new matrix-structure is calculated. Instead, the test-data is transformed to the structure of the matrix defined by the training data.
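The fit/transform pattern and the effect of min_df can be illustrated with a small toy corpus (get_feature_names_out() requires scikit-learn >= 1.0; in older versions the method is named get_feature_names()):
from sklearn.feature_extraction.text import CountVectorizer
toy_train = ["the cat sat", "the cat ran", "a dog ran"]
toy_test = ["the bird sat"]
cv = CountVectorizer(min_df=2)           # keep only words occurring in at least 2 documents
cv.fit(toy_train)                        # vocabulary is derived from the training data only
print(cv.get_feature_names_out())        # ['cat' 'ran' 'the']: 'sat' and 'dog' are too rare
print(cv.transform(toy_test).toarray())  # [[0 0 1]]: 'bird' and 'sat' are not in the vocabulary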
X_train = train['total'].values
y_train = train['label'].values
X_test = test['total'].values
y_test = test['label'].values
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
Train BoW-models and transform training-data to BoW-matrix:
count_vectorizer = CountVectorizer(min_df=4)  # ignore words occurring in fewer than 4 documents
count_vectorizer.fit(X_train)                 # vocabulary is derived from the training data only
freq_term_matrix_train = count_vectorizer.transform(X_train)
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix_train)             # idf-weights are learned from the training data
tf_idf_matrix_train = tfidf.transform(freq_term_matrix_train)
freq_term_matrix_train.shape
(20800, 55055)
tf_idf_matrix_train.shape
(20800, 55055)
Transform test-data to BoW-matrix:
freq_term_matrix_test = count_vectorizer.transform(X_test)
tf_idf_matrix_test = tfidf.transform(freq_term_matrix_test)
7.4.5. Train a Linear Classifier
Below a Logistic Regression model is trained. This is just a linear classifier with a sigmoid- (binary case) or softmax- (multi-class case) activation function.
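For the binary case at hand, the model estimates
\[
P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1+e^{-z}},
\]
where \(\mathbf{x}\) is the tf-idf vector of a document, and the weight-vector \(\mathbf{w}\) and the bias \(b\) are learned from the training data. A document is assigned to class 1 (fake) if the estimated probability exceeds 0.5.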
X_train = tf_idf_matrix_train
X_test = tf_idf_matrix_test
# Alternatively, the raw term-frequency matrices can be applied:
#X_train = freq_term_matrix_train
#X_test = freq_term_matrix_test
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
7.4.6. Evaluate the Trained Model
First, the trained model is applied to predict the class of the training-samples:
y_pred_train = logreg.predict(X_train)
y_pred_train
array([1, 1, 1, ..., 0, 1, 1])
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_train,y_pred_train)
array([[10200, 187],
[ 148, 10265]])
The model’s predictions are compared with the true classes of the training-samples. In the confusion matrix above, the entry in row \(i\), column \(j\) is the number of samples of true class \(i\) which are predicted as class \(j\). The classification-report contains the common metrics for evaluating classifiers:
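For each class, the reported metrics are defined as
\[
\text{precision} = \frac{TP}{TP+FP}, \qquad \text{recall} = \frac{TP}{TP+FN}, \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},
\]
where \(TP\), \(FP\) and \(FN\) denote the numbers of true positives, false positives and false negatives with respect to that class.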
print(classification_report(y_train,y_pred_train))
precision recall f1-score support
0 0.99 0.98 0.98 10387
1 0.98 0.99 0.98 10413
accuracy 0.98 20800
macro avg 0.98 0.98 0.98 20800
weighted avg 0.98 0.98 0.98 20800
The output of the classification report shows that the model is well fitted to the training data, since it predicts the training data with an accuracy of 98%.
However, accuracy on the training-data provides no information on the model’s capability to classify new data. Therefore, the model’s predictions on the test-dataset are calculated below:
y_pred_test = logreg.predict(X_test)
confusion_matrix(y_test,y_pred_test)
array([[1524, 815],
[1061, 1800]])
print(classification_report(y_test,y_pred_test))
precision recall f1-score support
0 0.59 0.65 0.62 2339
1 0.69 0.63 0.66 2861
accuracy 0.64 5200
macro avg 0.64 0.64 0.64 5200
weighted avg 0.64 0.64 0.64 5200
The model’s accuracy on the test-data is weak: the model is overfitted to the training-data, and the distribution of the test-data seems to differ significantly from the distribution of the training-data. This hypothesis can be verified by ignoring the data from test.csv and instead splitting the data from train.csv into a train- and a test-partition. In this modified experiment the performance on the test-partition is much better, because all texts within train.csv originate from the same distribution.
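A minimal sketch of such a check, reusing the variables defined above (for simplicity the BoW-model fitted on all of train.csv is reused; strictly, it should be refitted on the training-partition only):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Hold out 25% of train.csv as test-partition
X_tr, X_val, y_tr, y_val = train_test_split(tf_idf_matrix_train, y_train, test_size=0.25, random_state=42)
clf = LogisticRegression()
clf.fit(X_tr, y_tr)
print("Accuracy on held-out partition of train.csv:", accuracy_score(y_val, clf.predict(X_val)))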