{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification Application: Fake News detection\n",
    "* Author: Johannes Maucher\n",
    "* Last update: 24.11.2020\n",
    "\n",
    "In this notebook conventional Machine Learning algorithms are applied to learn a discriminator-model for distinguishing fake- and non-fake news.\n",
    "\n",
    "What you will learn:\n",
    "* Access text from .csv file\n",
    "* Preprocess text for classification\n",
    "* Calculate BoW matrix\n",
    "* Apply conventional machine learning algorithms for fake news detection\n",
    "* Evaluation of classifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Access Data\n",
    "In this notebook a [fake-news corpus from Kaggle](https://www.kaggle.com/c/fake-news/data) is applied for training and testing Machine Learning algorithms. Download the 3 files and save it in a directory. The path of this directory shall be assigned to the variable `path`in the following code-cell: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pfad=\"/Users/johannes/DataSets/fake-news/\"\n",
    "train = pd.read_csv(pfad+'train.csv',index_col=0)\n",
    "test = pd.read_csv(pfad+'test.csv',index_col=0)\n",
    "test_labels=pd.read_csv(pfad+'submit.csv',index_col=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data in dataframe `train` is applied for training. The dataframe `test`contains the texts for testing the model and the dataframe `test_labels` contains the true labels of the test-texts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of texts in train-dataframe: \t 20800\n",
      "Number of columns in train-dataframe: \t 4\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>author</th>\n",
       "      <th>text</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>House Dem Aide: We Didn’t Even See Comey’s Let...</td>\n",
       "      <td>Darrell Lucus</td>\n",
       "      <td>House Dem Aide: We Didn’t Even See Comey’s Let...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>FLYNN: Hillary Clinton, Big Woman on Campus - ...</td>\n",
       "      <td>Daniel J. Flynn</td>\n",
       "      <td>Ever get the feeling your life circles the rou...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Why the Truth Might Get You Fired</td>\n",
       "      <td>Consortiumnews.com</td>\n",
       "      <td>Why the Truth Might Get You Fired October 29, ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>15 Civilians Killed In Single US Airstrike Hav...</td>\n",
       "      <td>Jessica Purkiss</td>\n",
       "      <td>Videos 15 Civilians Killed In Single US Airstr...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Iranian woman jailed for fictional unpublished...</td>\n",
       "      <td>Howard Portnoy</td>\n",
       "      <td>Print \\nAn Iranian woman has been sentenced to...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                title              author  \\\n",
       "id                                                                          \n",
       "0   House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus   \n",
       "1   FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn   \n",
       "2                   Why the Truth Might Get You Fired  Consortiumnews.com   \n",
       "3   15 Civilians Killed In Single US Airstrike Hav...     Jessica Purkiss   \n",
       "4   Iranian woman jailed for fictional unpublished...      Howard Portnoy   \n",
       "\n",
       "                                                 text  label  \n",
       "id                                                            \n",
       "0   House Dem Aide: We Didn’t Even See Comey’s Let...      1  \n",
       "1   Ever get the feeling your life circles the rou...      0  \n",
       "2   Why the Truth Might Get You Fired October 29, ...      1  \n",
       "3   Videos 15 Civilians Killed In Single US Airstr...      1  \n",
       "4   Print \\nAn Iranian woman has been sentenced to...      1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Number of texts in train-dataframe: \\t\",train.shape[0])\n",
    "print(\"Number of columns in train-dataframe: \\t\",train.shape[1])\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Append the test-dataframe with the labels, which are contained in dataframe `test_labels`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "test[\"label\"]=test_labels[\"label\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of texts in test-dataframe: \t 5200\n",
      "Number of columns in test-dataframe: \t 4\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>author</th>\n",
       "      <th>text</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20800</th>\n",
       "      <td>Specter of Trump Loosens Tongues, if Not Purse...</td>\n",
       "      <td>David Streitfeld</td>\n",
       "      <td>PALO ALTO, Calif.  —   After years of scorning...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20801</th>\n",
       "      <td>Russian warships ready to strike terrorists ne...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Russian warships ready to strike terrorists ne...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20802</th>\n",
       "      <td>#NoDAPL: Native American Leaders Vow to Stay A...</td>\n",
       "      <td>Common Dreams</td>\n",
       "      <td>Videos #NoDAPL: Native American Leaders Vow to...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20803</th>\n",
       "      <td>Tim Tebow Will Attempt Another Comeback, This ...</td>\n",
       "      <td>Daniel Victor</td>\n",
       "      <td>If at first you don’t succeed, try a different...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20804</th>\n",
       "      <td>Keiser Report: Meme Wars (E995)</td>\n",
       "      <td>Truth Broadcast Network</td>\n",
       "      <td>42 mins ago 1 Views 0 Comments 0 Likes 'For th...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                   title  \\\n",
       "id                                                         \n",
       "20800  Specter of Trump Loosens Tongues, if Not Purse...   \n",
       "20801  Russian warships ready to strike terrorists ne...   \n",
       "20802  #NoDAPL: Native American Leaders Vow to Stay A...   \n",
       "20803  Tim Tebow Will Attempt Another Comeback, This ...   \n",
       "20804                    Keiser Report: Meme Wars (E995)   \n",
       "\n",
       "                        author  \\\n",
       "id                               \n",
       "20800         David Streitfeld   \n",
       "20801                      NaN   \n",
       "20802            Common Dreams   \n",
       "20803            Daniel Victor   \n",
       "20804  Truth Broadcast Network   \n",
       "\n",
       "                                                    text  label  \n",
       "id                                                               \n",
       "20800  PALO ALTO, Calif.  —   After years of scorning...      0  \n",
       "20801  Russian warships ready to strike terrorists ne...      1  \n",
       "20802  Videos #NoDAPL: Native American Leaders Vow to...      0  \n",
       "20803  If at first you don’t succeed, try a different...      1  \n",
       "20804  42 mins ago 1 Views 0 Comments 0 Likes 'For th...      1  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Number of texts in test-dataframe: \\t\",test.shape[0])\n",
    "print(\"Number of columns in test-dataframe: \\t\",test.shape[1])\n",
    "test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Selection\n",
    "\n",
    "In the following code cells, first the number of missing-data fields is determined. Then the information in columns `author`, `title` and `text` are concatenated to a single string, which is saved in the column `total`. After this process, only columns `total` and `label` are required, all other columns can be removed in the `train`- and the `test`-dataframe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "title      558\n",
       "author    1957\n",
       "text        39\n",
       "label        0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.isnull().sum(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "title     122\n",
       "author    503\n",
       "text        7\n",
       "label       0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.isnull().sum(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = train.fillna(' ')\n",
    "train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = train[['total', 'label']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>House Dem Aide: We Didn’t Even See Comey’s Let...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>FLYNN: Hillary Clinton, Big Woman on Campus - ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Why the Truth Might Get You Fired Consortiumne...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>15 Civilians Killed In Single US Airstrike Hav...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Iranian woman jailed for fictional unpublished...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                total  label\n",
       "id                                                          \n",
       "0   House Dem Aide: We Didn’t Even See Comey’s Let...      1\n",
       "1   FLYNN: Hillary Clinton, Big Woman on Campus - ...      0\n",
       "2   Why the Truth Might Get You Fired Consortiumne...      1\n",
       "3   15 Civilians Killed In Single US Airstrike Hav...      1\n",
       "4   Iranian woman jailed for fictional unpublished...      1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "test = test.fillna(' ')\n",
    "test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']\n",
    "test = test[['total', 'label']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "The input texts in column `total` shall be preprocessed as follows:\n",
    "* stopwords shall be removed\n",
    "* all characters, which are neither alpha-numeric nor whitespaces, shall be removed\n",
    "* all characters shall be represented in lower-case.\n",
    "* for all words, the lemma (base-form) shall be applied"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words = stopwords.words('english')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer = WordNetLemmatizer()\n",
    "for index in train.index:\n",
    "    #filter_sentence = ''\n",
    "    sentence = train.loc[index,'total']\n",
    "    # Cleaning the sentence with regex\n",
    "    sentence = re.sub(r'[^\\w\\s]', '', sentence)\n",
    "    # Tokenization\n",
    "    words = nltk.word_tokenize(sentence)\n",
    "    # Stopwords removal\n",
    "    words = [lemmatizer.lemmatize(w).lower() for w in words if not w in stop_words]\n",
    "    filter_sentence = \" \".join(words)\n",
    "    train.loc[index, 'total'] = filter_sentence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First 5 cleaned texts in the training-dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>house dem aide we didnt even see comeys letter...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>flynn hillary clinton big woman campus breitba...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>why truth might get you fired consortiumnewsco...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>15 civilians killed in single us airstrike hav...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>iranian woman jailed fictional unpublished sto...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                total  label\n",
       "id                                                          \n",
       "0   house dem aide we didnt even see comeys letter...      1\n",
       "1   flynn hillary clinton big woman campus breitba...      0\n",
       "2   why truth might get you fired consortiumnewsco...      1\n",
       "3   15 civilians killed in single us airstrike hav...      1\n",
       "4   iranian woman jailed fictional unpublished sto...      1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean data in the test-dataframe in the same way as done for the training-dataframe above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer = WordNetLemmatizer()\n",
    "for index in test.index:\n",
    "    #filter_sentence = ''\n",
    "    sentence = test.loc[index,'total']\n",
    "    # Cleaning the sentence with regex\n",
    "    sentence = re.sub(r'[^\\w\\s]', '', sentence)\n",
    "    # Tokenization\n",
    "    words = nltk.word_tokenize(sentence)\n",
    "    # Stopwords removal\n",
    "    words = [lemmatizer.lemmatize(w).lower() for w in words if not w in stop_words]\n",
    "    filter_sentence = \" \".join(words)\n",
    "    test.loc[index, 'total'] = filter_sentence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First 5 cleaned texts in the test-dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>20800</th>\n",
       "      <td>specter trump loosens tongues not purse string...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20801</th>\n",
       "      <td>russian warship ready strike terrorist near al...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20802</th>\n",
       "      <td>nodapl native american leaders vow stay all wi...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20803</th>\n",
       "      <td>tim tebow will attempt another comeback this t...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20804</th>\n",
       "      <td>keiser report meme wars e995 truth broadcast n...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                   total  label\n",
       "id                                                             \n",
       "20800  specter trump loosens tongues not purse string...      0\n",
       "20801  russian warship ready strike terrorist near al...      1\n",
       "20802  nodapl native american leaders vow stay all wi...      0\n",
       "20803  tim tebow will attempt another comeback this t...      1\n",
       "20804  keiser report meme wars e995 truth broadcast n...      1"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Determine Bag-of-Word Matrix for Training- and Test-Data\n",
    "In the code-cells below two different types of Bag-of-Word matrices are calculated. The first type contains the **term-frequencies**, i.e. the entry in row $i$, column $j$ is the frequency of word $j$ in document $i$. In the second type, the matrix-entries are not the term-frequencies, but the tf-idf-values. \n",
    "\n",
    "Note that for a given typ (term-frequency or tf-idf) a separate matrix must be calculated for training and testing. Since we always pretend, that only training-data is known in advance, the matrix-structure, i.e. the columns (= words) depends only on the training-data. This matrix structure is calculated in the row:\n",
    "\n",
    "```\n",
    "count_vectorizer.fit(X_train)\n",
    "```\n",
    "and\n",
    "```\n",
    "tfidf.fit(freq_term_matrix_train),\n",
    "```\n",
    "respectively. An important parameter of the `CountVectorizer`-class is `min_df`. The value, which is assigned to this parameter is the minimum frequency of a word, such that it is regarded in the BoW-matrix. Words, which appear less often are disregarded.\n",
    "\n",
    "The training data is then mapped to this structure by \n",
    "```\n",
    "count_vectorizer.transform(X_train)\n",
    "```\n",
    "and\n",
    "```\n",
    "tfidf.transform(X_train),\n",
    "```\n",
    "respectively.\n",
    "\n",
    "For the test-data, however, no new matrix-structure is calculated. Instead the test-data is transformed to the structure of the matrix, defined by the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = train['total'].values\n",
    "y_train = train['label'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = test['total'].values\n",
    "y_test = test['label'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train BoW-models and transform training-data to BoW-matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_vectorizer = CountVectorizer(min_df=4)\n",
    "count_vectorizer.fit(X_train)\n",
    "freq_term_matrix_train = count_vectorizer.transform(X_train)\n",
    "tfidf = TfidfTransformer(norm = \"l2\")\n",
    "tfidf.fit(freq_term_matrix_train)\n",
    "tf_idf_matrix_train = tfidf.transform(freq_term_matrix_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(20800, 55055)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "freq_term_matrix_train.toarray().shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(20800, 55055)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf_idf_matrix_train.toarray().shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transform test-data to BoW-matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "freq_term_matrix_test = count_vectorizer.transform(X_test)\n",
    "tf_idf_matrix_test = tfidf.transform(freq_term_matrix_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a linear classifier\n",
    "\n",
    "Below a [Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) is trained. This is just a linear classifier with a sigmoid- or softmax- activity-function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train=tf_idf_matrix_train\n",
    "X_test=tf_idf_matrix_test\n",
    "#X_train=freq_term_matrix_train\n",
    "#X_test=freq_term_matrix_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "                   intercept_scaling=1, l1_ratio=None, max_iter=100,\n",
       "                   multi_class='auto', n_jobs=None, penalty='l2',\n",
       "                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n",
       "                   warm_start=False)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "logreg = LogisticRegression()\n",
    "logreg.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate trained model\n",
    "First, the trained model is applied to predict the class of the training-samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_train = logreg.predict(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 0, 1, 1])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[10200,   187],\n",
       "       [  148, 10265]])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "confusion_matrix(y_train,y_pred_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model's prediction are compared with the true classes of the training-samples. The classification-report contains the common metrics for evaluating classifiers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.99      0.98      0.98     10387\n",
      "           1       0.98      0.99      0.98     10413\n",
      "\n",
      "    accuracy                           0.98     20800\n",
      "   macro avg       0.98      0.98      0.98     20800\n",
      "weighted avg       0.98      0.98      0.98     20800\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(y_train,y_pred_train))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output of the classification report shows, that the model is well fitted to the training data, since it predicts training data with an accuracy of 98%.\n",
    "\n",
    "However, accuracy on the training-data, provides no information on the model's capability to classify new data. Therefore, below the model's prediction on the test-dataset is calculated:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_test = logreg.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1524,  815],\n",
       "       [1061, 1800]])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "confusion_matrix(y_test,y_pred_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.59      0.65      0.62      2339\n",
      "           1       0.69      0.63      0.66      2861\n",
      "\n",
      "    accuracy                           0.64      5200\n",
      "   macro avg       0.64      0.64      0.64      5200\n",
      "weighted avg       0.64      0.64      0.64      5200\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(y_test,y_pred_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model's accuracy on the test-data is weak. The model is overfitted on the training-data. It seems that the distribution of test-data is significantly different from the distribution of training-data. This hypothesis can be verified by ignoring the data from `test.csv` and instead split data from `train.csv` into a train- and a test-partition. In this modified experiment performance on test-data is much better, because the texts within `train.csv` origin from the same distributions. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3 (default, Jul 25 2020, 13:03:44) \n[GCC 8.3.0]"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}