{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification Application: Fake News detection\n", "* Author: Johannes Maucher\n", "* Last update: 24.11.2020\n", "\n", "In this notebook conventional Machine Learning algorithms are applied to learn a discriminator-model for distinguishing fake- and non-fake news.\n", "\n", "What you will learn:\n", "* Access text from .csv files\n", "* Preprocess text for classification\n", "* Calculate BoW matrix\n", "* Apply conventional machine learning algorithms for fake news detection\n", "* Evaluate classifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access Data\n", "In this notebook a [fake-news corpus from Kaggle](https://www.kaggle.com/c/fake-news/data) is used for training and testing Machine Learning algorithms. Download the 3 files and save them in a directory. The path of this directory must be assigned to the variable `pfad` in the following code-cell: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "# pfad (German for \"path\"): directory that contains the 3 downloaded files\n", "pfad=\"/Users/johannes/DataSets/fake-news/\"\n", "train = pd.read_csv(pfad+'train.csv',index_col=0)\n", "test = pd.read_csv(pfad+'test.csv',index_col=0)\n", "test_labels=pd.read_csv(pfad+'submit.csv',index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in dataframe `train` is used for training. The dataframe `test` contains the texts for testing the model and the dataframe `test_labels` contains the true labels of the test-texts. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of texts in train-dataframe: \t 20800\n", "Number of columns in train-dataframe: \t 4\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>title</th><th>author</th><th>text</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>House Dem Aide: We Didn’t Even See Comey’s Let...</td><td>Darrell Lucus</td><td>House Dem Aide: We Didn’t Even See Comey’s Let...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>FLYNN: Hillary Clinton, Big Woman on Campus - ...</td><td>Daniel J. Flynn</td><td>Ever get the feeling your life circles the rou...</td><td>0</td></tr>\n",
" <tr><th>2</th><td>Why the Truth Might Get You Fired</td><td>Consortiumnews.com</td><td>Why the Truth Might Get You Fired October 29, ...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>15 Civilians Killed In Single US Airstrike Hav...</td><td>Jessica Purkiss</td><td>Videos 15 Civilians Killed In Single US Airstr...</td><td>1</td></tr>\n",
" <tr><th>4</th><td>Iranian woman jailed for fictional unpublished...</td><td>Howard Portnoy</td><td>Print \\nAn Iranian woman has been sentenced to...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title author \\\n", "id \n", "0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus \n", "1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn \n", "2 Why the Truth Might Get You Fired Consortiumnews.com \n", "3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss \n", "4 Iranian woman jailed for fictional unpublished... Howard Portnoy \n", "\n", " text label \n", "id \n", "0 House Dem Aide: We Didn’t Even See Comey’s Let... 1 \n", "1 Ever get the feeling your life circles the rou... 0 \n", "2 Why the Truth Might Get You Fired October 29, ... 1 \n", "3 Videos 15 Civilians Killed In Single US Airstr... 1 \n", "4 Print \\nAn Iranian woman has been sentenced to... 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Number of texts in train-dataframe: \\t\",train.shape[0])\n", "print(\"Number of columns in train-dataframe: \\t\",train.shape[1])\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append the test-dataframe with the labels, which are contained in dataframe `test_labels`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "test[\"label\"]=test_labels[\"label\"]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of texts in test-dataframe: \t 5200\n", "Number of columns in test-dataframe: \t 4\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>title</th><th>author</th><th>text</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>20800</th><td>Specter of Trump Loosens Tongues, if Not Purse...</td><td>David Streitfeld</td><td>PALO ALTO, Calif. — After years of scorning...</td><td>0</td></tr>\n",
" <tr><th>20801</th><td>Russian warships ready to strike terrorists ne...</td><td>NaN</td><td>Russian warships ready to strike terrorists ne...</td><td>1</td></tr>\n",
" <tr><th>20802</th><td>#NoDAPL: Native American Leaders Vow to Stay A...</td><td>Common Dreams</td><td>Videos #NoDAPL: Native American Leaders Vow to...</td><td>0</td></tr>\n",
" <tr><th>20803</th><td>Tim Tebow Will Attempt Another Comeback, This ...</td><td>Daniel Victor</td><td>If at first you don’t succeed, try a different...</td><td>1</td></tr>\n",
" <tr><th>20804</th><td>Keiser Report: Meme Wars (E995)</td><td>Truth Broadcast Network</td><td>42 mins ago 1 Views 0 Comments 0 Likes 'For th...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " title \\\n", "id \n", "20800 Specter of Trump Loosens Tongues, if Not Purse... \n", "20801 Russian warships ready to strike terrorists ne... \n", "20802 #NoDAPL: Native American Leaders Vow to Stay A... \n", "20803 Tim Tebow Will Attempt Another Comeback, This ... \n", "20804 Keiser Report: Meme Wars (E995) \n", "\n", " author \\\n", "id \n", "20800 David Streitfeld \n", "20801 NaN \n", "20802 Common Dreams \n", "20803 Daniel Victor \n", "20804 Truth Broadcast Network \n", "\n", " text label \n", "id \n", "20800 PALO ALTO, Calif. — After years of scorning... 0 \n", "20801 Russian warships ready to strike terrorists ne... 1 \n", "20802 Videos #NoDAPL: Native American Leaders Vow to... 0 \n", "20803 If at first you don’t succeed, try a different... 1 \n", "20804 42 mins ago 1 Views 0 Comments 0 Likes 'For th... 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Number of texts in test-dataframe: \\t\",test.shape[0])\n", "print(\"Number of columns in test-dataframe: \\t\",test.shape[1])\n", "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Selection\n", "\n", "In the following code cells, first the number of missing-data fields is determined. Then the information in columns `author`, `title` and `text` are concatenated to a single string, which is saved in the column `total`. After this process, only columns `total` and `label` are required, all other columns can be removed in the `train`- and the `test`-dataframe. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "title 558\n", "author 1957\n", "text 39\n", "label 0\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.isnull().sum(axis=0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "title 122\n", "author 503\n", "text 7\n", "label 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.isnull().sum(axis=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train = train.fillna(' ')\n", "train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "train = train[['total', 'label']]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>total</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>House Dem Aide: We Didn’t Even See Comey’s Let...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>FLYNN: Hillary Clinton, Big Woman on Campus - ...</td><td>0</td></tr>\n",
" <tr><th>2</th><td>Why the Truth Might Get You Fired Consortiumne...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>15 Civilians Killed In Single US Airstrike Hav...</td><td>1</td></tr>\n",
" <tr><th>4</th><td>Iranian woman jailed for fictional unpublished...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " total label\n", "id \n", "0 House Dem Aide: We Didn’t Even See Comey’s Let... 1\n", "1 FLYNN: Hillary Clinton, Big Woman on Campus - ... 0\n", "2 Why the Truth Might Get You Fired Consortiumne... 1\n", "3 15 Civilians Killed In Single US Airstrike Hav... 1\n", "4 Iranian woman jailed for fictional unpublished... 1" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "test = test.fillna(' ')\n", "test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']\n", "test = test[['total', 'label']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "The input texts in column `total` shall be preprocessed as follows:\n", "* stopwords shall be removed\n", "* all characters, which are neither alpha-numeric nor whitespaces, shall be removed\n", "* all characters shall be represented in lower-case.\n", "* for all words, the lemma (base-form) shall be applied" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.corpus import stopwords\n", "from nltk.stem import WordNetLemmatizer\n", "import re" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "stop_words = stopwords.words('english')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "for index in train.index:\n", " #filter_sentence = ''\n", " sentence = train.loc[index,'total']\n", " # Cleaning the sentence with regex\n", " sentence = re.sub(r'[^\\w\\s]', '', sentence)\n", " # Tokenization\n", " words = nltk.word_tokenize(sentence)\n", " # Stopwords removal\n", " words = [lemmatizer.lemmatize(w).lower() for w in words if not w in stop_words]\n", " filter_sentence = \" \".join(words)\n", " train.loc[index, 'total'] = 
filter_sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First 5 cleaned texts in the training-dataframe:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>total</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>house dem aide we didnt even see comeys letter...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>flynn hillary clinton big woman campus breitba...</td><td>0</td></tr>\n",
" <tr><th>2</th><td>why truth might get you fired consortiumnewsco...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>15 civilians killed in single us airstrike hav...</td><td>1</td></tr>\n",
" <tr><th>4</th><td>iranian woman jailed fictional unpublished sto...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " total label\n", "id \n", "0 house dem aide we didnt even see comeys letter... 1\n", "1 flynn hillary clinton big woman campus breitba... 0\n", "2 why truth might get you fired consortiumnewsco... 1\n", "3 15 civilians killed in single us airstrike hav... 1\n", "4 iranian woman jailed fictional unpublished sto... 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clean data in the test-dataframe in the same way as done for the training-dataframe above:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "for index in test.index:\n", " #filter_sentence = ''\n", " sentence = test.loc[index,'total']\n", " # Cleaning the sentence with regex\n", " sentence = re.sub(r'[^\\w\\s]', '', sentence)\n", " # Tokenization\n", " words = nltk.word_tokenize(sentence)\n", " # Stopwords removal\n", " words = [lemmatizer.lemmatize(w).lower() for w in words if not w in stop_words]\n", " filter_sentence = \" \".join(words)\n", " test.loc[index, 'total'] = filter_sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First 5 cleaned texts in the test-dataframe:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>total</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>20800</th><td>specter trump loosens tongues not purse string...</td><td>0</td></tr>\n",
" <tr><th>20801</th><td>russian warship ready strike terrorist near al...</td><td>1</td></tr>\n",
" <tr><th>20802</th><td>nodapl native american leaders vow stay all wi...</td><td>0</td></tr>\n",
" <tr><th>20803</th><td>tim tebow will attempt another comeback this t...</td><td>1</td></tr>\n",
" <tr><th>20804</th><td>keiser report meme wars e995 truth broadcast n...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " total label\n", "id \n", "20800 specter trump loosens tongues not purse string... 0\n", "20801 russian warship ready strike terrorist near al... 1\n", "20802 nodapl native american leaders vow stay all wi... 0\n", "20803 tim tebow will attempt another comeback this t... 1\n", "20804 keiser report meme wars e995 truth broadcast n... 1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Determine Bag-of-Word Matrix for Training- and Test-Data\n", "In the code-cells below two different types of Bag-of-Word matrices are calculated. The first type contains the **term-frequencies**, i.e. the entry in row $i$, column $j$ is the frequency of word $j$ in document $i$. In the second type, the matrix-entries are not the term-frequencies, but the tf-idf-values. \n", "\n", "Note that for a given typ (term-frequency or tf-idf) a separate matrix must be calculated for training and testing. Since we always pretend, that only training-data is known in advance, the matrix-structure, i.e. the columns (= words) depends only on the training-data. This matrix structure is calculated in the row:\n", "\n", "```\n", "count_vectorizer.fit(X_train)\n", "```\n", "and\n", "```\n", "tfidf.fit(freq_term_matrix_train),\n", "```\n", "respectively. An important parameter of the `CountVectorizer`-class is `min_df`. The value, which is assigned to this parameter is the minimum frequency of a word, such that it is regarded in the BoW-matrix. Words, which appear less often are disregarded.\n", "\n", "The training data is then mapped to this structure by \n", "```\n", "count_vectorizer.transform(X_train)\n", "```\n", "and\n", "```\n", "tfidf.transform(X_train),\n", "```\n", "respectively.\n", "\n", "For the test-data, however, no new matrix-structure is calculated. Instead the test-data is transformed to the structure of the matrix, defined by the training data." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "X_train = train['total'].values\n", "y_train = train['label'].values" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "X_test = test['total'].values\n", "y_test = test['label'].values" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train BoW-models and transform training-data to BoW-matrix:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Learn the vocabulary (= matrix columns) from the training data only;\n", "# min_df=4 keeps only words that occur in at least 4 training documents\n", "count_vectorizer = CountVectorizer(min_df=4)\n", "count_vectorizer.fit(X_train)\n", "freq_term_matrix_train = count_vectorizer.transform(X_train)\n", "# Learn the idf-weights from the training term-frequency matrix\n", "tfidf = TfidfTransformer(norm = \"l2\")\n", "tfidf.fit(freq_term_matrix_train)\n", "tf_idf_matrix_train = tfidf.transform(freq_term_matrix_train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20800, 55055)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# .shape is available on the sparse matrix; no dense conversion is needed\n", "freq_term_matrix_train.shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20800, 55055)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tf_idf_matrix_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transform test-data to BoW-matrix:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Reuse the vocabulary and idf-weights fitted on the training data\n", "freq_term_matrix_test = count_vectorizer.transform(X_test)\n", "tf_idf_matrix_test = tfidf.transform(freq_term_matrix_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train a linear classifier\n", "Below a 
[Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) is trained. This is a linear classifier with a sigmoid (two classes) or softmax (more than two classes) activation function. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Use the tf-idf matrices as classifier input;\n", "# uncomment the last two lines to use the term-frequency matrices instead\n", "X_train=tf_idf_matrix_train\n", "X_test=tf_idf_matrix_test\n", "#X_train=freq_term_matrix_train\n", "#X_test=freq_term_matrix_test" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression()" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "logreg = LogisticRegression()\n", "logreg.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate trained model\n", "First, the trained model is applied to predict the class of the training-samples:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "y_pred_train = logreg.predict(X_train)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, ..., 0, 1, 1])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_train" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import classification_report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model's predictions are compared with the true classes of the training-samples. 
The classification-report contains the common metrics for evaluating classifiers:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       0.99      0.98      0.98     10387\n", "           1       0.98      0.99      0.98     10413\n", "\n", "    accuracy                           0.98     20800\n", "   macro avg       0.98      0.98      0.98     20800\n", "weighted avg       0.98      0.98      0.98     20800\n", "\n" ] } ], "source": [ "print(classification_report(y_train,y_pred_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classification report shows that the model is well fitted to the training data: it predicts the training data with an accuracy of 98%.\n", "\n", "However, accuracy on the training-data provides no information on the model's capability to classify new data. Therefore, the model's predictions on the test-dataset are calculated below:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "y_pred_test = logreg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       0.59      0.65      0.62      2339\n", "           1       0.69      0.63      0.66      2861\n", "\n", "    accuracy                           0.64      5200\n", "   macro avg       0.64      0.64      0.64      5200\n", "weighted avg       0.64      0.64      0.64      5200\n", "\n" ] } ], "source": [ "print(classification_report(y_test,y_pred_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model's accuracy on the test-data is weak (64%, compared to 98% on the training-data). The model is overfitted to the training-data. It seems that the distribution of the test-data is significantly different from the distribution of the training-data. \n", "\n", "The main drawback in this experiment is possibly the application of the BoW-model to represent texts. BoW disregards word-order and semantic relations between words. 
Word-embeddings combined with neural networks such as CNNs or LSTMs may perform much better." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>total</th><th>label</th></tr>\n",
" <tr><th>id</th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>house dem aide we didnt even see comeys letter...</td><td>1</td></tr>\n",
" <tr><th>1</th><td>flynn hillary clinton big woman campus breitba...</td><td>0</td></tr>\n",
" <tr><th>2</th><td>why truth might get you fired consortiumnewsco...</td><td>1</td></tr>\n",
" <tr><th>3</th><td>15 civilians killed in single us airstrike hav...</td><td>1</td></tr>\n",
" <tr><th>4</th><td>iranian woman jailed fictional unpublished sto...</td><td>1</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " total label\n", "id \n", "0 house dem aide we didnt even see comeys letter... 1\n", "1 flynn hillary clinton big woman campus breitba... 0\n", "2 why truth might get you fired consortiumnewsco... 1\n", "3 15 civilians killed in single us airstrike hav... 1\n", "4 iranian woman jailed fictional unpublished sto... 1" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.preprocessing import text" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "MAX_NB_WORDS=5000" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "tokenizer=text.Tokenizer(num_words=MAX_NB_WORDS)\n", "tokenizer.fit_on_texts(train[\"total\"])" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "trainSeq=tokenizer.texts_to_sequences(train[\"total\"])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "testSeq=tokenizer.texts_to_sequences(test[\"total\"])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5000" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer.num_words" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "textlenghtsTrain=[len(t) for t in trainSeq]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "textlenghtsTest=[len(t) for t in testSeq]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist(textlenghtsTrain,bins=20)\n", "plt.title(\"Distribution of text lengths in words\")\n", "plt.xlabel(\"number of words per document\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "textlenghtsTrain.sort(reverse=True)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[11420, 9852, 9250, 8712, 8365, 8189, 7403, 6643, 6321, 6295]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "textlenghtsTrain[:10]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "MAX_SEQUENCE_LENGTH=800\n", "EMBEDDING_DIM=100" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "from tensorflow.keras.utils import to_categorical" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "X_train = pad_sequences(trainSeq, maxlen=MAX_SEQUENCE_LENGTH)\n", "X_test = pad_sequences(testSeq, maxlen=MAX_SEQUENCE_LENGTH)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "y_train = to_categorical(np.asarray(train[\"label\"]))\n", "y_test = to_categorical(np.asarray(test[\"label\"]))" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.layers import Embedding, Dense, Input, Flatten, Conv1D, MaxPooling1D, Dropout, Concatenate, GlobalMaxPool1D\n", "from tensorflow.keras.models import Model" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "embedding_layer = Embedding(MAX_NB_WORDS,\n", " EMBEDDING_DIM,\n", " #weights=[embedding_matrix],\n", " 
input_length=MAX_SEQUENCE_LENGTH,\n", " trainable=True)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n", "embedded_sequences = embedding_layer(sequence_input)\n", "l_cov1= Conv1D(32, 5, activation='relu')(embedded_sequences)\n", "l_pool1 = MaxPooling1D(2)(l_cov1)\n", "l_cov2 = Conv1D(64, 3, activation='relu')(l_pool1)\n", "l_pool2 = MaxPooling1D(5)(l_cov2)\n", "l_flat = Flatten()(l_pool2)\n", "l_dense = Dense(64, activation='relu')(l_flat)\n", "preds = Dense(2, activation='softmax')(l_dense)\n", "model = Model(sequence_input, preds)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"model\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "input_1 (InputLayer) [(None, 800)] 0 \n", "_________________________________________________________________\n", "embedding (Embedding) (None, 800, 100) 500000 \n", "_________________________________________________________________\n", "conv1d (Conv1D) (None, 796, 32) 16032 \n", "_________________________________________________________________\n", "max_pooling1d (MaxPooling1D) (None, 398, 32) 0 \n", "_________________________________________________________________\n", "conv1d_1 (Conv1D) (None, 396, 64) 6208 \n", "_________________________________________________________________\n", "max_pooling1d_1 (MaxPooling1 (None, 79, 64) 0 \n", "_________________________________________________________________\n", "flatten (Flatten) (None, 5056) 0 \n", "_________________________________________________________________\n", "dense (Dense) (None, 64) 323648 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 2) 130 \n", 
"=================================================================\n", "Total params: 846,018\n", "Trainable params: 846,018\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "model.compile(loss='categorical_crossentropy',\n", " optimizer='rmsprop',\n", " metrics=['categorical_accuracy'])\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/6\n", "163/163 [==============================] - 27s 154ms/step - loss: 0.2283 - categorical_accuracy: 0.8963 - val_loss: 2.1035 - val_categorical_accuracy: 0.6313\n", "Epoch 2/6\n", "163/163 [==============================] - 25s 154ms/step - loss: 0.0552 - categorical_accuracy: 0.9823 - val_loss: 2.7110 - val_categorical_accuracy: 0.6369\n", "Epoch 3/6\n", "163/163 [==============================] - 25s 153ms/step - loss: 0.0316 - categorical_accuracy: 0.9903 - val_loss: 3.6822 - val_categorical_accuracy: 0.6377\n", "Epoch 4/6\n", "163/163 [==============================] - 25s 152ms/step - loss: 0.0180 - categorical_accuracy: 0.9951 - val_loss: 5.2826 - val_categorical_accuracy: 0.6352\n", "Epoch 5/6\n", "163/163 [==============================] - 25s 156ms/step - loss: 0.0084 - categorical_accuracy: 0.9975 - val_loss: 5.7444 - val_categorical_accuracy: 0.6310\n", "Epoch 6/6\n", "163/163 [==============================] - 26s 158ms/step - loss: 0.0054 - categorical_accuracy: 0.9983 - val_loss: 7.5507 - val_categorical_accuracy: 0.6423\n" ] } ], "source": [ "history=model.fit(X_train, y_train, validation_data=(X_test, y_test),epochs=6, verbose=True, batch_size=128)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6 (default, Sep 26 2022, 11:37:49) \n[Clang 14.0.0 (clang-1400.0.29.202)]" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }