{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementation of Topic Extraction and Document Clustering\n", "\n", "- Author: Johannes Maucher\n", "- Last update: 14.12.2021\n", "\n", "This notebook demonstrates how [gensim](http://radimrehurek.com/gensim/) can be applied for *Latent Semantic Indexing (LSI)*. In LSI, a set of abstract topics (features) that are latent in a set of simple texts is calculated. The documents are then described and visualised with respect to these abstract features. The notebook is an adaptation of the corresponding [gensim LSI tutorial](http://radimrehurek.com/gensim/tut2.html). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collect and filter text documents\n", "A list of very small documents is defined. From the corresponding Bag-of-Words (BoW) representation, all stopwords and all words that appear only once in the whole collection are removed. The resulting cleaned BoW models of all documents are printed below. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#!pip install --upgrade gensim" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['human', 'interface', 'computer']\n", "['survey', 'user', 'computer', 'system', 'response', 'time']\n", "['eps', 'user', 'interface', 'system']\n", "['system', 'human', 'system', 'eps']\n", "['user', 'response', 'time']\n", "['trees']\n", "['graph', 'trees']\n", "['graph', 'minors', 'trees']\n", "['graph', 'minors', 'survey']\n" ] } ], "source": [ "from gensim import corpora, models, similarities\n", "\n", "documents = [\"Human machine interface for lab abc computer applications\",\n", " \"A survey of user opinion of computer system response time\",\n", " \"The EPS user interface management system\",\n", " \"System and human system engineering testing of EPS\",\n", " \"Relation of user perceived response time to error measurement\",\n", " \"The 
generation of random binary unordered trees\",\n", " \"The intersection graph of paths in trees\",\n", " \"Graph minors IV Widths of trees and well quasi ordering\",\n", " \"Graph minors A survey\"]\n", "# remove common words and tokenize\n", "stoplist = set('for a of the and to in'.split())\n", "texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]\n", "# remove words that appear only once in the whole collection\n", "all_tokens = []\n", "for t in texts:\n", "    for w in t:\n", "        all_tokens.append(w)\n", "tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)\n", "texts = [[word for word in text if word not in tokens_once]\n", "         for text in texts]\n", "for t in texts:\n", "    print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionaries and Corpora\n", "The words of the cleaned documents constitute a dictionary, which is persistently saved in the file *deerwester.dict*. The dictionary attribute *token2id* maps each word to its dictionary index." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}\n" ] } ], "source": [ "dictionary = corpora.Dictionary(texts)\n", "dictionary.save('../Data/deerwester.dict') # store the dictionary, for future reference\n", "print(dictionary.token2id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, a corpus is generated, which is a compact, sparse representation of the cleaned documents. In the corpus, each document is a list of pairs (word index, count), where the word index refers to the dictionary. The corpus is persistently saved to the file *deerwester.mm*."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "corpus = [dictionary.doc2bow(text) for text in texts]\n", "corpora.MmCorpus.serialize('../Data/deerwester.mm', corpus) # store to disk, for later use" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 1), (1, 1), (2, 1)]\n", "[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]\n", "[(2, 1), (5, 1), (7, 1), (8, 1)]\n", "[(1, 1), (5, 2), (8, 1)]\n", "[(3, 1), (6, 1), (7, 1)]\n", "[(9, 1)]\n", "[(9, 1), (10, 1)]\n", "[(9, 1), (10, 1), (11, 1)]\n", "[(4, 1), (10, 1), (11, 1)]\n" ] } ], "source": [ "for c in corpus:\n", "    print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code snippet demonstrates how the saved dictionary and corpus can be loaded again. Note that the counts are returned as floats after loading." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 1.0), (1, 1.0), (2, 1.0)]\n", "[(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)]\n", "[(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)]\n", "[(1, 1.0), (5, 2.0), (8, 1.0)]\n", "[(3, 1.0), (6, 1.0), (7, 1.0)]\n", "[(9, 1.0)]\n", "[(9, 1.0), (10, 1.0)]\n", "[(9, 1.0), (10, 1.0), (11, 1.0)]\n", "[(4, 1.0), (10, 1.0), (11, 1.0)]\n" ] } ], "source": [ "dictionary = corpora.Dictionary.load('../Data/deerwester.dict')\n", "corpus = corpora.MmCorpus('../Data/deerwester.mm')\n", "for c in corpus:\n", "    print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF-IDF Model of the corpus\n", "A tf-idf model is fitted to the corpus, and each corpus document is then represented by the vector of tf-idf values of its words."
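, "\n", "\n", "As a hedged sketch of what is computed: assuming gensim's default weighting (raw term count times a base-2 logarithmic idf, which is an assumption about the library defaults and not something configured in this notebook), the unnormalised weight of term $t$ in document $d$ is\n", "\n", "$$w_{t,d} = \\mathrm{tf}_{t,d} \\cdot \\log_2 \\frac{N}{\\mathrm{df}_t},$$\n", "\n", "where the symbols are notation introduced here for illustration: $\\mathrm{tf}_{t,d}$ is the count of $t$ in $d$, $N$ is the number of documents and $\\mathrm{df}_t$ is the number of documents containing $t$. For this corpus $N=9$. The word *system* occurs in $3$ documents and twice in the fourth document (*System and human system engineering testing of EPS*), so its weight there is $2 \\cdot \\log_2(9/3) \\approx 3.170$."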
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF Model without document-vector normalisation" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 2.1699250014423126), (1, 2.1699250014423126), (2, 2.1699250014423126)]\n", "[(0, 2.1699250014423126), (3, 2.1699250014423126), (4, 2.1699250014423126), (5, 1.5849625007211563), (6, 2.1699250014423126), (7, 1.5849625007211563)]\n", "[(2, 2.1699250014423126), (5, 1.5849625007211563), (7, 1.5849625007211563), (8, 2.1699250014423126)]\n", "[(1, 2.1699250014423126), (5, 3.1699250014423126), (8, 2.1699250014423126)]\n", "[(3, 2.1699250014423126), (6, 2.1699250014423126), (7, 1.5849625007211563)]\n", "[(9, 1.5849625007211563)]\n", "[(9, 1.5849625007211563), (10, 1.5849625007211563)]\n", "[(9, 1.5849625007211563), (10, 1.5849625007211563), (11, 2.1699250014423126)]\n", "[(4, 2.1699250014423126), (10, 1.5849625007211563), (11, 2.1699250014423126)]\n" ] } ], "source": [ "tfidf = models.TfidfModel(corpus, normalize=False) # generate a transformation object and fit it to the corpus documents\n", "corpus_tfidf = tfidf[corpus] # apply the transformation to all corpus documents\n", "for doc in corpus_tfidf:\n", "    print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A new document can be transformed into a tf-idf vector with the fitted model. 
The new document in this example consists of the following words:\n", "* *computer* (index 0), once\n", "* *human* (index 1), once\n", "* *system* (index 5), twice" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 2.1699250014423126), (1, 2.1699250014423126), (5, 3.1699250014423126)]\n" ] } ], "source": [ "newDoc = [(0, 1), (1, 1), (5, 2)] # (word index, count) pairs\n", "newTFIDF = tfidf[newDoc]\n", "print(newTFIDF)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that log2 is applied in the tf-idf calculation. The word *computer* occurs once in the new document and in 2 of the 9 corpus documents, so its weight should be $\\log_2(9/2)$:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.169925001442312" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.log2(9/2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF Model with document-vector normalisation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, it is better to normalise the document vectors such that each vector has a Euclidean length of $1$. With this normalisation, the magnitude of the obtained vectors is *independent* of document length."
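, "\n", "\n", "A small worked check, assuming normalisation means division by the Euclidean norm (the symbols $v$ and $\\hat{v}$ are introduced here only for illustration): a document vector $v$ is mapped to\n", "\n", "$$\\hat{v} = \\frac{v}{\\lVert v \\rVert_2}, \\qquad \\lVert v \\rVert_2 = \\sqrt{\\sum_t v_t^2}.$$\n", "\n", "The first document has three unnormalised weights of about $2.170$ each, so $\\lVert v \\rVert_2 \\approx 2.170 \\cdot \\sqrt{3}$ and every normalised weight becomes $1/\\sqrt{3} \\approx 0.577$."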
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]\n", "[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]\n", "[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]\n", "[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]\n", "[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]\n", "[(9, 1.0)]\n", "[(9, 0.7071067811865475), (10, 0.7071067811865475)]\n", "[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]\n", "[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]\n" ] } ], "source": [ "tfidf = models.TfidfModel(corpus,normalize=True) # generate a transformation object and fit it to the corpus documents\n", "corpus_tfidf = tfidf[corpus] # apply the transformation to all corpus documents\n", "for doc in corpus_tfidf:\n", " print(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LSI Model of the corpus\n", "A Latent Semantic Indexing (LSI) model is generated from the given documents. 
The number of topics to extract is set to two in this example:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.703*\"trees\" + 0.538*\"graph\" + 0.402*\"minors\" + 0.187*\"survey\" + 0.061*\"system\" + 0.060*\"time\" + 0.060*\"response\" + 0.058*\"user\" + 0.049*\"computer\" + 0.035*\"interface\"'),\n", " (1,\n", " '0.460*\"system\" + 0.373*\"user\" + 0.332*\"eps\" + 0.328*\"interface\" + 0.320*\"time\" + 0.320*\"response\" + 0.293*\"computer\" + 0.280*\"human\" + 0.171*\"survey\" + -0.161*\"trees\"')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation\n", "corpus_lsi = lsi[corpus_tfidf]\n", "lsi.print_topics(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown below, each document is now described in the new 2-dimensional space, with one coordinate per extracted topic."
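, "\n", "\n", "One way to read these coordinates, sketched under the assumption that gensim projects a document onto the topic vectors without further scaling (the symbols $U_k$, $x$ and $z$ are notation introduced here, not names from the library): if the columns of $U_k$ hold the term weights of the $k=2$ topics and $x$ is the normalised tf-idf vector of a document, its LSI representation is\n", "\n", "$$z = U_k^{\\top} x.$$\n", "\n", "Document 5 consists only of the word *trees* with normalised tf-idf weight $1.0$, so under this assumption its coordinates should equal the *trees* weights of the two topics printed above, about $0.703$ and $-0.161$."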
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document 0: \t [(0, 0.06600783396090117), (1, 0.5200703306361857)]\n", "Document 1: \t [(0, 0.19667592859142272), (1, 0.7609563167700056)]\n", "Document 2: \t [(0, 0.08992639972446137), (1, 0.7241860626752515)]\n", "Document 3: \t [(0, 0.07585847652177857), (1, 0.6320551586003436)]\n", "Document 4: \t [(0, 0.10150299184979987), (1, 0.5737308483002962)]\n", "Document 5: \t [(0, 0.7032108939378321), (1, -0.16115180214025504)]\n", "Document 6: \t [(0, 0.8774787673119843), (1, -0.1675890686465903)]\n", "Document 7: \t [(0, 0.9098624686818588), (1, -0.1408655362871861)]\n", "Document 8: \t [(0, 0.6165825350569285), (1, 0.05392907566389654)]\n" ] } ], "source": [ "x=[]\n", "y=[]\n", "i=0\n", "for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly\n", " print(\"Document %2d: \\t\"%i,doc)\n", " x.append(doc[0][1])\n", " y.append(doc[1][1])\n", " i+=1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documents can be plotted in the new 2-dimensional space. In this space the documents are clearly partitioned into 2 clusters, each representing one of the 2 topics." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "from matplotlib import pyplot as plt\n", "plt.figure(figsize=(12,10))\n", "plt.plot(x,y,'or')\n", "plt.title('documents in the new space')\n", "plt.xlabel('topic 1')\n", "plt.ylabel('topic 2')\n", "#plt.xlim([0,1.1])\n", "#plt.ylim([-0.9,0.3])\n", "s=0.02\n", "for i in range(len(x)):\n", " plt.text(x[i]+s,y[i]+s,\"doc \"+str(i))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LSI models can be saved to and loaded from files: " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "lsi.save('../Data/model.lsi') # same for tfidf, lda, ...\n", "lsi = models.LsiModel.load('../Data/model.lsi')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }