{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "TextBlob Stemming and Lemmatization\n", "=====================================\n", "\n", "- Author: Johannes Maucher\n", "- Last update: 2021-10-26\n", "\n", "In this notebook the application of stemming and lemmatization is demonstrated. For this we apply the corresponding modules provided by the NLP package [**TextBlob**](https://textblob.readthedocs.io/en/dev/). Since stemming and lemmatization both require the segmentation of texts into lists of words, segmentation and other preprocessing functions of TextBlob are also shown. In the notebook [Regular Expressions](../01access/05RegularExpressions.ipynb) it has already been demonstrated how to implement segmentation in Python without additional packages. If you would like to go directly to word normalization, [click here](#Word-Normalization). \n", "\n", "[**TextBlob**](https://textblob.readthedocs.io/en/dev/) is a Python library for Natural Language Processing. It provides a simple API for, e.g.:\n", "* Noun phrase extraction\n", "* Part-of-speech tagging\n", "* Sentiment analysis\n", "* Classification\n", "* Language translation and detection powered by Google Translate\n", "* Tokenization (splitting text into words and sentences)\n", "* Word and phrase frequencies\n", "* Parsing\n", "* n-grams\n", "* Word inflection (pluralization and singularization) and lemmatization\n", "* Spelling correction\n", "* WordNet integration" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install textblob" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.15.3\n" ] } ], "source": [ "import textblob\n", "print(textblob.__version__)" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:31.517000Z", "start_time": "2017-10-24T13:39:30.314000Z" } }, "outputs": [], "source": [ "from textblob import TextBlob" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## TextBlob objects\n", "TextBlob objects are like Python strings, which have been enhanced with typical NLP processing methods. They are generated as follows:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:31.517000Z", "start_time": "2017-10-24T13:39:31.517000Z" } }, "outputs": [], "source": [ "myBlob1=TextBlob(\"\"\"TextBlob is a Python (2 and 3) library for processing textual data. \n", "It provides a simple API for diving into common natural language processing (NLP) tasks \n", "such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, \n", "translation, and more. New York is a nice city.\"\"\")\n", "myBlob2=TextBlob(u\"Das Haus steht im Tal auf der anderen Seite des Berges. Es gehört Dr. med. Brinkmann und wird von der Immo GmbH aus verwaltet. Im Haus befindet sich u.a. eine Physio-Praxis.\")" ] },
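{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, slicing, `upper()` and `find()` can be applied to a `TextBlob` just like to an ordinary Python string. The following cell is a small illustration of this string-like behaviour (it has not been executed in this notebook):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TextBlob objects support the familiar string operations:\n", "print(myBlob1[0:8])            # slicing works as on a str\n", "print(myBlob1.upper()[0:8])    # str methods like upper() are available\n", "print(myBlob1.find(\"Python\"))  # substring search, as with str.find()" ] },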
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Language Detection\n", "TextBlob's `detect_language()` method was powered by the Google Translate API. It has been deprecated and no longer works in current TextBlob releases, which is why the following calls are commented out." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:31.971000Z", "start_time": "2017-10-24T13:39:31.532000Z" } }, "outputs": [], "source": [ "#print(\"Blob1 text language is \",myBlob1.detect_language())\n", "#print(\"Blob2 text language is \",myBlob2.detect_language())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Sentences and Words\n", "The TextBlob class also integrates methods for tokenisation. The corresponding segmentation of the given text into sentences and words can be obtained as follows:\n", "\n", "**English text:**" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:32.049000Z", "start_time": "2017-10-24T13:39:31.971000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------\n", "TextBlob is a Python (2 and 3) library for processing textual data.\n", "--------------------\n", "It provides a simple API for diving into common natural language processing (NLP) tasks \n", "such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, \n", "translation, and more.\n", "--------------------\n", "New York is a nice city.\n" ] } ], "source": [ "for s in myBlob1.sentences:\n", "    print(\"-\"*20)\n", "    print(s)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:32.071000Z", "start_time": "2017-10-24T13:39:32.065000Z" } }, "outputs": [ { "data": { "text/plain": [ "WordList(['New', 'York', 'is', 'a', 'nice', 'city'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myBlob1.sentences[2].words" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the example above, multi-word expressions like *New York* are not detected as single tokens; they are split into their individual words. TextBlob's noun-phrase extraction can help here, as sketched in the next cell." ] },
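{ "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is a small sketch, which has not been executed here. With the default noun-phrase extractor, `noun_phrases` returns a lowercased `WordList`, which should contain *new york*:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Noun phrases are returned as a WordList of (lowercased) phrases;\n", "# the exact result depends on the configured noun-phrase extractor.\n", "myBlob1.sentences[2].noun_phrases" ] },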
{ "cell_type": "markdown", "metadata": {}, "source": [ "**German text:**" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------\n", "Das Haus steht im Tal auf der anderen Seite des Berges.\n", "--------------------\n", "Es gehört Dr. med.\n", "--------------------\n", "Brinkmann und wird von der Immo GmbH aus verwaltet.\n", "--------------------\n", "Im Haus befindet sich u.a.\n", "--------------------\n", "eine Physio-Praxis.\n" ] } ], "source": [ "for s in myBlob2.sentences:\n", "    print(\"-\"*20)\n", "    print(s)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the examples above, segmentation works fine for the English text, but fails for the German text, in particular because it contains abbreviations which are not used in English. However, the [**TextBlob-DE package**](https://pypi.python.org/pypi/textblob-de) provides German language support for *TextBlob*." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Part-of-Speech Tagging" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "[Meaning of POS Tags according to Penn Treebank II tagset](https://gist.github.com/nlothian/9240750)." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:40.447000Z", "start_time": "2017-10-24T13:39:40.060000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TextBlob NNP\n", "is VBZ\n", "a DT\n", "Python NNP\n", "2 CD\n", "and CC\n", "3 CD\n", "library NN\n", "for IN\n", "processing VBG\n", "textual JJ\n", "data NNS\n" ] } ], "source": [ "for word,pos in myBlob1.sentences[0].tags:\n", "    print(word, pos)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The default English tagger doesn't work for German:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:40.469000Z", "start_time": "2017-10-24T13:39:40.447000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Das NNP\n", "Haus NNP\n", "steht VBD\n", "im JJ\n", "Tal NNP\n", "auf NN\n", "der NN\n", "anderen NNS\n", "Seite NNP\n", "des NNS\n", "Berges NNP\n" ] } ], "source": [ "for word,pos in myBlob2.sentences[0].tags:\n", "    print(word, pos)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Word Normalization\n", "In NLP the term *word normalization* comprises methods which map different forms of a word to a unique form. Word normalization reduces complexity and often improves the performance of the NLP task. E.g. in document classification it usually does not matter in which tense a verb is written. Mapping all tenses to the base form reduces the number of words in the vocabulary and likely increases the accuracy of the classifier.\n", "\n", "### Singularization\n", "One form of word normalization is to map all plural words to the corresponding singular form. With *textblob* this can be realized as follows:" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The The\n", "cars car\n", "in in\n", "the the\n", "streets street\n", "around around\n", "Main Main\n", "Square Square\n", "have have\n", "been been\n", "observed observed\n", "by by\n", "policemen policeman\n" ] } ], "source": [ "myBlob6=TextBlob(\"The cars in the streets around Main Square have been observed by policemen\")\n", "for word in myBlob6.words:\n", "    print(word, word.singularize())" ] },
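{ "cell_type": "markdown", "metadata": {}, "source": [ "The inverse mapping is provided by `pluralize()`, which maps a singular form to the corresponding plural. A small sketch, not executed here:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textblob import Word\n", "# pluralize() is the counterpart of singularize()\n", "print(Word(\"car\").pluralize())     # expected: cars\n", "print(Word(\"city\").pluralize())    # expected: cities\n", "print(Word(\"street\").pluralize())  # expected: streets" ] },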
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization\n", "Lemmatization maps all distinct forms of a word to its base form. E.g. the word forms `went, gone, going` are all mapped to the base form `go`:" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "go\n", "go\n", "go\n" ] } ], "source": [ "print(TextBlob(\"went\").words[0].lemmatize(\"v\"))  # mode \"v\" for verb\n", "print(TextBlob(\"gone\").words[0].lemmatize(\"v\"))\n", "print(TextBlob(\"going\").words[0].lemmatize(\"v\"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Lemmatization of adjectives:" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "late\n", "pretty\n", "bad\n" ] } ], "source": [ "print(TextBlob(\"later\").words[0].lemmatize(\"a\"))  # mode \"a\" for adjective (\"r\" selects adverbs)\n", "print(TextBlob(\"prettier\").words[0].lemmatize(\"a\"))\n", "print(TextBlob(\"worse\").words[0].lemmatize(\"a\"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Lemmatization of nouns:" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "woman\n" ] } ], "source": [ "print(TextBlob(\"women\").words[0].lemmatize(\"n\"))  # mode \"n\" for noun" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:40.469000Z", "start_time": "2017-10-24T13:39:40.469000Z" } }, "outputs": [], "source": [ "myBlob3=TextBlob(\"The engineer went into the garden and found the cats lying beneath the trees\")" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:39:43.763000Z", "start_time": "2017-10-24T13:39:40.469000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The The\n", "engineer engineer\n", "went go\n", "into into\n", "the the\n", "garden garden\n", "and and\n", "found find\n", "the the\n", "cats cat\n", "lying lie\n", "beneath beneath\n", "the the\n", "trees tree\n" ] } ], "source": [ "for word in myBlob3.words:\n", "    print(word, word.lemmatize(\"v\"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Stemming\n", "The drawback of lemmatization is its complexity. Some languages have a quite regular structure. This means that the different word forms can be derived from the base form by a well-defined set of rules, and the inverse application of these rules can be used to determine the base form. However, languages such as German have many irregular cases, and lemmatization then requires a *dictionary* which lists all different word forms and their corresponding base form.\n", "\n", "Stemming is a simpler, less complex alternative to lemmatization. Stemmers map each word to its stem. In contrast to lemmatization, the result of stemming need not be a lexical entry (i.e. a valid word). Stemmers, e.g. the *Porter Stemmer*, apply heuristics for word suffixes and strip off the suffixes they find. E.g. in the word `engineer` a stemmer finds `er` as a frequent suffix. It strips off this suffix (together with the then word-final `e`) and outputs the found stem `engin`." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The the\n", "engineer engin\n", "went went\n", "into into\n", "the the\n", "garden garden\n", "and and\n", "found found\n", "the the\n", "cats cat\n", "lying lie\n", "beneath beneath\n", "the the\n", "trees tree\n" ] } ], "source": [ "for word in myBlob3.words:\n", "    print(word, word.stem())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Word correction\n", "Word correction can also be considered a type of normalization. It maps misspelled words to their most likely correct form." ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TextBlob(\"sentence is not true\")" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w=TextBlob(\"sentense is not tru\")\n", "wcorr=w.correct()\n", "wcorr" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## TextBlob-DE for German Language" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "#!pip install textblob-de" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:40:47.713000Z", "start_time": "2017-10-24T13:40:47.344000Z" } }, "outputs": [], "source": [ "import textblob_de" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:41:56.077000Z", "start_time": "2017-10-24T13:41:56.074000Z" } }, "outputs": [], "source": [ "myBlob4=textblob_de.TextBlobDE(u\"Das Haus steht im Tal auf der anderen Seite des Berges. Es gehört Dr. med. Brinkmann und wird von der Immo GmbH verwaltet. Im Haus befindet sich u.a. eine Physio-Praxis.\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Sentences and Words" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:42:04.246000Z", "start_time": "2017-10-24T13:42:04.246000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------\n", "Das Haus steht im Tal auf der anderen Seite des Berges.\n", "--------------------\n", "Es gehört Dr. med. Brinkmann und wird von der Immo GmbH verwaltet.\n", "--------------------\n", "Im Haus befindet sich u.a. eine Physio-Praxis.\n" ] } ], "source": [ "for s in myBlob4.sentences:\n", "    print(\"-\"*20)\n", "    print(s)" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WordList(['Es', 'gehört', 'Dr', 'med', 'Brinkmann', 'und', 'wird', 'von', 'der', 'Immo', 'GmbH', 'verwaltet'])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myBlob4.sentences[1].words" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WordList(['Dr. med. Brinkmann', 'Immo GmbH'])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myBlob4.sentences[1].noun_phrases" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Part-of-Speech Tagging" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "myBlob5=textblob_de.TextBlobDE(\"Er ist mit seinen Katzen über drei Tische gesprungen\")" ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:42:34.301000Z", "start_time": "2017-10-24T13:42:33.978000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Er PRP\n", "ist VBN\n", "mit IN\n", "seinen PRP$\n", "Katzen NNS\n", "über IN\n", "drei CD\n", "Tische JJ\n", "gesprungen VBN\n" ] } ], "source": [ "for word,tag in myBlob5.tags:\n", "    print(word, tag)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization" ] },
{ "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2017-10-24T13:43:57.683000Z", "start_time": "2017-10-24T13:43:57.602000Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "er\n", "sein\n", "mit\n", "seinen\n", "Katze\n", "über\n", "drei\n", "Tisch\n", "gesprungen\n" ] } ], "source": [ "for word in myBlob5.words.lemmatize():\n", "    print(word)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "toc_position": { "height": "664px", "left": "0px", "right": "1350.67px", "top": "125.333px", "width": "251px" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }