{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regular expressions in Python \n", "\n", "- Author: Johannes Maucher\n", "- Last update: 2020-09-09\n", "\n", "This notebook demonstrates the application of regular expressions in Python. " ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions Operators and Characterclass Symbols\n", "|Operator|Behavior|\n", "|--- |--- |\n", "|.|Wildcard, matches any character|\n", "|^abc|Matches some pattern abc at the start of a string|\n", "|abc$|Matches some pattern abc at the end of a string|\n", "|[abc]|Matches one of a set of characters|\n", "|[A-Z0-9]|Matches one of a range of characters|\n", "|*|Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)|\n", "|+|One or more of previous item, e.g. a+, [a-z]+|\n", "|?|Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?|\n", "|{n}|Exactly n repeats where n is a non-negative integer|\n", "|{n,}|At least n repeats|\n", "|{,n}|No more than n repeats|\n", "|{m,n}|At least m and no more than n repeats|\n", "|a([aA])+|Parentheses that indicate the scope of the operators|\n", "\n", "In addition to the operators listed above the `|`- operator is also frequently used. It acts as a disjunction, for example\n", "`|ed|ing|s|` matches to all character sequences with either `ed`, `ing` or `s`. \n", "\n", "For frequently applied character classes, the following shortcut-symbols are defined:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "|Symbol|Function|\n", "|--- |--- |\n", "|\\d|Any decimal digit (equivalent to [0-9])|\n", "|\\D|Any non-digit character (equivalent to [^0-9])|\n", "|\\s|Any whitespace character (equivalent to [ \\t\\n\\r\\f\\v])|\n", "|\\S|Any non-whitespace character (equivalent to [^ \\t\\n\\r\\f\\v])|\n", "|\\w|Any alphanumeric character (equivalent to [a-zA-Z0-9_])|\n", "|\\W|Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])|\n", "|\\b|Boundary between word and non-word|\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Definition of patterns\n", "Regular expressions are patterns of character sequences. In Python such patterns must be defined as **raw strings**. A raw string is a character sequence, which is not interpreted. A string is defined as raw-string, by the prefix `r`. For example\n", "\n", "`mypattern = r\"[0-9]+\\s\"`\n", "\n", "is a raw string pattern, which matches to all sequences which consist of one ore more decimal numbers, followed by a white-space character." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Matching and Searching\n", "The two main application categories of regular expressions are matching and searching: \n", "\n", "* In matching applications the pattern represents a syntactic rule and an arbitrary character sequence (text) is parsed, if it is consistent to this syntax. For example in an web-interface, where users can enter their date of birth a regular expression can be applied to check if the user entered the date in an acceptable format.\n", "* In search applications the pattern defines a type of character-sequence, which is searched for in an arbitrary long text. 
\n", "\n", "The most important methods of the Python regular expression package `re` are:\n", "\n", "* `re.findall(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. All found non-overlapping matches are returned as a list of strings.\n", "* `re.split(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. At all found patterns the text is split. A list of all splits is returned.\n", "* `re.search(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. The first found match is returned as a *match-object*. The return value is `None`, if no matches are found.\n", "* `re.match(pattern,text)`: Checks if the first characters of the string-variable `text` match to the pattern. If this is the case a *match-object* is returned, otherwise `None`.\n", "* `re.sub(pattern,replacement,text)`: Searches for all matches of pattern in text and replaces this matches by the string `replacement`.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:30:40.519000Z", "start_time": "2017-11-12T21:30:15.769000+01:00" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import nltk\n", "import re" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:51:00.504000Z", "start_time": "2017-11-12T21:51:00.485000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dummytext1=\"His email-address is foo.bar@bar-foo.com but he doesn't check his emails frequently\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Assume, that we have texts, that contain email-addresses, like the dummy-text above. The task is to find all email addresses in the text. For this we can define a pattern, which matches to syntactically correct email-addresses and pass this pattern to the `findall(pattern,text)`-method of the `re`-package:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:51:02.141000Z", "start_time": "2017-11-12T21:51:02.132000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['foo.bar@bar-foo.com']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mailpattern = r'[\\w_\\.-]+@[\\w_\\.-]+\\.\\w+'\n", "re.findall(mailpattern, dummytext1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:31:11.422000Z", "start_time": "2017-11-12T21:31:11.422000+01:00" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dummytext2 = \"This is just a dummy test, which is applied for demonstrating regular expressions in Python. 
The current date is 2017-10-17.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Find first word, which begins with character *d*:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:32:03.121000Z", "start_time": "2017-11-12T21:32:03.121000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " dummy \n" ] } ], "source": [ "pattern1 = r\"\\s[d]\\S*\\s\"\n", "search_result = re.search(pattern1, dummytext2)\n", "if search_result:\n", " print(search_result.group())\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Find all words, which begin with character *d*:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:32:29.034000Z", "start_time": "2017-11-12T21:32:29.018000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[' dummy ', ' demonstrating ', ' date ']\n" ] } ], "source": [ "search_result = re.findall(pattern1, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Find all words, which begin with character *d*. Return only the words, not the whitespaces around them:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:01.773000Z", "start_time": "2017-11-12T21:33:01.773000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['dummy', 'demonstrating', 'date']\n" ] } ], "source": [ "pattern2 = r\"\\s([d]\\S*)\\s\"\n", "search_result = re.findall(pattern2, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Same result as above:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:09.923000Z", "start_time": "2017-11-12T21:33:09.908000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['dummy', 'demonstrating', 'date']\n" ] } ], "source": [ "pattern3 = r\"\\b[d]\\S*\\b\"\n", "search_result = re.findall(pattern3, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Replace substrings:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:20.317000Z", "start_time": "2017-11-12T21:33:20.301000+01:00" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Dear Mrs. Keane, we like to invit you to our summer school'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "templateText=\"Dear Mrs. 
"templateText=\"Dear Mrs. <name>, we like to invit you to our <event>\"\n", "name=\"Keane\"\n", "event=\"summer school\"\n", "re.sub(r\"<event>\", event, re.sub(r\"<name>\", name, templateText))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into words using regular expressions" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:36:40.239000Z", "start_time": "2017-11-12T21:36:40.217000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anzahl der unterschiedlichen Tokens: 40\n", "\n", "1\n", "Allianz-Arena\n", "Bayern\n", "Er\n", "Es\n", "FC\n", "Ich\n", "Köln\n", "Oder\n", "Spiel\n", "Thomas\n", "an\n", "bereit\n", "bereits\n", "bin\n", "das\n", "den\n", "der\n", "fußballspielen\n", "gefragt\n", "gegen\n", "habe\n", "ich\n", "ihr\n", "in\n", "jederzeit\n", "könnten\n", "lieber\n", "mal\n", "meint\n", "meinte\n", "mir\n", "schau\n", "schön\n", "was\n", "wenn\n", "wieder\n", "wir\n", "wäre\n" ] } ], "source": [ "import nltk\n", "import re\n", "\n", "text = \"\"\"Es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "Oder was meint ihr? Ich bin jederzeit bereit. Thomas habe ich bereits gefragt.\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln\n", "in der Allianz-Arena an.\" \"\"\"\n", "\n", "#cleaned_tokens=re.findall(r\"[\\wäüöÄÜÖß]+\",text) # simple solution that works\n", "cleaned_tokens = re.split(r\"[\\s.,;:()!?\\\"]+\", text)\n", "#cleaned_tokens = text.split()\n", "#cleaned_tokens = re.findall(r\"[^\\s.,;:()!?\\\"]+\", text) # yields the same result as the split above, but without the empty string\n", "#cleaned_tokens = re.findall(r\"\\w+[.]\\w+|[^\\s.,;:()!?\\\"]+\", text) # this variant also allows a period inside a word\n", "\n", "ts = {token.strip(\"\\\".,?:\") for token in cleaned_tokens}\n", "print(f\"Anzahl der unterschiedlichen Tokens: {len(ts)}\")\n", "for t in sorted(ts):\n", " print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into words using NLTK" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Es\n", "wäre\n", "schön\n", "wenn\n", "wir\n", "mal\n", "wieder\n", "fußballspielen\n", "könnten\n", "Oder\n", "was\n", "meint\n", "ihr\n", "Ich\n", "bin\n", "jederzeit\n", "bereit\n", "Thomas\n", "habe\n", "ich\n", "bereits\n", "gefragt\n", "Er\n", "meinte\n", "Ich\n", "schau\n", "mir\n", "lieber\n", "das\n", "Spiel\n", "der\n", "Bayern\n", "gegen\n", "den\n", "1\n", "FC\n", "Köln\n", "in\n", "der\n", "Allianz\n", "Arena\n", "an\n", "Anzahl der Tokens: 42\n" ] } ], "source": [ "cleaned_tokens = nltk.regexp_tokenize(text, r\"[\\wäöüÄÖÜß]+\")\n", "for token in cleaned_tokens:\n", " print(token)\n", "print(f\"Anzahl der Tokens: {len(cleaned_tokens)}\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:36:53.898000Z", "start_time": "2017-11-12T21:36:53.866000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Es\n", "wäre\n", "schön\n", ",\n", "wenn\n", "wir\n", "mal\n", "wieder\n", "fußballspielen\n", "könnten\n", ".\n", "Oder\n", "was\n", "meint\n", "ihr\n", "?\n", "Ich\n", "bin\n", "jederzeit\n", "bereit\n", ".\n", "Thomas\n", "habe\n", "ich\n", "bereits\n", "gefragt\n", ".\n", "Er\n", "meinte\n", ":\n", "``\n", "Ich\n", "schau\n", "mir\n", "lieber\n", "das\n", "Spiel\n", "der\n", "Bayern\n", "gegen\n", "den\n", "1.FC\n", "Köln\n", "in\n", "der\n",
"Allianz-Arena\n", "an\n", ".\n", "''\n", "Anzahl der Tokens: 49\n" ] } ], "source": [ "ntokens = nltk.word_tokenize(text)\n", "for a in ntokens:\n", " print(a)\n", "print(f\"Anzahl der Tokens: {len(ntokens)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into sentences using regular expressions" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:37:00.281000Z", "start_time": "2017-11-12T21:37:00.266000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "Es wäre schön, wenn wir mal wieder fußballspielen könnten\n", "----------\n", "Oder was meint ihr\n", "----------\n", "Ich bin jederzeit bereit\n", "----------\n", "Thomas habe ich bereits gefragt\n", "----------\n", "Er meinte\n", "----------\n", "\"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln\n", "in der Allianz-Arena an.\" \n", "----------\n", "Anzahl der Tokens: 6\n" ] } ], "source": [ "sentences = re.split(r\"[?!.:]+\\s\",text)\n", "for sentence in sentences:\n", " print('-'*10)\n", " print(sentence)\n", "\n", "print('-'*10)\n", "print(f\"Anzahl der Tokens: {len(sentences)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into sentences using NLTK" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:37:23.338000Z", "start_time": "2017-11-12T21:37:23.102000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "Dr. med. R. Steiner meint es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "----------\n", "Oder was meint ihr?\n", "----------\n", "Ich bin jederzeit bereit.\n", "----------\n", "Thomas usw. habe ich bereits gefragt.\n", "----------\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln \n", "in der Allianz-Arena an.\"\n", "----------\n", "Es schauen 7 Mio. zu, inkl. mir selbst.\n", "----------\n", "U.a. soll auch Messi etc. dabei sein.\"\n", "----------\n", "Anzahl der Tokens: 7\n" ] } ], "source": [ "text2 = \"\"\"Dr. med. R. Steiner meint es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "Oder was meint ihr? Ich bin jederzeit bereit. Thomas usw. habe ich bereits gefragt.\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln \n", "in der Allianz-Arena an.\\\" Es schauen 7 Mio. zu, inkl. mir selbst. U.a. soll auch Messi etc. 
dabei sein.\\\"\"\"\"\n", "\n", "sent_tokenizer=nltk.data.load('tokenizers/punkt/german.pickle')\n", "sents = sent_tokenizer.tokenize(text2)\n", "for a in sents:\n", " print('-'*10)\n", " print(a)\n", "print('-'*10)\n", "print(\"Anzahl der Tokens: \",len(sents))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false }, "vscode": { "interpreter": { "hash": "063584b925b3f1849a1b99bbef08a469ec6afc8a729289ad5adba9013f2d188e" } } }, "nbformat": 4, "nbformat_minor": 4 }