{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Regular expressions in Python \n", "\n", "- Author: Johannes Maucher\n", "- Last update: 2020-09-09\n", "\n", "This notebook demonstrates the application of regular expressions in Python. " ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions Operators and Characterclass Symbols\n", "|Operator|Behavior|\n", "|--- |--- |\n", "|.|Wildcard, matches any character|\n", "|^abc|Matches some pattern abc at the start of a string|\n", "|abc$|Matches some pattern abc at the end of a string|\n", "|[abc]|Matches one of a set of characters|\n", "|[A-Z0-9]|Matches one of a range of characters|\n", "|*|Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)|\n", "|+|One or more of previous item, e.g. a+, [a-z]+|\n", "|?|Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?|\n", "|{n}|Exactly n repeats where n is a non-negative integer|\n", "|{n,}|At least n repeats|\n", "|{,n}|No more than n repeats|\n", "|{m,n}|At least m and no more than n repeats|\n", "|a([aA])+|Parentheses that indicate the scope of the operators|\n", "\n", "In addition to the operators listed above the `|`- operator is also frequently used. It acts as a disjunction, for example\n", "`|ed|ing|s|` matches to all character sequences with either `ed`, `ing` or `s`. \n", "\n", "For frequently applied character classes, the following shortcut-symbols are defined:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "|Symbol|Function|\n", "|--- |--- |\n", "|\\d|Any decimal digit (equivalent to [0-9])|\n", "|\\D|Any non-digit character (equivalent to [^0-9])|\n", "|\\s|Any whitespace character (equivalent to [ \\t\\n\\r\\f\\v])|\n", "|\\S|Any non-whitespace character (equivalent to [^ \\t\\n\\r\\f\\v])|\n", "|\\w|Any alphanumeric character (equivalent to [a-zA-Z0-9_])|\n", "|\\W|Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])|\n", "|\\b|Boundary between word and non-word|\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Definition of patterns\n", "Regular expressions are patterns of character sequences. In Python such patterns must be defined as **raw strings**. A raw string is a character sequence, which is not interpreted. A string is defined as raw-string, by the prefix `r`. For example\n", "\n", "`mypattern = r\"[0-9]+\\s\"`\n", "\n", "is a raw string pattern, which matches to all sequences which consist of one ore more decimal numbers, followed by a white-space character." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Matching and Searching\n", "The two main application categories of regular expressions are matching and searching: \n", "\n", "* In matching applications the pattern represents a syntactic rule and an arbitrary character sequence (text) is parsed, if it is consistent to this syntax. For example in an web-interface, where users can enter their date of birth a regular expression can be applied to check if the user entered the date in an acceptable format.\n", "* In search applications the pattern defines a type of character-sequence, which is searched for in an arbitrary long text. 
\n", "\n", "The most important methods of the Python regular expression package `re` are:\n", "\n", "* `re.findall(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. All found non-overlapping matches are returned as a list of strings.\n", "* `re.split(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. At all found patterns the text is split. A list of all splits is returned.\n", "* `re.search(pattern,text)`: Searches in the string-variable `text`- for all matches with `pattern`. The first found match is returned as a *match-object*. The return value is `None`, if no matches are found.\n", "* `re.match(pattern,text)`: Checks if the first characters of the string-variable `text` match to the pattern. If this is the case a *match-object* is returned, otherwise `None`.\n", "* `re.sub(pattern,replacement,text)`: Searches for all matches of pattern in text and replaces this matches by the string `replacement`.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:30:40.519000Z", "start_time": "2017-11-12T21:30:15.769000+01:00" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import nltk\n", "import re" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:51:00.504000Z", "start_time": "2017-11-12T21:51:00.485000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dummytext1=\"His email-address is foo.bar@bar-foo.com but he doesn't check his emails frequently\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Assume, that we have texts, that contain email-addresses, like the dummy-text above. The task is to find all email addresses in the text. For this we can define a pattern, which matches to syntactically correct email-addresses and pass this pattern to the `findall(pattern,text)`-method of the `re`-package:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:51:02.141000Z", "start_time": "2017-11-12T21:51:02.132000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['foo.bar@bar-foo.com']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mailpattern = r'[\\w_\\.-]+@[\\w_\\.-]+\\.\\w+'\n", "re.findall(mailpattern, dummytext1)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:31:11.422000Z", "start_time": "2017-11-12T21:31:11.422000+01:00" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "dummytext2 = \"This is just a dummy test, which is applied for demonstrating regular expressions in Python. 
The current date is 2017-10-17.\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Find first word, which begins with character *d*:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:32:03.121000Z", "start_time": "2017-11-12T21:32:03.121000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " dummy \n" ] } ], "source": [ "pattern1 = r\"\\s[d]\\S*\\s\"\n", "search_result = re.search(pattern1, dummytext2)\n", "if search_result:\n", " print(search_result.group())\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Find all words, which begin with character *d*:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:32:29.034000Z", "start_time": "2017-11-12T21:32:29.018000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[' dummy ', ' demonstrating ', ' date ']\n" ] } ], "source": [ "search_result = re.findall(pattern1, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Find all words, which begin with character *d*. Return only the words, not the whitespaces around them:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:01.773000Z", "start_time": "2017-11-12T21:33:01.773000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['dummy', 'demonstrating', 'date']\n" ] } ], "source": [ "pattern2 = r\"\\s([d]\\S*)\\s\"\n", "search_result = re.findall(pattern2, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Same result as above:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:09.923000Z", "start_time": "2017-11-12T21:33:09.908000+01:00" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['dummy', 'demonstrating', 'date']\n" ] } ], "source": [ "pattern3 = r\"\\b[d]\\S*\\b\"\n", "search_result = re.findall(pattern3, dummytext2)\n", "if search_result:\n", " print(search_result)\n", "else:\n", " print(\"Pattern not in Text\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Replace substrings:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:33:20.317000Z", "start_time": "2017-11-12T21:33:20.301000+01:00" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Dear Mrs. Keane, we like to invit you to our summer school'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "templateText=\"Dear Mrs. 
"templateText=\"Dear Mrs. <name>, we like to invit you to our <event>\"\n", "name=\"Keane\"\n", "event=\"summer school\"\n", "re.sub(r\"<event>\", event, re.sub(r\"<name>\", name, templateText))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into words using regular expressions" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:36:40.239000Z", "start_time": "2017-11-12T21:36:40.217000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anzahl der unterschiedlichen Tokens: 40\n", "\n", "1\n", "Allianz-Arena\n", "Bayern\n", "Er\n", "Es\n", "FC\n", "Ich\n", "Köln\n", "Oder\n", "Spiel\n", "Thomas\n", "an\n", "bereit\n", "bereits\n", "bin\n", "das\n", "den\n", "der\n", "fußballspielen\n", "gefragt\n", "gegen\n", "habe\n", "ich\n", "ihr\n", "in\n", "jederzeit\n", "könnten\n", "lieber\n", "mal\n", "meint\n", "meinte\n", "mir\n", "schau\n", "schön\n", "was\n", "wenn\n", "wieder\n", "wir\n", "wäre\n" ] } ], "source": [ "import nltk\n", "import re\n", "\n", "text = \"\"\"Es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "Oder was meint ihr? Ich bin jederzeit bereit. Thomas habe ich bereits gefragt.\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln\n", "in der Allianz-Arena an.\" \"\"\"\n", "\n", "#cleaned_tokens=re.findall(r\"[\\wäüöÄÜÖß]+\",text) # simple solution that works\n", "cleaned_tokens = re.split(r\"[\\s.,;:()!?\\\"]+\", text)\n", "#cleaned_tokens = text.split()\n", "#cleaned_tokens = re.findall(r\"[^\\s.,;:()!?\\\"]+\", text) # yields the same result as the split above, but without the empty string\n", "#cleaned_tokens = re.findall(r\"\\w+[.]\\w+|[^\\s.,;:()!?\\\"]+\", text) # this variant also allows a period inside a word\n", "\n", "ts = {token.strip(\"\\\".,?:\") for token in cleaned_tokens}\n", "print(f\"Anzahl der unterschiedlichen Tokens: {len(ts)}\")\n", "for t in sorted(ts):\n", " print(t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into words using NLTK" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Es\n", "wäre\n", "schön\n", "wenn\n", "wir\n", "mal\n", "wieder\n", "fußballspielen\n", "könnten\n", "Oder\n", "was\n", "meint\n", "ihr\n", "Ich\n", "bin\n", "jederzeit\n", "bereit\n", "Thomas\n", "habe\n", "ich\n", "bereits\n", "gefragt\n", "Er\n", "meinte\n", "Ich\n", "schau\n", "mir\n", "lieber\n", "das\n", "Spiel\n", "der\n", "Bayern\n", "gegen\n", "den\n", "1\n", "FC\n", "Köln\n", "in\n", "der\n", "Allianz\n", "Arena\n", "an\n", "Anzahl der Tokens: 42\n" ] } ], "source": [ "cleaned_tokens = nltk.regexp_tokenize(text, r\"[\\wäöüÄÖÜß]+\")\n", "for token in cleaned_tokens:\n", " print(token)\n", "print(f\"Anzahl der Tokens: {len(cleaned_tokens)}\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:36:53.898000Z", "start_time": "2017-11-12T21:36:53.866000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Es\n", "wäre\n", "schön\n", ",\n", "wenn\n", "wir\n", "mal\n", "wieder\n", "fußballspielen\n", "könnten\n", ".\n", "Oder\n", "was\n", "meint\n", "ihr\n", "?\n", "Ich\n", "bin\n", "jederzeit\n", "bereit\n", ".\n", "Thomas\n", "habe\n", "ich\n", "bereits\n", "gefragt\n", ".\n", "Er\n", "meinte\n", ":\n", "``\n", "Ich\n", "schau\n", "mir\n", "lieber\n", "das\n", "Spiel\n", "der\n", "Bayern\n", "gegen\n", "den\n", "1.FC\n", "Köln\n", "in\n", "der\n",
"Allianz-Arena\n", "an\n", ".\n", "''\n", "Anzahl der Tokens: 49\n" ] } ], "source": [ "ntokens = nltk.word_tokenize(text)\n", "for a in ntokens:\n", " print(a)\n", "print(f\"Anzahl der Tokens: {len(ntokens)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into sentences using regular expressions" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:37:00.281000Z", "start_time": "2017-11-12T21:37:00.266000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "Es wäre schön, wenn wir mal wieder fußballspielen könnten\n", "----------\n", "Oder was meint ihr\n", "----------\n", "Ich bin jederzeit bereit\n", "----------\n", "Thomas habe ich bereits gefragt\n", "----------\n", "Er meinte\n", "----------\n", "\"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln\n", "in der Allianz-Arena an.\" \n", "----------\n", "Anzahl der Tokens: 6\n" ] } ], "source": [ "sentences = re.split(r\"[?!.:]+\\s\",text)\n", "for sentence in sentences:\n", " print('-'*10)\n", " print(sentence)\n", "\n", "print('-'*10)\n", "print(f\"Anzahl der Tokens: {len(sentences)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Segmentation into sentences using NLTK" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2017-11-12T20:37:23.338000Z", "start_time": "2017-11-12T21:37:23.102000+01:00" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "Dr. med. R. Steiner meint es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "----------\n", "Oder was meint ihr?\n", "----------\n", "Ich bin jederzeit bereit.\n", "----------\n", "Thomas usw. habe ich bereits gefragt.\n", "----------\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln \n", "in der Allianz-Arena an.\"\n", "----------\n", "Es schauen 7 Mio. zu, inkl. mir selbst.\n", "----------\n", "U.a. soll auch Messi etc. dabei sein.\"\n", "----------\n", "Anzahl der Tokens: 7\n" ] } ], "source": [ "text2 = \"\"\"Dr. med. R. Steiner meint es wäre schön, wenn wir mal wieder fußballspielen könnten.\n", "Oder was meint ihr? Ich bin jederzeit bereit. Thomas usw. habe ich bereits gefragt.\n", "Er meinte: \"Ich schau mir lieber das Spiel der Bayern gegen den 1.FC Köln \n", "in der Allianz-Arena an.\\\" Es schauen 7 Mio. zu, inkl. mir selbst. U.a. soll auch Messi etc. 
dabei sein.\\\"\"\"\"\n", "\n", "sent_tokenizer=nltk.data.load('tokenizers/punkt/german.pickle')\n", "sents = sent_tokenizer.tokenize(text2)\n", "for a in sents:\n", " print('-'*10)\n", " print(a)\n", "print('-'*10)\n", "print(\"Anzahl der Tokens: \",len(sents))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false }, "vscode": { "interpreter": { "hash": "063584b925b3f1849a1b99bbef08a469ec6afc8a729289ad5adba9013f2d188e" } } }, "nbformat": 4, "nbformat_minor": 4 }