{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chunking and Tokenisation\n", "\n", "After accessing and possible cleaning documents, they usually have to be segmented into smaller pieces. Segmentation of documents into smaller pieces like chapters, sections, pages or even sentences is called **chunking**. Segmentation of chunks into words, subwords or even characters is called **tokenisation**.\n", "\n", "Simple tokenisation and chunking techniques have already been applied in subsections [Access and Analyse Content of Text Files](01AccessTextFromFile.ipynb) and [Regular Expressions in Python](05RegularExpressions.ipynb) by applying e.g. the `split()`-method of the Python class `String` or by applying regular expressions. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "
Documents are segmented in smaller pieces, called chunks and chunks are segmented in tokens, which can either be words, sub-words or even characters
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown in the image above on a high-level one can distinguish\n", "* chunking methods in such which return \n", " * chunks of fixed size \n", " * semantically related chunks, e.g. sentences or subsections \n", "* tokenisation methods, which segment text-chunks into\n", " * single characters\n", " * words\n", " * sub-words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chunking\n", "Chunking and tokenisation are relevant subtasks for many NLP applications. In particular for RAG systems the overall performance of the RAG system strongly depends on the chunking and tokenisation configuration:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Chunks in the context of Retrieval Augemented Generation (RAG)\n", "In [Retrieval Augmented Generation (RAG)](../09LLMapplications/09llmapplications), information, which is considered to be relevant for answering a query, is passed together with the query as context to the LLM. Since the context-length is limited, e.g. in [Llama-3.2-1b](https://huggingface.co/meta-llama/Llama-3.2-1B) the context can contain 128k tokens, the relevant chunks must have a limited size. Hence, in RAG systems the chunking and the length of the chunks must be adapted to the maximum context length of the applied LLM. For example if the query is assumed to contain at most 192 tokens, and the most relevant 4 chunks shall be passed together with the query to the LLM, then the maximum chunk-length is 2000 (if no further tokens are required, e.g. for the system prompt). Note that the amount of text, that is contained in the maximum amount of tokens (8192)\n", " strongly depends on the tokenisation method: relatively small text in the case of character-level encoding, much more in the case of word-level encoding.S```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample text\n", "As already shown in [Data Access with Langchain](07langchainDataAccess.ipynb) we apply the `TextLoader` to load sample text, which will be chunked in different ways in the sections below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import TextLoader\n", "\n", "loader = TextLoader(\"../Data/ZeitOnlineLandkartenA.txt\",encoding=\"utf-8\")\n", "text=loader.load()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sourcetext=text[0].page_content" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Landkarten mit Mehrwert\n", "Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. \n", "\n", "ZEIT ONLINE stellt einige der interessantesten Dienste vor.\n", "\n", "Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. 
Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones werden soziale \n", "Kartendienste immer beliebter. Auch sie nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige \n", "Projekte von den Möglichkeiten des frei zugänglichen Kartenmaterials. Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue \n", "Möglichkeiten für Bürger herauszuarbeiten. So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.\n", "Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main. Hier geht es um weit mehr als Reparaturen. Die Seite soll \n", "einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort. Tausende dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der interessantesten vor. Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.\n" ] } ], "source": [ "print(sourcetext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chunking methods\n", "\n", "There exist many Python libraries which provide methods for chunking. In this section, only the most prominent chunkers from the [Langchain text splitters module](https://api.python.langchain.com/en/latest/text_splitters_api_reference.html#module-langchain_text_splitters.character) are demonstrated.\n", "\n", "#### CharacterTextSplitter\n", "The [Langchain CharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.CharacterTextSplitter.html#langchain_text_splitters.character.CharacterTextSplitter) lets you define one character, which is applied as `separator`. The splitter first splits the text at all positions where the defined character occurs and then merges neighbouring splits, as long as the length of the resulting chunk (= number of characters) does not exceed the specified `chunk_size`. \n", "\n", "In the example below the character-sequence `\\n\\n` is configured to be the separator. I.e. the input text is split at the end of each text section (newline followed by an empty line). However, two neighbouring splits remain separate chunks only if `len(split1) + len(split2) > chunk_size`.\n", "\n", "In the example below, we obtain 3 chunks with the configured `chunk_size = 100`, but only 2 chunks with `chunk_size = 200`."
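, "\n", "The basic split-and-merge idea can be sketched in a few lines of plain Python. The following is only a simplified sketch (it ignores `chunk_overlap`, and the helper name `simple_character_split` is made up for illustration), not the actual Langchain implementation:\n", "\n", "```python\n", "def simple_character_split(text, separator=\"\\n\\n\", chunk_size=100):\n", "    # 1. split the text at every occurrence of the separator\n", "    splits = text.split(separator)\n", "    # 2. greedily merge neighbouring splits, as long as the merged\n", "    #    chunk does not exceed chunk_size\n", "    chunks, current = [], \"\"\n", "    for s in splits:\n", "        candidate = current + separator + s if current else s\n", "        if len(candidate) <= chunk_size:\n", "            current = candidate\n", "        else:\n", "            if current:\n", "                chunks.append(current)\n", "            current = s  # a single split may itself be longer than chunk_size\n", "    if current:\n", "        chunks.append(current)\n", "    return chunks\n", "```"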
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, NLTKTextSplitter" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "text_splitter = CharacterTextSplitter(\n", " separator = \"\\n\\n\",\n", " is_separator_regex = False,\n", " length_function = len,\n", " chunk_size = 100,\n", " chunk_overlap = 20\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After creating the `CharacterTextSplitter` instance, we can either invoke its method `create_documents()`, which returns the chunks as Langchain document objects, or the method `split_text()`, which returns the chunks as a list of strings." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Created a chunk of size 135, which is longer than the specified 100\n" ] } ], "source": [ "#chunks = text_splitter.create_documents([sourcetext])\n", "chunks = text_splitter.split_text(sourcetext)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `printChunks()` function is defined for printing the resulting chunks:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def printChunks(chunks,B=None):\n", " if not B:\n", " B=len(chunks)\n", " for i in range(B):\n", " print(\"-\"*30,\"\\nChunk %d with %d tokens.\"%(i,len(chunks[i])))\n", " print(chunks[i])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 134 tokens.\n", "Landkarten mit Mehrwert\n", "Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen.\n", "------------------------------ \n", "Chunk 1 with 59 tokens.\n", "ZEIT ONLINE stellt einige der interessantesten Dienste vor.\n", "------------------------------ \n", "Chunk 2 with 1829 tokens.\n", "Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen. Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones werden soziale \n", "Kartendienste immer beliebter. Auch sie nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige \n", "Projekte von den Möglichkeiten des frei zugänglichen Kartenmaterials. Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue \n", "Möglichkeiten für Bürger herauszuarbeiten. 
So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.\n", "Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main. Hier geht es um weit mehr als Reparaturen. Die Seite soll \n", "einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort. Tausende dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der interessantesten vor. Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the output above, each text-section, which is terminated by an empty line is returned as a single chunk.\n", "\n", "If we like to have chunks, which all have approximately the same length but do not end within a word, the `CharacterTextSplitter` can be configured as follows:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "text_splitter = CharacterTextSplitter(\n", " separator = \" \",\n", " length_function = len,\n", " chunk_size = 100,\n", " chunk_overlap = 20\n", ")\n", "#chunks = text_splitter.create_documents([sourcetext])\n", "chunks = text_splitter.split_text(sourcetext)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunks)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 92 tokens.\n", "Landkarten mit Mehrwert\n", "Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale\n", "------------------------------ \n", "Chunk 1 with 99 tokens.\n", "Digitale Landkarten lassen sich vielseitig nutzen. \n", "\n", "ZEIT ONLINE stellt einige der interessantesten\n", "------------------------------ \n", "Chunk 2 with 93 tokens.\n", "der interessantesten Dienste vor.\n", "\n", "Die Zeit, in der Landkarten im Netz bloß der Routenplanung\n", "------------------------------ \n", "Chunk 3 with 99 tokens.\n", "der Routenplanung dienten, ist längst vorbei. Denn mit den digitalen Karten von Google Maps und der\n", "------------------------------ \n", "Chunk 4 with 97 tokens.\n", "Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den\n", "------------------------------ \n", "Chunk 5 with 100 tokens.\n", "Dinge als den Weg von A nach B anzeigen lassen. Über offene Programmschnittstellen (API) lassen sich\n", "------------------------------ \n", "Chunk 6 with 90 tokens.\n", "(API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene\n", "------------------------------ \n", "Chunk 7 with 99 tokens.\n", "oder eigene Informationen eintragen. Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und\n", "------------------------------ \n", "Chunk 8 with 100 tokens.\n", "aus Karten und Daten sozusagen. 
Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um\n", "------------------------------ \n", "Chunk 9 with 95 tokens.\n", "schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps\n", "------------------------------ \n", "Chunk 10 with 100 tokens.\n", "von Google Maps darzustellen. Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch\n", "------------------------------ \n", "Chunk 11 with 97 tokens.\n", "kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones\n", "------------------------------ \n", "Chunk 12 with 89 tokens.\n", "in Smartphones werden soziale \n", "Kartendienste immer beliebter. Auch sie nutzen die offenen\n", "------------------------------ \n", "Chunk 13 with 100 tokens.\n", "nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen\n", "------------------------------ \n", "Chunk 14 with 89 tokens.\n", "auch Privatpersonen und unabhängige \n", "Projekte von den Möglichkeiten des frei zugänglichen\n", "------------------------------ \n", "Chunk 15 with 99 tokens.\n", "frei zugänglichen Kartenmaterials. Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu\n", "------------------------------ \n", "Chunk 16 with 100 tokens.\n", "zugängliche Daten zu sammeln und neue \n", "Möglichkeiten für Bürger herauszuarbeiten. So können Anwohner\n", "------------------------------ \n", "Chunk 17 with 100 tokens.\n", "So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden\n", "------------------------------ \n", "Chunk 18 with 100 tokens.\n", "an die Behörden übermitteln.\n", "Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches\n", "------------------------------ \n", "Chunk 19 with 96 tokens.\n", "ein ähnliches Pilotprojekt für Frankfurt am Main. Hier geht es um weit mehr als Reparaturen. Die\n", "------------------------------ \n", "Chunk 20 with 100 tokens.\n", "als Reparaturen. Die Seite soll \n", "einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen –\n", "------------------------------ \n", "Chunk 21 with 93 tokens.\n", "Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort. Tausende dieser Mashups und\n", "------------------------------ \n", "Chunk 22 with 87 tokens.\n", "dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und\n", "------------------------------ \n", "Chunk 23 with 94 tokens.\n", "zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der\n", "------------------------------ \n", "Chunk 24 with 98 tokens.\n", "stellt einige der interessantesten vor. Sie zeigen, was man mit öffentlichen Datensammlungen alles\n", "------------------------------ \n", "Chunk 25 with 18 tokens.\n", "alles machen kann.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### RecursiveCharacterTextSplitter\n", "The [Langchain RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain_text_splitters.character.RecursiveCharacterTextSplitter) allows to define a **list of separators**, which are recursively applied. Again, the split at a separator position is applied only if the sum of the lengths of the new chunks is larger than the configured `chunksize`. 
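A sketch of this recursive strategy is shown below (simplified: it neither merges small splits nor applies `chunk_overlap`, and the helper name `recursive_split` is made up for illustration; it is not the actual Langchain implementation):\n", "\n", "```python\n", "def recursive_split(text, separators, chunk_size):\n", "    # text is short enough, or no separators left -> keep it as one chunk\n", "    if len(text) <= chunk_size or not separators:\n", "        return [text]\n", "    first, rest = separators[0], separators[1:]\n", "    chunks = []\n", "    for piece in text.split(first):\n", "        if len(piece) <= chunk_size:\n", "            chunks.append(piece)\n", "        else:\n", "            # the piece is still too long -> retry with the next separator\n", "            chunks.extend(recursive_split(piece, rest, chunk_size))\n", "    return chunks\n", "```\n", "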
\n", "\n", "In the example below the `RecursiveCharacterTextSplitter` first tries to split the text at empty lines (end of text sections), then it tries to split the text at all positions, where a dot is followed by space (as e.g. at the end of sentences). The list of separators can be arbitrarily long." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "text_splitter = RecursiveCharacterTextSplitter(\n", " separators = [\"\\n\\n\",\". \"],\n", " is_separator_regex = False,\n", " length_function = len,\n", " chunk_size = 200,\n", " chunk_overlap = 20\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "#chunks = text_splitter.create_documents([sourcetext])\n", "chunks = text_splitter.split_text(sourcetext)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(chunks)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 196 tokens.\n", "Landkarten mit Mehrwert\n", "Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen. \n", "\n", "ZEIT ONLINE stellt einige der interessantesten Dienste vor.\n", "------------------------------ \n", "Chunk 1 with 85 tokens.\n", "Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei\n", "------------------------------ \n", "Chunk 2 with 166 tokens.\n", ". Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen\n", "------------------------------ \n", "Chunk 3 with 151 tokens.\n", ". Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen\n", "------------------------------ \n", "Chunk 4 with 80 tokens.\n", ". Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen\n", "------------------------------ \n", "Chunk 5 with 163 tokens.\n", ". Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen\n", "------------------------------ \n", "Chunk 6 with 199 tokens.\n", ". Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus. Dank der Integration von Geodaten in Smartphones werden soziale \n", "Kartendienste immer beliebter\n", "------------------------------ \n", "Chunk 7 with 197 tokens.\n", ". Auch sie nutzen die offenen Schnittstellen. Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige \n", "Projekte von den Möglichkeiten des frei zugänglichen Kartenmaterials\n", "------------------------------ \n", "Chunk 8 with 126 tokens.\n", ". Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue \n", "Möglichkeiten für Bürger herauszuarbeiten\n", "------------------------------ \n", "Chunk 9 with 222 tokens.\n", ". 
So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.\n", "Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main\n", "------------------------------ \n", "Chunk 10 with 173 tokens.\n", ". Hier geht es um weit mehr als Reparaturen. Die Seite soll \n", "einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort\n", "------------------------------ \n", "Chunk 11 with 194 tokens.\n", ". Tausende dieser Mashups und Initiativen gibt es inzwischen. Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend. ZEIT ONLINE stellt einige der interessantesten vor\n", "------------------------------ \n", "Chunk 12 with 73 tokens.\n", ". Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### NLTKTextSplitter\n", "\n", "In the example above the punctuation mark *dot* is applied as `separator`. This means that sentences are separated, if they are longer than the configured `chunk_size`. However, problems arise with the implemented approach, e.g. because\n", "* not all sentences end with a dot\n", "* dots are not only applied as punctuation marks at the end of sentences.\n", "\n", "This means that a more sophisticated separation strategy must be defined, if sentences shall be separated. The NLP Python package [NLTK](https://www.nltk.org) provides more complex segmentation models and we can access such a model for segmenting into sentences via the [Langchain NLTKTextSplitter class](https://api.python.langchain.com/en/latest/nltk/langchain_text_splitters.nltk.NLTKTextSplitter.html#langchain_text_splitters.nltk.NLTKTextSplitter). \n", "\n", "Below, it is demonstrated how the `NLTKTextSplitter` segments into sentences, even if the sentences do not end with a dot and if dots are used for other purposes than sentence-termination:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "newtext=\"\"\"This is the first sentence. What comes next? \n", "Maybe, they will tell us what we can expect at 4:15 pm at Tuesday 24.03.2025.\"\"\"" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the first sentence. What comes next? \n", "Maybe, they will tell us what we can expect at 4:15 pm at Tuesday 24.03.2025.\n" ] } ], "source": [ "print(newtext)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "text_splitter = RecursiveCharacterTextSplitter(\n", " separators = [\"\\n\\n\",\". \"],\n", " is_separator_regex = False,\n", " length_function = len,\n", " chunk_size = 40,\n", " chunk_overlap = 10\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "chunks=text_splitter.split_text(newtext)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 26 tokens.\n", "This is the first sentence\n", "------------------------------ \n", "Chunk 1 with 97 tokens.\n", ". What comes next? 
\n", "Maybe, they will tell us what we can expect at 4:15 pm at Tuesday 24.03.2025.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "nltk_splitter = NLTKTextSplitter(\n", " chunk_size = 100,\n", " chunk_overlap = 20\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "chunks=nltk_splitter.split_text(newtext)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 45 tokens.\n", "This is the first sentence.\n", "\n", "What comes next?\n", "------------------------------ \n", "Chunk 1 with 95 tokens.\n", "What comes next?\n", "\n", "Maybe, they will tell us what we can expect at 4:15 pm at Tuesday 24.03.2025.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Created a chunk of size 134, which is longer than the specified 100\n", "Created a chunk of size 165, which is longer than the specified 100\n", "Created a chunk of size 150, which is longer than the specified 100\n", "Created a chunk of size 162, which is longer than the specified 100\n", "Created a chunk of size 102, which is longer than the specified 100\n", "Created a chunk of size 152, which is longer than the specified 100\n", "Created a chunk of size 125, which is longer than the specified 100\n", "Created a chunk of size 113, which is longer than the specified 100\n", "Created a chunk of size 107, which is longer than the specified 100\n", "Created a chunk of size 129, which is longer than the specified 100\n" ] } ], "source": [ "chunks=nltk_splitter.split_text(sourcetext)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------------ \n", "Chunk 0 with 134 tokens.\n", "Landkarten mit Mehrwert\n", "Ob als Reiseführer, Nachrichtenkanal oder Bürgerinitiative: Digitale Landkarten lassen sich vielseitig nutzen.\n", "------------------------------ \n", "Chunk 1 with 59 tokens.\n", "ZEIT ONLINE stellt einige der interessantesten Dienste vor.\n", "------------------------------ \n", "Chunk 2 with 86 tokens.\n", "Die Zeit, in der Landkarten im Netz bloß der Routenplanung dienten, ist längst vorbei.\n", "------------------------------ \n", "Chunk 3 with 165 tokens.\n", "Denn mit den digitalen Karten von Google Maps und der Open-Source-Alternative OpenStreetMap kann man sich spannendere Dinge als den Weg von A nach B anzeigen lassen.\n", "------------------------------ \n", "Chunk 4 with 150 tokens.\n", "Über offene Programmschnittstellen (API) lassen sich Daten von anderen Websites mit dem Kartenmaterial verknüpfen oder eigene Informationen eintragen.\n", "------------------------------ \n", "Chunk 5 with 79 tokens.\n", "Das Ergebnis nennt sich Mashup – ein Mischmasch aus Karten und Daten sozusagen.\n", "------------------------------ \n", "Chunk 6 with 162 tokens.\n", "Die Bewertungscommunity Qype nutzt diese Möglichkeit schon lange, um Adressen und Bewertungen miteinander zu verknüpfen und mithilfe von Google Maps darzustellen.\n", "------------------------------ \n", "Chunk 7 with 102 tokens.\n", "Auch Immobilienbörsen, Branchenbücher und Fotodienste kommen kaum noch ohne eigene Kartenfunktion aus.\n", "------------------------------ \n", 
"Chunk 8 with 95 tokens.\n", "Dank der Integration von Geodaten in Smartphones werden soziale \n", "Kartendienste immer beliebter.\n", "------------------------------ \n", "Chunk 9 with 43 tokens.\n", "Auch sie nutzen die offenen Schnittstellen.\n", "------------------------------ \n", "Chunk 10 with 152 tokens.\n", "Neben kommerziellen Diensten profitieren aber auch Privatpersonen und unabhängige \n", "Projekte von den Möglichkeiten des frei zugänglichen Kartenmaterials.\n", "------------------------------ \n", "Chunk 11 with 125 tokens.\n", "Das Open-Data-Netzwerk versucht, öffentlich zugängliche Daten zu sammeln und neue \n", "Möglichkeiten für Bürger herauszuarbeiten.\n", "------------------------------ \n", "Chunk 12 with 113 tokens.\n", "So können Anwohner in England schon länger über FixMyStreet Reparaturaufträge direkt an die Behörden übermitteln.\n", "------------------------------ \n", "Chunk 13 with 107 tokens.\n", "Unter dem Titel Frankfurt-Gestalten gibt es seit Frühjahr ein ähnliches Pilotprojekt für Frankfurt am Main.\n", "------------------------------ \n", "Chunk 14 with 42 tokens.\n", "Hier geht es um weit mehr als Reparaturen.\n", "------------------------------ \n", "Chunk 15 with 129 tokens.\n", "Die Seite soll \n", "einen aktiven Dialog zwischen Bürgern und ihrer Stadt ermöglichen – partizipative Lokalpolitik ist das Stichwort.\n", "------------------------------ \n", "Chunk 16 with 59 tokens.\n", "Tausende dieser Mashups und Initiativen gibt es inzwischen.\n", "------------------------------ \n", "Chunk 17 with 81 tokens.\n", "Sie bewegen sich zwischen bizarr und faszinierend, unterhaltsam und informierend.\n", "------------------------------ \n", "Chunk 18 with 51 tokens.\n", "ZEIT ONLINE stellt einige der interessantesten vor.\n", "------------------------------ \n", "Chunk 19 with 71 tokens.\n", "Sie zeigen, was man mit öffentlichen Datensammlungen alles machen kann.\n" ] } ], "source": [ "printChunks(chunks)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### HTMLSectionSplitter\n", "The [Langchain HTMLSectionSplitter](https://api.python.langchain.com/en/latest/html/langchain_text_splitters.html.HTMLSectionSplitter.html) can be applied to split Html-documents at defined section headers. This is demonstrated below:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from langchain_text_splitters.html import HTMLSectionSplitter" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "html_string = \"\"\"\n", " \n", " \n", " \n", "
\n", "

Foo

\n", "

Some intro text about Foo.

\n", "
\n", "

Bar main section

\n", "

Some intro text about Bar.

\n", "

Bar subsection 1

\n", "

Some text about the first subtopic of Bar.

\n", "

Bar subsection 2

\n", "

Some text about the second subtopic of Bar.

\n", "
\n", "
\n", "

Baz

\n", "

Some text about Baz

\n", "
\n", "
\n", "

Some concluding text about Foo

\n", "
\n", " \n", " \n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Document(metadata={'Header 1': 'Foo'}, page_content='Foo \\n Some intro text about Foo.'),\n", " Document(metadata={'Header 2': 'Bar main section'}, page_content='Bar main section \\n Some intro text about Bar. \\n Bar subsection 1 \\n Some text about the first subtopic of Bar. \\n Bar subsection 2 \\n Some text about the second subtopic of Bar.'),\n", " Document(metadata={'Header 2': 'Baz'}, page_content='Baz \\n Some text about Baz \\n \\n \\n Some concluding text about Foo')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headers_to_split_on = [(\"h1\", \"Header 1\"), (\"h2\", \"Header 2\")]\n", "#headers_to_split_on = [(\"div\", \"Division\")]\n", "\n", "html_splitter = HTMLSectionSplitter(headers_to_split_on)\n", "html_header_splits = html_splitter.split_text(html_string)\n", "html_header_splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenisation\n", "Tokenisation is the segmentation of text-chunks into either\n", "* single characters\n", "* words\n", "* subwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenisation is required in nearly all NLP tasks. Also in the context of LLMs, text can not be passed directly to the Neural Network Architecture - the *Transformer*. Instead, as shown in the image below, text-chunks must be tokenized and mapped to integer-indexes. The corresponding integer-sequences are then mapped to a sequence of embedding vectors. Each of these vectors is then modified by a positional embedding (not shown in the image below) and the resulting embedding sequence is passed to the Neural Network: \n", "\n", "
\n", "\n", "
\n", "\n", "The token-index-mapping is also required at the output of the LLM, since LLMs usually output the indices, which must then be mapped to words by applying the inverse *token2index*-mapping.\n", "\n", "The vocabulary and the corresponding token-index-mapping usually does not only contain all tokens found in the training corpus. Instead it is usual \n", "1. to **restrict the size** of the vocabulary to a maximum number of tokens\n", "2. to add **special tokens** e.g. tokens for\n", " 1. indicating unknown tokens (not in the learned vocabulary)\n", " 2. padding, i.e. filling up to a defined fixed length\n", " 3. start of sequence and end of sequence\n", " 4. separation of sequences ... \n", "\n", "Moreover, there exist **cased** and **uncased** vocabularies, where *uncased* means that lower- and upper-case are not distinguished. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenisation in GPT is demonstrated on the [OpenAI Platform](https://platform.openai.com/tokenizer). Enter an arbitrary text and check it's segmentation into tokens. Which of the above listed tokenisation categories (character-, word- or subword-level) is applied in GPT?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenisation on Character-Level and on Word-Level\n", "In Python **tokenisation into single characters** can easily be obtained by just casting the `string`, which contains the text, into a `list`:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "mystring=\"This is the first sentence. And what's next?\"" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "def show_tokens(tokenlist):\n", " colors_list = [\n", " '102;194;165', '252;141;98', '141;160;203', \n", " '231;138;195', '166;216;84', '255;217;47'\n", " ]\n", " for idx, t in enumerate(tokenlist):\n", " print(\n", " f'\\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' + \n", " t + \n", " '\\x1b[0m', \n", " end=' '\n", " )" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165mT\u001b[0m \u001b[0;30;48;2;252;141;98mh\u001b[0m \u001b[0;30;48;2;141;160;203mi\u001b[0m \u001b[0;30;48;2;231;138;195ms\u001b[0m \u001b[0;30;48;2;166;216;84m \u001b[0m \u001b[0;30;48;2;255;217;47mi\u001b[0m \u001b[0;30;48;2;102;194;165ms\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203mt\u001b[0m \u001b[0;30;48;2;231;138;195mh\u001b[0m \u001b[0;30;48;2;166;216;84me\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165mf\u001b[0m \u001b[0;30;48;2;252;141;98mi\u001b[0m \u001b[0;30;48;2;141;160;203mr\u001b[0m \u001b[0;30;48;2;231;138;195ms\u001b[0m \u001b[0;30;48;2;166;216;84mt\u001b[0m \u001b[0;30;48;2;255;217;47m \u001b[0m \u001b[0;30;48;2;102;194;165ms\u001b[0m \u001b[0;30;48;2;252;141;98me\u001b[0m \u001b[0;30;48;2;141;160;203mn\u001b[0m \u001b[0;30;48;2;231;138;195mt\u001b[0m \u001b[0;30;48;2;166;216;84me\u001b[0m \u001b[0;30;48;2;255;217;47mn\u001b[0m \u001b[0;30;48;2;102;194;165mc\u001b[0m \u001b[0;30;48;2;252;141;98me\u001b[0m \u001b[0;30;48;2;141;160;203m.\u001b[0m \u001b[0;30;48;2;231;138;195m \u001b[0m \u001b[0;30;48;2;166;216;84mA\u001b[0m \u001b[0;30;48;2;255;217;47mn\u001b[0m \u001b[0;30;48;2;102;194;165md\u001b[0m \u001b[0;30;48;2;252;141;98m \u001b[0m \u001b[0;30;48;2;141;160;203mw\u001b[0m \u001b[0;30;48;2;231;138;195mh\u001b[0m \u001b[0;30;48;2;166;216;84ma\u001b[0m 
\u001b[0;30;48;2;255;217;47mt\u001b[0m \u001b[0;30;48;2;102;194;165m'\u001b[0m \u001b[0;30;48;2;252;141;98ms\u001b[0m \u001b[0;30;48;2;141;160;203m \u001b[0m \u001b[0;30;48;2;231;138;195mn\u001b[0m \u001b[0;30;48;2;166;216;84me\u001b[0m \u001b[0;30;48;2;255;217;47mx\u001b[0m \u001b[0;30;48;2;102;194;165mt\u001b[0m \u001b[0;30;48;2;252;141;98m?\u001b[0m " ] } ], "source": [ "show_tokens(list(mystring))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As already demonstrated in subsection [Access and Analyse Content of Text Files](01AccessTextFromFile.ipynb) a simple form of **tokenisation into words** can be implemented in Python by applying the `split()` method of class `String`:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165mThis\u001b[0m \u001b[0;30;48;2;252;141;98mis\u001b[0m \u001b[0;30;48;2;141;160;203mthe\u001b[0m \u001b[0;30;48;2;231;138;195mfirst\u001b[0m \u001b[0;30;48;2;166;216;84msentence.\u001b[0m \u001b[0;30;48;2;255;217;47mAnd\u001b[0m \u001b[0;30;48;2;102;194;165mwhat's\u001b[0m \u001b[0;30;48;2;252;141;98mnext?\u001b[0m " ] } ], "source": [ "mylist=mystring.split()\n", "show_tokens(mylist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to get rid of the punctuation marks, which may be present at the end of words, the `String`-method `strip()` can be applied. This method strips of defined characters, if they appear at the start or the end of single words. In the code-cell below, the words are also normalized to lower characters:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;30;48;2;102;194;165mthis\u001b[0m \u001b[0;30;48;2;252;141;98mis\u001b[0m \u001b[0;30;48;2;141;160;203mthe\u001b[0m \u001b[0;30;48;2;231;138;195mfirst\u001b[0m \u001b[0;30;48;2;166;216;84msentence\u001b[0m \u001b[0;30;48;2;255;217;47mand\u001b[0m \u001b[0;30;48;2;102;194;165mwhat's\u001b[0m \u001b[0;30;48;2;252;141;98mnext\u001b[0m " ] } ], "source": [ "mycleanedlist=[w.strip('().,:;!?-\"').lower() for w in mylist]\n", "show_tokens(mycleanedlist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenisation on Sub-Word Level\n", "\n", "The problem of **word-level tokenisation** is that for large corpora the vocabulary (set of all different tokens) can be quite large, because each word-composition and each inflection of a word results in a individual token. On the other hand for **character-level tokenisation** the vocabulary is small, since it comprises only the set of all characters. However, the problem of character-level tokenisation is that single characters are not meaningful, compared to single words or even sentences. \n", "\n", "Due to the problems of these two types a **tokenisation on sub-word-level** constitutes a good compromise. Sub-words are more meaningful than single characters and for large corpora the set of all possible subwords is usually much smaller than the set of all possible words. For example for the 6 differnet words\n", "\n", "

*laugh, laughed, laughing, call, called, calling*\n", "\n", "only 4 sub-words are required:\n", "\n", "*laugh, call, ed, ing*

\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the example above, the way how words are split in sub-words is crucial. In order to end up with small vocabularies, sub-word tokenisation is only valuable, if frequently used character-sequences (e.g. *ing*) are not split into smaller sub-words, but rare character-sequences (e.g. *laughing*) should be decomposed. This concept is implemented in **Byte-Pair-Encoding (BPE)**, which is currently the most implemented tokenisation technique used in the context of Large Language Models (LLMs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Byte-Pair-Encoding has been introduced in {cite}`sennrich16`. \n", "\n", "```{admonition} Byte-Pair-Encoding Algorithm (BPE)\n", "1. The algorithm starts with a pre-tokenisation step. In this step all words and their frequency in the given corpus are determined.\n", "2. All words are then split into their single characters\n", "3. The initial vocabulary contains all characters, obtained in the previous step. The elements of the vocabulary are called *symbols*, i.e. at the initial step the set of symbols is the set of characters.\n", "4. The following steps are then processed in many iterations:\n", " 1. The pair of successive symbols which appears most often in the corpora is joined to form a new symbol\n", " 2. This new symbol is added to the vocabulary\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The BPE algorithm is explained by the following example:\n", "\n", "**1. Pre-Tokenisation**: Assume that the corpus consists of the following words with the given frequencies:\n", "|word|frequency|\n", "|--- |--- |\n", "| van | 10 |\n", "| can | 6 |\n", "| care | 8 |\n", "| fare | 7 |\n", "| car | 6 |\n", "\n", "**2. Split words in characters:** We split the words into their single characters and obtain\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,a,n | 10 |\n", "| c,a,n | 6 |\n", "| c,a,r,e | 8 |\n", "| f,a,r,e | 7 |\n", "| c,a,r | 6 |\n", "\n", "**3. Vocabulary**: The corresponding initial vocabulary consists of the following symbols $V=\\lbrace a,c,e,f,n,r,v \\rbrace$.\n", "\n", "**4. Iteration 1**: The pair of adjacent symbols which occurs most often is $a,r$ (frequency is 21). The new symbol *ar* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,a,n | 10 |\n", "| c,a,n | 6 |\n", "| c,*ar*,e | 8 |\n", "| f,*ar*,e | 7 |\n", "| c,*ar* | 6 |\n", "\n", "**5. Iteration 2**: The pair of adjacent symbols which occurs most often is $a,n$ (frequency is 16). The new symbol *an* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,*an* | 10 |\n", "| c,*an* | 6 |\n", "| c,*ar*,e | 8 |\n", "| f,*ar*,e | 7 |\n", "| c,*ar* | 6 |\n", "\n", "**6. Iteration 3**: The pair of adjacent symbols which occurs most often is $ar,e$ (frequency is 15). The new symbol *are* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,an | 10 |\n", "| c,an | 5 |\n", "| c,*are* | 8 |\n", "| f,*are* | 4 |\n", "| c,*ar* | 6 |\n", "\n", "**7. Terminate**: If we stop after the third iteration, the final vocabulary is \n", "\n", "$$\n", "V=\\lbrace a,c,e,f,n,r,v,ar,an,are \\rbrace\n", "$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Further remarks:**\n", "1. 
Above we assumed to start with single characters, which are iteratively joined to new symbols, which are sequences of characters, which appear frequently in the given corpus. However, in large corpora the amount of different characters and therefore the size of the initial vocabulary may be quite large. Therefore, **not characters but Bytes are applied as smallest unit in the vocabulary** and sequences of Bytes are iteratively joined in BPE (there exist at most 256 different Bytes, but potentially much more different characters).\n", "2. A frequently applied variant of BPE is **WordPiece**. When compared to BPE the difference is that **WordPiece** applies another rule for determining the next pair of sequences to merge. In BPE character(sequences) $u$ and $v$ are merged if\n", "\n", " $$\n", " u v=argmax_{a b} \\left( count(a b) \\right)\n", " $$\n", " \n", " In WordPiece $u$ and $v$ are merged if\n", " \n", " $$\n", " u v=argmax_{a b} \\left( \\frac{count(a b)}{count(a) \\cdot count(b)} \\right)\n", " $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Demo Tokenisation at the input and output of a LLM \n", "In this subsection we apply the [Phi-3-mini model from Hugging Face](https://huggingface.co/microsoft/Phi-3.5-mini-instruct). This LLM is relatively small (3.8 billion parameters) but quite performant. The model has a context-length of 128k tokens at its input and it supports token vocabularies of a maximum size of 32064. The Phi-3 tokenizer applies BPE.\n", "\n", "HuggingFace provides for each LLM also the associated Tokenizer, which has been applied for training the LLM. \n", "\n", "The overall process is \n", "1. Download the LLM and the Tokenizer\n", "2. Define a prompt\n", "3. Tokenize the prompt and pass the tokens-ids to the LLM\n", "4. Map the LLMs answer in form of a sequence of token-ids to the associated words\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n", "Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9271136b0f5549e4a92bd57bd5da8cfb", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00