{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chunking and Tokenisation\n",
"\n",
"After accessing and possible cleaning documents, they usually have to be segmented into smaller pieces. Segmentation of documents into smaller pieces like chapters, sections, pages or even sentences is called **chunking**. Segmentation of chunks into words, subwords or even characters is called **tokenisation**.\n",
"\n",
"Simple tokenisation and chunking techniques have already been applied in subsections [Access and Analyse Content of Text Files](01AccessTextFromFile.ipynb) and [Regular Expressions in Python](05RegularExpressions.ipynb) by applying e.g. the `split()`-method of the Python class `String` or by applying regular expressions. "
]
},
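  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick reminder, the simplest form of word-level tokenisation is whitespace splitting with `split()`. The sentence below is just an illustrative example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Chunking splits documents, tokenisation splits chunks.\"\n",
    "tokens = text.split()  # split at arbitrary whitespace\n",
    "print(tokens)"
   ]
  },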
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
Some intro text about Foo.
\n", "Some intro text about Bar.
\n", "Some text about the first subtopic of Bar.
\n", "Some text about the second subtopic of Bar.
\n", "Some text about Baz
\n", "Some concluding text about Foo
\n", "laugh, laughed, laughing, call, called, calling
\n", "\n", "only 4 sub-words are required:\n", "\n", "laugh, call, ed, ing
\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen in the example above, the way how words are split in sub-words is crucial. In order to end up with small vocabularies, sub-word tokenisation is only valuable, if frequently used character-sequences (e.g. *ing*) are not split into smaller sub-words, but rare character-sequences (e.g. *laughing*) should be decomposed. This concept is implemented in **Byte-Pair-Encoding (BPE)**, which is currently the most implemented tokenisation technique used in the context of Large Language Models (LLMs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Byte-Pair-Encoding has been introduced in {cite}`sennrich16`. \n", "\n", "```{admonition} Byte-Pair-Encoding Algorithm (BPE)\n", "1. The algorithm starts with a pre-tokenisation step. In this step all words and their frequency in the given corpus are determined.\n", "2. All words are then split into their single characters\n", "3. The initial vocabulary contains all characters, obtained in the previous step. The elements of the vocabulary are called *symbols*, i.e. at the initial step the set of symbols is the set of characters.\n", "4. The following steps are then processed in many iterations:\n", " 1. The pair of successive symbols which appears most often in the corpora is joined to form a new symbol\n", " 2. This new symbol is added to the vocabulary\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The BPE algorithm is explained by the following example:\n", "\n", "**1. Pre-Tokenisation**: Assume that the corpus consists of the following words with the given frequencies:\n", "|word|frequency|\n", "|--- |--- |\n", "| van | 10 |\n", "| can | 6 |\n", "| care | 8 |\n", "| fare | 7 |\n", "| car | 6 |\n", "\n", "**2. Split words in characters:** We split the words into their single characters and obtain\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,a,n | 10 |\n", "| c,a,n | 6 |\n", "| c,a,r,e | 8 |\n", "| f,a,r,e | 7 |\n", "| c,a,r | 6 |\n", "\n", "**3. Vocabulary**: The corresponding initial vocabulary consists of the following symbols $V=\\lbrace a,c,e,f,n,r,v \\rbrace$.\n", "\n", "**4. Iteration 1**: The pair of adjacent symbols which occurs most often is $a,r$ (frequency is 21). The new symbol *ar* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,a,n | 10 |\n", "| c,a,n | 6 |\n", "| c,*ar*,e | 8 |\n", "| f,*ar*,e | 7 |\n", "| c,*ar* | 6 |\n", "\n", "**5. Iteration 2**: The pair of adjacent symbols which occurs most often is $a,n$ (frequency is 16). The new symbol *an* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,*an* | 10 |\n", "| c,*an* | 6 |\n", "| c,*ar*,e | 8 |\n", "| f,*ar*,e | 7 |\n", "| c,*ar* | 6 |\n", "\n", "**6. Iteration 3**: The pair of adjacent symbols which occurs most often is $ar,e$ (frequency is 15). The new symbol *are* is added to the vocabulary and the new symbol-frequency table is\n", "|symbols|frequency|\n", "|--- |--- |\n", "| v,an | 10 |\n", "| c,an | 5 |\n", "| c,*are* | 8 |\n", "| f,*are* | 4 |\n", "| c,*ar* | 6 |\n", "\n", "**7. Terminate**: If we stop after the third iteration, the final vocabulary is \n", "\n", "$$\n", "V=\\lbrace a,c,e,f,n,r,v,ar,an,are \\rbrace\n", "$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Further remarks:**\n", "1. 
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Further remarks:**\n",
    "1. Above we started from single characters, which are iteratively joined into new symbols, i.e. character sequences which appear frequently in the given corpus. However, in large corpora the number of different characters, and therefore the size of the initial vocabulary, may be quite large. Therefore, **not characters but bytes are used as the smallest units of the vocabulary**, and sequences of bytes are iteratively joined in BPE (there are at most 256 different bytes, but potentially many more different characters).\n",
    "2. A frequently applied variant of BPE is **WordPiece**. Compared to BPE, the difference is that **WordPiece** applies another rule for determining the next pair of symbols to merge. In BPE the character sequences $u$ and $v$ are merged if\n",
    "\n",
    "    $$\n",
    "    u v=argmax_{a b} \\\\left( count(a b) \\\\right)\n",
    "    $$\n",
    "\n",
    "    In WordPiece $u$ and $v$ are merged if\n",
    "\n",
    "    $$\n",
    "    u v=argmax_{a b} \\\\left( \\\\frac{count(a b)}{count(a) \\\\cdot count(b)} \\\\right)\n",
    "    $$\n",
    "\n",
    "    i.e. WordPiece normalizes the pair frequency by the frequencies of the individual symbols and thereby prefers pairs whose parts rarely occur outside of the pair."
   ]
  },
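  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code cell is a small sketch (not part of any library) which scores all adjacent symbol pairs of the toy corpus from above with both criteria. It shows that the two rules can already disagree on the very first merge:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "corpus = {\"van\": 10, \"can\": 6, \"care\": 8, \"fare\": 7, \"car\": 6}\n",
    "\n",
    "pair_counts, symbol_counts = Counter(), Counter()\n",
    "for word, freq in corpus.items():\n",
    "    for symbol in word:\n",
    "        symbol_counts[symbol] += freq\n",
    "    for pair in zip(word, word[1:]):\n",
    "        pair_counts[pair] += freq\n",
    "\n",
    "# BPE: merge the pair with the highest count\n",
    "bpe_pair = max(pair_counts, key=pair_counts.get)\n",
    "# WordPiece: merge the pair with the highest normalized count\n",
    "wp_pair = max(pair_counts, key=lambda p: pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]]))\n",
    "\n",
    "print(\"First merge of BPE      :\", bpe_pair)\n",
    "print(\"First merge of WordPiece:\", wp_pair)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On this toy corpus BPE merges $a,r$ first, whereas WordPiece selects $r,e$: the pair $r,e$ occurs less often, but $r$ and $e$ rarely occur outside of this pair."
   ]
  },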
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Demo: Tokenisation at the input and output of an LLM\n",
    "In this subsection we apply the [Phi-3-mini model from Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct). This LLM is relatively small (3.8 billion parameters) but quite performant. The model has a context length of 4k tokens at its input and it supports a token vocabulary of size 32064. The Phi-3 tokenizer applies BPE.\n",
    "\n",
    "For each LLM, Hugging Face also provides the associated tokenizer, which has been applied for training the LLM.\n",
    "\n",
    "The overall process is:\n",
    "1. Download the LLM and the tokenizer\n",
    "2. Define a prompt\n",
    "3. Tokenize the prompt and pass the token-ids to the LLM\n",
    "4. Map the LLM's answer, a sequence of token-ids, back to the associated words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.\n",
      "Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "9271136b0f5549e4a92bd57bd5da8cfb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model_name = \"microsoft/Phi-3-mini-4k-instruct\"\n",
    "# Load model and tokenizer\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_name,\n",
    "    device_map=\"cpu\",\n",
    "    torch_dtype=\"auto\",\n",
    "    trust_remote_code=True,\n",
    ")\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we formulate a prompt and check how the Phi-3 tokenizer splits this prompt into tokens:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[14350,\n",
       " 263,\n",
       " 3273,\n",
       " 17385,\n",
       " 362,\n",
       " 363,\n",
       " 263,\n",
       " 17366,\n",
       " 5121,\n",
       " 304,\n",
       " 7602,\n",
       " 1239,\n",
       " 1075,\n",
       " 2675,\n",
       " 304,\n",
       " 278,\n",
       " 330,\n",
       " 962,\n",
       " 15243,\n",
       " 523]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prompt = \"Write a short motivation for a lazy friend to convince him going to the gym tonight\"\n",
    "input_ids = tokenizer(prompt).input_ids\n",
    "input_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown in the output above, the tokenizer does not return the tokens themselves, but their associated integer-ids. In order to check which tokens are assigned to the ids, we have to decode the ids as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokens = [tokenizer.decode(t) for t in input_ids]"
   ]
  },
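  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The helper function `show_tokens()`, which is applied in the next cell, is not part of any library and is not defined elsewhere in this notebook. A minimal sketch, which prints each token on an individual background colour (ANSI escape codes) so that the token boundaries become visible, could look as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Qualitative colour palette for the token backgrounds (RGB values)\n",
    "colors = [(102, 194, 165), (252, 141, 98), (141, 160, 203),\n",
    "          (231, 138, 195), (166, 216, 84), (255, 217, 47)]\n",
    "\n",
    "def show_tokens(tokens):\n",
    "    \"\"\"Print each token on a coloured background to visualise token boundaries.\"\"\"\n",
    "    for i, token in enumerate(tokens):\n",
    "        r, g, b = colors[i % len(colors)]\n",
    "        print(f\"\\x1b[0;30;48;2;{r};{g};{b}m{token}\\x1b[0m\", end=\" \")"
   ]
  },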
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0;30;48;2;102;194;165mWrite\u001b[0m \u001b[0;30;48;2;252;141;98ma\u001b[0m \u001b[0;30;48;2;141;160;203mshort\u001b[0m \u001b[0;30;48;2;231;138;195mmotiv\u001b[0m \u001b[0;30;48;2;166;216;84mation\u001b[0m \u001b[0;30;48;2;255;217;47mfor\u001b[0m \u001b[0;30;48;2;102;194;165ma\u001b[0m \u001b[0;30;48;2;252;141;98mlazy\u001b[0m \u001b[0;30;48;2;141;160;203mfriend\u001b[0m \u001b[0;30;48;2;231;138;195mto\u001b[0m \u001b[0;30;48;2;166;216;84mconv\u001b[0m \u001b[0;30;48;2;255;217;47mince\u001b[0m \u001b[0;30;48;2;102;194;165mhim\u001b[0m \u001b[0;30;48;2;252;141;98mgoing\u001b[0m \u001b[0;30;48;2;141;160;203mto\u001b[0m \u001b[0;30;48;2;231;138;195mthe\u001b[0m \u001b[0;30;48;2;166;216;84mg\u001b[0m \u001b[0;30;48;2;255;217;47mym\u001b[0m \u001b[0;30;48;2;102;194;165mton\u001b[0m \u001b[0;30;48;2;252;141;98might\u001b[0m "
     ]
    }
   ],
   "source": [
    "show_tokens(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For passing the prompt tokens to the Phi-3 model, we let the tokenizer return the token-ids as a PyTorch tensor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tokenize the input prompt\n",
    "input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(\"cpu\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's generate the LLM's answer to the query defined in the prompt. By configuring the `max_new_tokens` argument, it is possible to control the length of the answer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You are not running the flash-attention implementation, expect numerical differences.\n"
     ]
    }
   ],
   "source": [
    "# Generate the text\n",
    "generation_output = model.generate(\n",
    "    input_ids=input_ids,\n",
    "    max_new_tokens=100\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned above, the LLM outputs a sequence of token-ids. In order to obtain the associated text, we have to apply the tokenizer's `decode()`-method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tensor([14350,   263,  3273, 17385,   362,   363,   263, 17366,  5121,   304,\n",
       "         7602,  1239,  1075,  2675,   304,   278,   330,   962, 15243,   523,\n",
       "        29889,    13,    13,  4290, 29901,    13,    13, 29950,  1032, 29892,\n",
       "          306,  1073,   366, 29915,   345,  1063, 11223, 17366,   301,  2486,\n",
       "        29892,   541,   306,  2289,  1348,   366,   881,  2041,   304,   278,\n",
       "          330,   962,   411,   592, 15243,   523, 29889,   739, 29915, 29879,\n",
       "          451,   925,  1048,  2805,   297,  8267, 29892,   372, 29915, 29879,\n",
       "         1048, 11223,  1781,   322,  2534,  2090, 29889, 20692,   592, 29892,\n",
       "          366, 29915,   645,  4459,   577,  1568,  2253,  1156,   263,  1781,\n",
       "          664,   449, 29889, 15113, 29892,   591,   508,  4380,   701,   322,\n",
       "        13958,   714, 12335, 29889,   739, 29915,   645,   367,   263,  2107,\n",
       "          982,   304, 18864,   278, 11005,   322,   306, 11640,   366,  2113])"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generation_output[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Write a short motivation for a lazy friend to convince him going to the gym tonight.\n",
      "\n",
      "Input:\n",
      "\n",
      "Hey, I know you've been feeling lazy lately, but I really think you should come to the gym with me tonight. It's not just about getting in shape, it's about feeling good and having fun. Trust me, you'll feel so much better after a good workout. Plus, we can catch up and hang out afterwards. It'll be a great way to spend the evening and I promise you won\n"
     ]
    }
   ],
   "source": [
    "# Print the output\n",
    "print(tokenizer.decode(generation_output[0]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}