Introduction
Author: Prof. Dr. Johannes Maucher
Institution: Stuttgart Media University
Document Version: 2.0.9 (Incomplete DRAFT !!!)
Last Update: 07.10.2024
Lecture Contents
Introduction
Organisational Aspects
What is NLP?
Applications
Contents of this lecture
Access and Preprocess Text
Character Encoding
Text sources
Crawling
Cleaning
Chunking
Tokenisation
Word Normalisation
Morphology
Stemming and Lemmatisation
Error Correction
Tokenisation
PoS Tagging
PoS
PoS Tagsets
PoS Tagger
NLP Tasks
Topic Extraction
Named Entity Recognition
Text Classification
Machine Translation
Language Modelling
Text Generation
N-Gram Language Models
N-Grams
N-Gram LM
Smoothing
Text-Embeddings
Vector Representations of Words and Texts
One-Hot-Encoding
Bag-of-Words-Model
Word-Embeddings
Contextual Word-Embeddings
Text-Embeddings
Text Classification with conventional ML
ML Classification
Evaluation metrics
Naive Bayes
Fake news detection with conventional and deep ML
Neural Networks
MLP (Recap)
CNNs (Recap)
Recurrent Neural Networks
Transformer 1: Attention
Encoder-Decoder Models
Self-Attention
Encoder-Decoder Attention
Transformer 2: BERT and GPT
BERT
GPT-1,2,3
RLHF
ChatGPT
Fine-Tuning of LLMs
LoRA
Retrieval Augmented Generation
Indexing
Vector DB
Retrieval
What is NLP?
Natural Language Processing (NLP) strives to build computers that can understand and generate natural language. Since computers usually only understand formal languages (programming languages, mathematics, etc.), NLP techniques must provide the transformation from natural language to a formal language and vice versa.

This lecture focuses on the direction from natural language to formal language. However, the later chapters also explain techniques for automatic language generation. In any case, only natural language in written form is considered. Speech recognition, i.e. the process of transforming speech audio signals into written text, is not in the scope of this lecture.
As a science, NLP is a subfield of Artificial Intelligence, which itself belongs to Computer Science. In the past, linguistic knowledge has been a key component of NLP.

The traditional approach to NLP, the so-called Rule-based approach, represents linguistic rules in a formal language and parses text according to these rules. In this way, e.g. the syntactic structure of sentences can be derived, and from the syntactic structure the semantics can be inferred.
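A minimal sketch of this rule-based idea with NLTK; the toy grammar and the example sentence below are purely illustrative assumptions, not material from the lecture:

```python
import nltk  # pip install nltk

# Linguistic rules written in a formal language: a toy context-free grammar.
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N  -> 'dog' | 'cat'
V  -> 'chases' | 'sees'
""")

# Parse a sentence according to these rules to obtain its syntactic structure.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chases a cat".split()):
    print(tree)
```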
The enormous success of NLP during the last few years is based on Data-based approaches, which increasingly replace the old Rule-based approach. The idea is to learn language statistics from large amounts of digitally available text (corpora). For this, modern Machine Learning (ML) techniques, such as Deep Neural Networks, are applied. The learned statistics can then be applied, e.g., for Part-of-Speech Tagging, Named Entity Recognition, Text Summarisation, Semantic Analysis, Language Translation, Text Generation, Question Answering, Dialogue Systems and many other NLP tasks.
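On a very small scale, the data-based idea can be illustrated by counting word statistics from a corpus. The following sketch (standard library only; the two-sentence toy corpus is an assumption) counts bigram frequencies, the kind of statistics an N-gram language model is built from:

```python
from collections import Counter

# Tiny toy corpus; real data-based NLP learns from millions of documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

# The learned "language statistics": frequencies of adjacent word pairs.
print(bigram_counts.most_common(3))
```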
As the picture below illustrates, Rule-based approaches require the expert knowledge of linguists, whereas Data-based approaches require large amounts of data, ML algorithms and powerful hardware.

The following statement by Fred Jelinek expresses the increasing dominance of Data-based approaches:
Every time I fire a linguist, the performance of the speech recognizer goes up.
—Fred Jelinek[1]
Example
Consider the NLP task Spam Classification. In a Rule-based approach one would define rules like if the text contains Viagra then class = spam, if the sender address is on a given blacklist then class = spam, etc. In a Data-based approach such rules are not required. Instead, a large corpus of e-mails labeled with either spam or ham is needed. A Machine Learning algorithm, e.g. a Naive Bayes classifier, learns a statistical model from the given training data. The learned model can then be applied for spam classification.
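A minimal sketch of such a data-based spam classifier with scikit-learn; the four-mail toy corpus and the bag-of-words plus Multinomial Naive Bayes setup are illustrative assumptions, not the lecture's reference implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled corpus (in practice: a large corpus of e-mails labeled spam/ham).
mails  = ["cheap viagra now", "meeting at noon tomorrow",
          "win money fast", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words features + Naive Bayes: the statistical model is learned
# from the data, no hand-written rules are involved.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(mails, labels)

print(model.predict(["win cheap money now"]))  # -> ['spam'] on this toy data
```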
From task-specific data-based solutions to Large Language Models (LLMs):
In nearly all NLP applications, data-based solutions have outperformed rule-based approaches. However, since 2017 and the rise of LLMs, we have seen another remarkable technology shift in NLP: from task-specific Machine Learning solutions to Large Language Models (LLMs), which are trained on vast amounts of text on gigantic GPU clusters. LLMs constitute a big step towards General AI in the sense that a single trained Neural Network architecture can be applied to many different tasks, such as translation, summarisation, classification, reasoning, question answering, text generation etc.
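As a small illustration of this shift, the Hugging Face transformers library exposes many tasks through one generic pipeline interface, each backed by a pretrained Transformer model. The task identifiers below are standard pipeline names, but the concrete default models that get downloaded are an assumption of this sketch:

```python
from transformers import pipeline  # pip install transformers

# The same generic interface serves different tasks with pretrained Transformers.
sentiment = pipeline("sentiment-analysis")
print(sentiment("I really enjoyed this lecture on NLP."))

translator = pipeline("translation_en_to_de")
print(translator("Natural language processing is fascinating."))
```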
NLP Overview and Process Chain
The image below lists popular NLP use cases in the leftmost column. To provide these applications, different NLP-specific tools, listed in the center column, are applied. These NLP tools implement more general algorithms, e.g. from Machine Learning (right column). Today, in particular, a specific type of Deep Neural Network, the Transformer, is applied in Large Language Models like GPT.

The image below depicts a typical NLP process chain, i.e. it shows how NLP tools are sequentially applied to realize NLP applications.
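As a concrete taste of such a chain, the following sketch runs spaCy's chained tools (tokenisation, PoS tagging, lemmatisation, named entity recognition) on one sentence; it assumes the English model en_core_web_sm is installed, and the example sentence is purely illustrative:

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

# One call applies the whole chain: tokenisation, PoS tagging,
# lemmatisation and named entity recognition.
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)

print([(ent.text, ent.label_) for ent in doc.ents])
```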

All of the above-mentioned NLP applications, tools and algorithms are addressed in this lecture. However, at its heart the lecture places a strong emphasis on Large Language Models, their underlying Neural Network architecture (the Transformer), the required preprocessing (Chunking, Tokenisation, Embedding, …) and their application in the context of Retrieval Augmented Generation (RAG).