Introduction

  • Author: Prof. Dr. Johannes Maucher

  • Institution: Stuttgart Media University

  • Document Version: 2.0.9 (Incomplete DRAFT !!!)

  • Last Update: 07.10.2024

Lecture Contents

Introduction

  • Organisational Aspects

  • What is NLP?

  • Applications

  • Contents of this lecture

Access and Preprocess Text

  • Character Encoding

  • Text sources

  • Crawling

  • Cleaning

  • Chunking

  • Tokenisation

Word Normalisation

  • Morphology

  • Stemming and Lemmatisation

  • Error Correction

  • Tokenisation

PoS Tagging

  • PoS

  • PoS Tagsets

  • PoS Tagger

NLP Tasks

  • Topic Extraction

  • Named Entity Recognition

  • Text Classification

  • Machine Translation

  • Language Modelling

  • Text Generation

N-Gram Language Models

  • N-Grams

  • N-Gram LM

  • Smoothing

  • Text-Embeddings

Vector Representations of Words and Texts

  • One-Hot-Encoding

  • Bag-of-Words-Model

  • Word-Embeddings

  • Contextual Word-Embeddings

  • Text-Embeddings

Text Classification with conventional ML

  • ML Classification

  • Evaluation metrics

  • Naive Bayes

  • Fake news detection with conventional and deep ML

Neural Networks

  • MLP (Recap)

  • CNNs (Recap)

  • Recurrent Neural Networks

Transformer 1: Attention

  • Encoder-Decoder Models

  • Self-Attention

  • Encoder-Decoder Attention

Transformer 2: BERT and GPT

  • BERT

  • GPT-1,2,3

  • RLHF

  • ChatGPT

Fine-Tuning of LLMs

  • LoRA

Retrieval Augmented Generation

  • Indexing

  • Vector DB

  • Retrieval

What is NLP?

Natural Language Processing (NLP) strives to enable computers to understand and generate natural language. Since computers usually understand only formal languages (programming languages, maths, etc.), NLP techniques must provide the transformation from natural language to a formal language and vice versa.

Transformation between natural and formal language

This lecture focuses on the direction from natural language to formal language. However, later chapters also explain techniques for automatic language generation. In any case, only natural language in written form is considered. Speech recognition, i.e. the process of transforming speech audio signals into written text, is not in the scope of this lecture.

As a science, NLP is a subfield of Artificial Intelligence, which itself belongs to Computer Science. In the past, linguistic knowledge has been a key component of NLP.

Sciences used by NLP

The old approach to NLP, the so-called Rule-based approach, can be described as representing linguistic rules in a formal language and parsing text according to these rules. In this way, e.g. the syntactic structure of sentences can be derived, and from the syntactic structure the semantics can be inferred.
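To make the Rule-based approach concrete, the following minimal sketch uses the NLTK library (assuming it is installed) to parse a sentence with a tiny hand-written grammar. The grammar rules and the example sentence are invented for illustration only.

```python
# Rule-based parsing: a small, hand-written context-free grammar
# (illustrative rules only) is used to derive the syntactic structure
# of a sentence.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N
VP  -> V NP
Det -> 'the' | 'a'
N   -> 'dog' | 'cat'
V   -> 'chases'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog chases a cat".split()

# Print every parse tree licensed by the grammar
for tree in parser.parse(sentence):
    tree.pretty_print()
```

Real systems encode far more rules, but the principle is the same: the formal grammar decides which structures are admissible.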

The enormous success of NLP during the last few years is based on Data-based approaches, which increasingly replace the old Rule-based approach. The idea of this approach is to learn language statistics from large amounts of digitally available texts (corpora). For this, modern Machine Learning (ML) techniques, such as Deep Neural Networks, are applied. The learned statistics can then be applied e.g. for Part-of-Speech-Tagging, Named-Entity-Recognition, Text Summarisation, Semantic Analysis, Language Translation, Text Generation, Question-Answering, Dialog-Systems and many other NLP tasks.
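The following toy example illustrates the Data-based idea on a very small scale: simple language statistics (here bigram counts and the resulting conditional probabilities) are estimated from a miniature corpus. The corpus sentences are invented for illustration; real systems learn far richer statistics from huge corpora with ML models.

```python
# Data-based idea in miniature: estimate bigram statistics from a tiny corpus.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Maximum-likelihood estimate of P(next word | "the"),
# ignoring sentence boundaries for simplicity
word = "the"
for (w1, w2), count in bigrams.items():
    if w1 == word:
        print(f"P({w2} | {word}) = {count / unigrams[word]:.2f}")
```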

As the picture below illustrates, Rule-based approaches require the expert knowledge of linguists, whereas Data-based approaches require large amounts of data, ML algorithms and powerful hardware.

Rule-based and data-based approach

The following statement by Fred Jelinek expresses the increasing dominance of Data-based approaches:

Every time I fire a linguist, the performance of the speech recognizer goes up.

—Fred Jelinek[1]

From task-specific data-based solutions to Large Language Models (LLMs):

In nearly all NLP applications, data-based solutions have outperformed rule-based approaches. However, since 2017 and the rise of LLMs, we have seen another amazing technology shift in NLP: from task-specific Machine Learning solutions to Large Language Models (LLMs), which have been trained on vast amounts of text on gigantic GPU clusters. LLMs constitute a big step towards General AI in the sense that a single trained Neural Network architecture can be applied to many different tasks, such as translation, summarisation, classification, reasoning, question-answering, text generation etc.
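As a rough illustration of this "one model, many tasks" idea, the sketch below uses the Hugging Face transformers library with the freely available GPT-2 model (an assumption made for this example; GPT-2 is far smaller than modern LLMs, so its outputs only hint at the behaviour of today's models). One and the same generative model is prompted for different tasks.

```python
# One pretrained generative model, prompted for different tasks.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Translate English to German: How are you? ->",
    "Summarize: Natural Language Processing enables computers to process text. Summary:",
    "Question: What is the capital of France? Answer:",
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=20, num_return_sequences=1)
    print(out[0]["generated_text"], "\n")
```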

NLP Overview and Process Chain

The image below lists popular NLP use-cases in the leftmost column. To provide these applications, different NLP-specific tools, which are listed in the center column, are applied. These NLP tools implement more general algorithms, e.g. from Machine Learning (right column). Today, in particular a specific type of Deep Neural Network, the Transformer, is applied in Large Language Models like GPT.

NLP applications (left) require NLP tools (center), which apply fundamental algorithms (right) e.g. from Machine Learning.

The image below depicts a typical NLP process chain, i.e. it shows how NLP tools are sequentially applied to realize NLP applications.

NLP Processing Chain
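A minimal sketch of such a processing chain is shown below, using spaCy and its small English model en_core_web_sm (both assumed to be installed): tokenisation, PoS tagging and Named Entity Recognition are applied in sequence to a sentence.

```python
# A small NLP processing chain: tokenisation -> PoS tagging -> NER.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Stuttgart Media University offers a lecture on Natural Language Processing.")

# Tokenisation and Part-of-Speech tags
for token in doc:
    print(token.text, token.pos_)

# Named Entities found in the text
for ent in doc.ents:
    print(ent.text, ent.label_)
```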

All of the above-mentioned NLP applications, tools and algorithms are addressed in this lecture. However, at its heart the lecture places a strong emphasis on Large Language Models, their underlying Neural Network architecture (the Transformer), the required preprocessing (Chunking, Tokenisation, Embedding, …) and their application in the context of Retrieval Augmented Generation (RAG).