Access and Preprocess Text

1. Access and Preprocess Text

The first step of any NLP processing chain is to access text. Some sources already provide text in a clean form; others, e.g. websites, contain not only raw text but also markup, images, tables, etc. In the latter case, the challenge of preprocessing is to separate the raw text from the irrelevant parts. Preprocessing also includes segmentation, i.e. the transformation of a possibly long text string into a list of sentences or words.

This chapter demonstrates how raw text can be accessed from:

local text-files
online text-files
online APIs, e.g. Twitter
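
As a first orientation, reading a local text file and an online text file can be sketched with the Python standard library alone. This is a minimal sketch, not necessarily the approach used later in this chapter: the function names `read_local_text` and `read_online_text`, the UTF-8 default and the timeout value are assumptions made for illustration. API access is omitted here, because it depends on the specific service and its credentials.

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_text(path, encoding="utf-8"):
    """Return the full content of a local text file as one string."""
    return Path(path).read_text(encoding=encoding)


def read_online_text(url, encoding="utf-8", timeout=10):
    """Download a text resource and decode it into one string.

    The encoding is assumed to be known in advance (UTF-8 by default);
    real sources may declare a different one.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


if __name__ == "__main__":
    # Hypothetical usage: replace the placeholder file name with a real one.
    sample = Path("sample.txt")
    if sample.exists():
        print(read_local_text(sample)[:200])
```

Both functions return a single, possibly long string, which is the typical input of the segmentation step.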

Moreover, the process of crawling raw text is shown for:

HTML files
RSS feeds
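
To illustrate what separating raw text from markup means, the following sketch uses only the Python standard library. It is one possible approach under simplifying assumptions, not necessarily the tooling used later in this chapter: the names `TextExtractor`, `html_to_text` and `rss_titles` are hypothetical, only `script` and `style` elements are treated as non-text, and the feed is assumed to be plain RSS 2.0 without namespaces.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def rss_titles(rss_xml):
    """Return the <title> text of every <item> in an RSS 2.0 string."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title", default="") for item in root.iter("item")]
```

Real web pages usually also contain navigation menus, footers and similar elements, so a practical crawler needs additional rules to decide which parts of a page count as relevant text.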

Corresponding preprocessing methods, e.g. for segmenting strings into lists of words and lists of sentences, are also demonstrated.
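
The idea of segmentation can be sketched with two regular expressions. This is a deliberately naive illustration: the function names `split_sentences` and `split_words` are hypothetical, and the sentence rule (split after `.`, `!` or `?` followed by whitespace) fails on abbreviations such as "e.g." or on decimal numbers, which dedicated NLP libraries handle more carefully.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(text):
    """Return all runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    raw = "This is one sentence. Is this another? Yes!"
    for sentence in split_sentences(raw):
        print(sentence, split_words(sentence))
```

Applying `split_words` to each sentence in turn yields a list of word lists, a structure that many later processing steps can work with directly.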