1. Access and Preprocess Text
The first step of any NLP processing chain is to access text. Some sources already provide text in a clean form. Others, such as websites, contain not only raw text but also markup, images, tables, etc. In the latter case, the challenge of preprocessing is to separate the raw text from the irrelevant parts. Preprocessing also includes segmentation, i.e. the transformation of a possibly long text string into a list of sentences or words.
This chapter demonstrates how raw text can be accessed from:
local text-files
online text-files
online APIs, e.g. Twitter
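As a minimal sketch of the first two sources in the list above, the following uses only the Python standard library. The function names `read_local_text` and `read_online_text`, the default encoding and the timeout are assumptions made for illustration and are not taken from the chapter. API access is omitted here, since it depends on the specific service and its credentials.

```python
from pathlib import Path
from urllib.request import urlopen


def read_local_text(path, encoding="utf-8"):
    """Return the full content of a local text file as one string."""
    return Path(path).read_text(encoding=encoding)


def read_online_text(url, encoding="utf-8", timeout=10):
    """Download a text file over HTTP(S) and decode it to a string.

    The encoding is assumed to be known in advance; a real source
    may declare a different one in its response headers.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)
```

Both functions return a single, possibly long string, which is the input expected by the segmentation step described in this chapter.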
In addition, the chapter shows how to crawl raw text from:
HTML files
RSS feeds
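To illustrate what separating raw text from markup can look like, here is a small sketch based on the standard library's `html.parser` module. The class `TextExtractor` and the helper `html_to_text` are hypothetical names introduced for this example. The chapter itself may rely on a dedicated parsing library instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # non-empty text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements and not blank.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This approach drops all tags and keeps every remaining text node, so navigation menus and footers would still end up in the result. Filtering such parts usually requires knowledge of the page structure.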
Corresponding preprocessing methods, e.g. for segmenting strings into lists of words and lists of sentences, are also demonstrated.
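The idea of segmentation can be sketched with two regular expressions. This is a deliberately naive illustration, not the method used later in the chapter: `split_sentences` and `split_words` are assumed names, and the sentence rule simply splits after `.`, `!` or `?` followed by whitespace.

```python
import re


def split_sentences(text):
    """Split a string into sentences after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Return the lowercase word tokens of a string; punctuation is dropped."""
    return re.findall(r"\w+", sentence.lower())


text = "NLP starts with text. Is it clean? Often not!"
sentences = split_sentences(text)
words = [split_words(sentence) for sentence in sentences]
```

Here `sentences` holds three strings and `words` holds one list of tokens per sentence. The simple rule also splits after abbreviations such as "e.g.", which is one reason why dedicated tokenizers are usually preferred in practice.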