Word Normalisation

2. Word Normalisation#

Word normalisation tries to represent words in a unique way. This includes:

Character case: In most of the NLP tasks it should not be distinguished whether a word or characters of a word are written in upper or lower case.[1]
Word correction: Spelling mistakes shall be corrected, such that only correct words constitute the vocabulary
In case of spelling ambiguities, all words shall be mapped to a unique spelling (i.e. valid spellings due to German Rechtschreibreform)
In many NLP tasks different forms shall not be distinguished. For example in Information Retrieval (Web Search) the temporal form of a verb should be ignored as well as singular or plural of nouns.

Example

Consider a search-engine like google. If you enter your query-words, you like to get the same results independent of the wordform. E.g. the query video encoding shall provide the same result as encode videos. For this all words of the query and all words in the index of the search-engine must be normalised to a unique form. This normalisation does not only provide better results, but it also reduces memory- and time-complexity, because the index is much smaller than without normalisation