Word Stemming

School of Computing Sciences, University of East Anglia

Overview

Like other stemmers, UEA-Lite applies a set of rules in sequence. There are two groups of rules: the first cleans the tokens, and the second alters suffixes.

The first group of rules begins by skipping a small list of six frequent problem words. An improvement to the stemmer would be to expand this list by adding other problem words which the second rule set cannot deal with. Next, possessive apostrophes are removed and contractions are expanded. All hyphens are removed, and tokens containing digits are left untouched. Strings consisting entirely of upper case letters and digits are also left untouched unless they end in a lower case 's', in which case the 's' is removed, transforming plural forms of acronyms to singular forms.
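The cleaning rules described above can be sketched in Python as follows. This is a minimal illustration, not the UEA-Lite source: the entries in PROBLEM_WORDS and CONTRACTIONS are placeholders (neither list is given here), the function name clean_token is hypothetical, and the order in which the checks run is an assumption.

```python
# Placeholder lists: the real six problem words and the real
# contraction table are not given in this document.
PROBLEM_WORDS = frozenset({"is", "as", "this", "has", "was", "during"})
CONTRACTIONS = {"'ve": " have", "'re": " are", "'ll": " will"}


def clean_token(token):
    """Return (cleaned_token, protected).

    protected is True when the token should bypass the suffix rules.
    """
    # 1. Problem words are passed through unchanged.
    if token.lower() in PROBLEM_WORDS:
        return token, True

    # 2. Plural acronym: upper case letters/digits plus a final 's'.
    head = token[:-1]
    if token.endswith("s") and head.isupper() and head.isalnum():
        return head, True

    # 3. Plain acronyms and any token containing a digit stay untouched.
    if (token.isupper() and token.isalnum()) or any(c.isdigit() for c in token):
        return token, True

    # 4. Expand a trailing contraction (assumed table above).
    for suffix, expansion in CONTRACTIONS.items():
        if token.endswith(suffix):
            token = token[:-len(suffix)] + expansion
            break

    # 5. Remove possessive apostrophes: "dog's" and "dogs'".
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("s'"):
        token = token[:-1]

    # 6. Remove all hyphens.
    return token.replace("-", ""), False
```

Under these assumptions, clean_token("CDs") yields ("CD", True) and clean_token("co-operate") yields ("cooperate", False), so only the second token would go on to the suffix rules.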

Proper nouns should not usually be stemmed, except to remove possessives; our implementation will respect PoS tags if they are present. If the text is untagged, the stemmer uses the simple heuristic that any capitalized token not preceded by sentence-breaking punctuation is a proper noun.
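The untagged-text heuristic might look like the sketch below. The set of sentence-breaking punctuation, the function name proper_noun_flags, and the decision to treat the first token of the text as a sentence start are all assumptions made for illustration; PoS-tagged input is not handled here.

```python
# Assumed set of sentence-breaking punctuation tokens.
SENTENCE_BREAKS = frozenset({".", "!", "?"})


def proper_noun_flags(tokens):
    """Return one boolean per token: True if it looks like a proper noun.

    A token is flagged when it starts with a capital letter and the
    previous token was not sentence-breaking punctuation.
    """
    flags = []
    at_sentence_start = True  # assumption: the text begins a sentence
    for tok in tokens:
        capitalized = tok[:1].isupper()
        flags.append(capitalized and not at_sentence_start)
        at_sentence_start = tok in SENTENCE_BREAKS
    return flags


tokens = ["The", "stemmer", "was", "built", "in", "Norwich", ".", "It", "works"]
flags = proper_noun_flags(tokens)
proper_nouns = [t for t, f in zip(tokens, flags) if f]
```

Here only "Norwich" is flagged: "The" and "It" are capitalized but each begins a sentence. The heuristic will miss proper nouns that open a sentence, which is the price of its simplicity.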

Many texts, particularly scientific papers, contain sequences of digits, single letters, and other non-word tokens. Our implementation ignores tokens containing digits, single-letter tokens, and tokens with embedded punctuation.
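A filter for such non-word tokens could be written roughly as follows. The name is_ignored is hypothetical, and two choices are assumptions: "embedded" is read as punctuation strictly inside the token, and hyphens and apostrophes are excluded because the cleaning rules deal with them separately.

```python
import string

# Assumption: hyphens and apostrophes are handled by the cleaning
# rules, so they do not count as embedded punctuation here.
EMBEDDED_PUNCT = set(string.punctuation) - {"-", "'"}


def is_ignored(token):
    """Return True if the token should be skipped by the stemmer."""
    # Single-character tokens (single letters included) are skipped.
    if len(token) <= 1:
        return True
    # Any digit anywhere in the token.
    if any(c.isdigit() for c in token):
        return True
    # Punctuation between the first and last character.
    return any(c in EMBEDDED_PUNCT for c in token[1:-1])
```

With these assumptions, "x", "H2O" and "foo.bar" are ignored, while "running" and "co-operate" are passed on.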

The second group of rules contains 139 suffix rules, each testing for a specific type of suffix. The rules are ordered so that the longest applicable suffix is used rather than a shorter one, which could produce nonsense words and leave more words not fully stemmed to their root form.
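The effect of rule ordering can be shown with a small sketch. The five rules below are illustrative placeholders, not any of the 139 UEA-Lite rules, and apply_suffix_rules and the minimum-stem-length check are hypothetical. The first matching rule wins, so listing longer suffixes first ensures the longest applicable suffix is the one removed.

```python
# Illustrative placeholder rules as (suffix, replacement) pairs,
# ordered longest suffix first. These are NOT the UEA-Lite rules.
SUFFIX_RULES = [
    ("nesses", ""),
    ("ness", ""),
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
]


def apply_suffix_rules(word, rules=SUFFIX_RULES):
    """Apply the first rule whose suffix matches; later rules are skipped."""
    for suffix, replacement in rules:
        # Assumed guard: leave at least two characters of stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word


good = apply_suffix_rules("kindnesses")                                # "kind"
bad = apply_suffix_rules("kindnesses", list(reversed(SUFFIX_RULES)))   # "kindnesse"
```

With the rules in the intended order, "kindnesses" reduces to "kind". With the order reversed, the short "s" rule fires first and leaves the nonsense form "kindnesse", which is the failure the ordering is meant to prevent.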

Papers

Marie-Claire Jenkins, Dan Smith. Conservative stemming for search and indexing, 2005 (PDF 128KB)

Online version

Online JavaScript port of UEA-lite stemmer

Download

UEA-lite stemmer (Perl) 24KB
UEA-lite stemmer (Java zip) 125KB

Links

Martin Porter's page
Contains links to versions of the Porter algorithm in many languages.
Paice/Husk stemmer
Official website for the stemmer which references many relevant resources and implementations.
Lovins stemmer
Java, Perl and C implementations.
Edward D. Loper, Steven Bird, Natural Language Toolkit
A Python library containing stemming, PoS, parsing and other tools.

Contact us

Dan Smith