What is lemmatization?

Lemmatization is the process of grouping together different inflected forms of the same word. It's used in computational linguistics, natural language processing (NLP) and chatbots. Lemmatization treats related forms of a word as a single word, making tools such as chatbots and search engines more effective and accurate.

The goal of lemmatization is to reduce a word to its base or dictionary form, called a lemma. For example, the verb running would be identified as run. To do so, lemmatization relies on morphological, or structural, and contextual analysis of words.

To correctly identify a lemma, tools analyze a word's meaning and intended part of speech within the surrounding sentence, neighboring sentences or even the entire document. With this in-depth analysis, tools that use lemmatization can better understand the meaning of a sentence.

How does lemmatization work?

Lemmatization takes a word and reduces it to its lemma, or dictionary form. For example, the verb walk might appear as walking, walks or walked. Inflectional endings, such as s, ed and ing, are removed, and all of these forms are grouped under the lemma walk.

The word saw might be interpreted differently, depending on the sentence: As a past-tense verb, its lemma is see; as a noun, its lemma is saw. In these cases, lemmatization attempts to select the right lemma based on the surrounding words and sentence. Irregular forms are also mapped to their lemmas; for example, better might be reduced to good.
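The saw example can be sketched as a lookup keyed on both the word and its part of speech. This is a minimal illustration, not a real lemmatizer: the `LEMMAS` table, the tag names and the `lemmatize` function are assumptions made up for this example, and a real tool would infer the part of speech from context rather than receive it as an argument.

```python
# Toy lexicon keyed by (surface form, part of speech).
# Entries and tag names are illustrative assumptions only.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("walked", "VERB"): "walk",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

The same surface form yields two different lemmas, which is why part-of-speech information matters.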

A basic way to perform lemmatization is to use an algorithm based on dictionary lookups. This process requires a detailed dictionary so the algorithm can find a specific word and link it back to the word's lemma. More complicated word forms or languages require a rule-based system for lemmatization.

Types of lemmatization

Depending on the approach used and the linguistic features being addressed, one of three types of lemmatization is used.

1. Rule-based lemmatization

This approach uses clear linguistic rules to determine the base form of a word. It examines the structure of words and applies grammatical rules relevant to different parts of speech. By doing so, it identifies the appropriate base form based on the word's context. This method is particularly effective for languages with well-defined grammatical structures.
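As a rough sketch of the idea, the following applies a handful of suffix rules chosen by part of speech. The rule table, the tag names and the minimum-length guard are hypothetical choices for illustration; production rule sets are far larger and language-specific.

```python
# Hypothetical suffix rules per part of speech: (suffix, replacement).
# Rules are tried in order and the first match wins.
RULES = {
    "VERB": [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def rule_lemmatize(word: str, pos: str) -> str:
    """Apply the first matching suffix rule for the given part of speech."""
    word = word.lower()
    for suffix, replacement in RULES.get(pos, []):
        # Keep at least three characters before the suffix so that
        # short words such as "is" are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


for token, pos in [("walking", "VERB"), ("studies", "VERB"), ("routers", "NOUN")]:
    print(token, "->", rule_lemmatize(token, pos))
```

Rules this simple turn running into runn and leave mice unchanged, which is why rule-based systems are usually paired with additional rules or exception lists.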

2. Dictionary-based lemmatization

This method relies on a preexisting dictionary or lexicon that maps words to their lemmas, enabling the lemmatizer to look up each word and find its corresponding base form. For example, the dictionary might include entries such as the following:

  • Running → Run.
  • Better → Good.
  • Mice → Mouse.

One advantage of this approach is its ability to handle irregular words and exceptions, provided they are included in the dictionary.
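A minimal sketch of the lookup, using only the three entries listed above. The `LEXICON` name and the `lookup_lemma` function are assumptions for this example; the function also reports whether the word was found, since unknown words are the main weakness of this approach.

```python
# Lexicon holding the three example entries from the list above.
LEXICON = {"running": "run", "better": "good", "mice": "mouse"}


def lookup_lemma(word: str) -> tuple[str, bool]:
    """Return (lemma, found). Unknown words come back unchanged."""
    token = word.lower()
    if token in LEXICON:
        return LEXICON[token], True
    return token, False


for word in ["Running", "Better", "Mice", "geese"]:
    print(word, "->", lookup_lemma(word))
```

The irregular form mice resolves correctly because it's in the lexicon, while geese is returned as-is because it isn't.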

3. Machine learning-based lemmatization

This method uses machine learning (ML) models trained on extensive collections of text to understand the relationships between words and their base forms. These models recognize patterns and apply what they learn to new words, even if those words aren't in a dictionary. For example, a model might learn that words ending in ly are often adverbs and should be lemmatized to their base adjective form.
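The following is a deliberately tiny stand-in for such a model, not a real ML lemmatizer. It learns suffix rewrite patterns from a few hand-written (word, lemma) pairs and applies them to words it hasn't seen. The training pairs, the two-character ending feature and all function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def edit_script(word: str, lemma: str) -> tuple[str, str]:
    """Describe word -> lemma as (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def train(pairs, ending_len: int = 2) -> dict:
    """Learn the most frequent edit script for each word ending."""
    counts = defaultdict(Counter)
    for word, lemma in pairs:
        counts[word[-ending_len:]][edit_script(word, lemma)] += 1
    return {end: c.most_common(1)[0][0] for end, c in counts.items()}


def predict(model: dict, word: str, ending_len: int = 2) -> str:
    """Apply the learned script for this ending, if one exists and fits."""
    script = model.get(word[-ending_len:])
    if script is None:
        return word
    strip, add = script
    if not word.endswith(strip):
        return word
    return word[: len(word) - len(strip)] + add


# Hypothetical training data.
PAIRS = [
    ("walking", "walk"), ("talking", "talk"),
    ("jumped", "jump"), ("played", "play"),
    ("cities", "city"), ("parties", "party"),
]
model = train(PAIRS)

# None of these words appear in the training pairs.
for unseen in ["looking", "opened", "stories"]:
    print(unseen, "->", predict(model, unseen))
```

Real systems use much richer features and far more training data, but the principle is the same: patterns observed in training text are generalized to new words.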

Applications of lemmatization

Lemmatization is commonly applied in the following areas:

  • Search queries. Because search engine algorithms use lemmatization, users can query any variation of a word and get relevant results. For example, if the user queries the plural form of a word, such as routers, the search engine knows to also return relevant content that uses the singular form of the same word -- router.
  • Big data analytics. Lemmatization is an important part of natural language understanding and NLP, and it plays a similar role in big data analytics and AI, where it's used to normalize text documents before analysis.
  • Sentiment analysis. Sentiment analysis aims to identify the emotional tone behind a piece of text. By mapping inflected forms to a single lemma, lemmatization helps an AI or ML tool determine meaning and emotional tone more reliably.
  • Preprocessing text data. Lemmatization is an important preprocessing step before inputting text data into deep learning models. By reducing words to their base forms, lemmatization helps these models learn patterns and relationships within the text.
  • Chatbots. Chatbots use lemmatization to understand user inputs. Specifically, it helps a chatbot understand the contextual form a word takes, leading to an increased understanding of sentences.
  • Standardizing biomedical terminology. Biomedical terms often vary in spelling, prefixes, suffixes and verb tense. Lemmatization helps standardize these terms by reducing various inflected forms of a word to its base or lemma, making it easier to analyze and compare information in biomedical texts.
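The search query case from the list above can be sketched as follows. The documents, the lemma table and the `search` function are hypothetical; the point is only that the query and the documents are normalized the same way before matching.

```python
# Hypothetical lemma table and document collection.
LEMMAS = {"routers": "router", "switches": "switch"}

DOCS = {
    1: "how to reset a router",
    2: "switches and hubs explained",
}


def normalize(text: str) -> set[str]:
    """Lowercase, split on whitespace and map each token to its lemma."""
    return {LEMMAS.get(token, token) for token in text.lower().split()}


def search(query: str) -> list[int]:
    """Return IDs of documents that share at least one lemma with the query."""
    query_lemmas = normalize(query)
    return [doc_id for doc_id, text in DOCS.items()
            if query_lemmas & normalize(text)]


print(search("routers"))  # [1]
print(search("switch"))   # [2]
```

The plural query routers matches a document that contains only the singular router, and the singular query switch matches a document that contains only switches.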

Lemmatization vs. stemming

Both lemmatization and stemming are text normalization techniques. In linguistics, lemmatization is closely related to stemming, as both reduce inflected words to a base form. Stemming maps different forms of a word to a single form, typically with algorithms such as the Porter stemmer and its updated version, the Snowball stemmer. The Python Natural Language Toolkit (NLTK) provides built-in implementations of both the Snowball and Porter stemmers.

Stemming algorithms cut off the beginning or end of a word using a list of common prefixes and suffixes that might be part of an inflected word. This process is generally indiscriminate and can result in base forms of a word with incorrect spelling or meaning. Stemming operates without any contextual knowledge, so it can't discern between similar words with different meanings.

For example, a stemmer such as Porter reduces both studies and studying to studi, which isn't a real word; in lemmatization, the base form would be study for both. While less accurate, stemming is easier to implement and runs faster. The following example shows in more detail how stemming and lemmatization work for different variations of the word study:

Stemming

  • Study → Studi
  • Studying → Studi
  • Studies → Studi
  • Studied → Studi
  • Studier → Studier

Lemmatization

  • Study → Study
  • Studying → Study
  • Studies → Study
  • Studied → Study
  • Studier → Studier

With stemming, most inflections of the word study become studi, whereas with lemmatization, most become study.
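The contrast can be reproduced with a small sketch. The `crude_stem` function is a deliberately simple suffix stripper written for this example, not the Porter or Snowball algorithm, and the `LEMMAS` table is a hypothetical lexicon. Only the four inflected forms of study are covered.

```python
def crude_stem(word: str) -> str:
    """Strip a few suffixes with no dictionary or context.

    A simplified illustration, not the Porter algorithm.
    """
    w = word.lower()
    if w.endswith(("ies", "ied")):
        return w[:-2]              # "studies" -> "studi"
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]                 # "studying" -> "study"
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"           # "study" -> "studi"
    return w


# Hypothetical lexicon for the lemmatization side.
LEMMAS = {"studying": "study", "studies": "study", "studied": "study"}


def lemmatize(word: str) -> str:
    """Look the word up in the lexicon; return it unchanged if absent."""
    w = word.lower()
    return LEMMAS.get(w, w)


for word in ["study", "studying", "studies", "studied"]:
    print(f"{word}: stem={crude_stem(word)}, lemma={lemmatize(word)}")
```

The stemmer never consults a vocabulary, so it produces the non-word studi, while the lookup returns a valid dictionary form.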

Lemmatization is more complex than stemming, as lemmatization requires words to be categorized by part of speech as well as by inflected form. English has relatively few inflected forms: singular and plural, verb tense, and comparative and superlative forms of adverbs and adjectives. Lemmatization can become considerably more complicated in languages with richer inflection.

Lemmatization advantages and disadvantages

Lemmatization offers the following benefits:

  • Accuracy. Lemmatization is more accurate than stemming because it uses vocabulary and part-of-speech information to return a valid dictionary form rather than a truncated string.
  • Understanding text. Lemmatization helps NLP tools, such as AI chatbots, understand full-sentence input from end users. It's also useful for returning specific search results.
  • Contextual understanding. Word by word, lemmatization can interpret a term based on its contextual use. It analyzes surrounding words and grammatical structures to determine the correct part of speech.
  • Better information retrieval. Lemmatization helps search engines match user queries with relevant documents, improving search accuracy and information retrieval.
  • Dimensionality reduction. Lemmatization groups similar words, reducing the number of different words in a data set and simplifying text data. It's useful in tasks such as text classification and clustering because it preserves important information while making the data easier to handle.
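The dimensionality reduction point in the list above can be made concrete by counting distinct terms before and after lemmatization. The lemma table and the `vocabulary` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical lemma table; a real lemmatizer would supply these mappings.
LEMMAS = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "runs": "run", "running": "run", "ran": "run",
}


def vocabulary(tokens, lemmatize: bool = False) -> set[str]:
    """Return the set of distinct terms, optionally after lemmatization."""
    if lemmatize:
        return {LEMMAS.get(token, token) for token in tokens}
    return set(tokens)


tokens = "walk walks walked walking run runs running ran".split()
print(len(vocabulary(tokens)))                  # 8
print(len(vocabulary(tokens, lemmatize=True)))  # 2
```

Eight distinct surface forms collapse into two lemmas, so a downstream classifier or clustering model has fewer features to handle.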

Along with its many benefits, lemmatization also comes with the following disadvantages:

  • Computational overhead. Lemmatization incurs more computational overhead than stemming, which runs faster and with fewer computing resources.
  • Slower processing speed. Lemmatization algorithms are slower than stemming algorithms due to the morphological analysis lemmatization conducts on each inflected word. This could become a limitation for large data sets and real-time applications.
  • Language dependency. The effectiveness of lemmatization varies depending on the language being processed. Some languages have more complex grammatical structures that could complicate the lemmatization process and lead to inaccuracies.
  • Limited context. Lemmatization typically operates on a word-by-word basis, observing only a small window of surrounding text. While this method can be useful, it might not address ambiguities that require a wider context or understanding. For example, sometimes it's crucial to consider the entire sentence or document to fully grasp the meaning of a word.

This was last updated in March 2025