lemmatization
What is lemmatization?
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It's used in computational linguistics, natural language processing (NLP) and chatbots. By mapping related word forms to one entry, lemmatization makes tools such as chatbots and search engines more effective and accurate.
The goal of lemmatization is to reduce a word to its base or dictionary form, called a lemma. For example, the verb "running" would be identified as "run." To do this, lemmatization relies on morphological (structural) and contextual analysis of words.
To correctly identify a lemma, tools analyze a word's meaning and intended part of speech within its sentence and, where needed, within neighboring sentences or even the entire document. With this in-depth understanding, tools that use lemmatization can better interpret the meaning of a sentence.
How does lemmatization work?
Lemmatization takes a word and maps it to its lemma. For example, the verb "walk" might appear as "walking," "walks" or "walked." The inflectional endings "ing," "s" and "ed" are removed, and all three forms are grouped under the lemma "walk."
Some words have more than one possible lemma. For example, "saw" maps to the lemma "see" when it is the past tense of the verb, but to "saw" when it is a noun naming the cutting tool. In these cases, lemmatization attempts to select the right lemma based on the word's part of speech and the surrounding words in the sentence. Irregular forms are handled too: "better" is mapped to the lemma "good."
A basic way to perform lemmatization is to use an algorithm based on dictionary lookups. This process requires a detailed dictionary so the algorithm can find a specific word and link it back to the word's lemma. More complicated word forms or languages can require a rule-based system for lemmatization.
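The dictionary-lookup approach can be sketched in a few lines of Python. This is a minimal illustration, not a production lemmatizer: the `LEMMA_DICT` table, the part-of-speech labels and the `lemmatize` function are all hypothetical names made up for this example, and a real system would load a far larger dictionary and determine the part of speech automatically.

```python
# Hypothetical toy lexicon: (word form, part of speech) -> lemma.
# The entries and POS labels are assumptions chosen for illustration.
LEMMA_DICT = {
    ("walking", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("saw", "VERB"): "see",    # past tense of "see"
    ("saw", "NOUN"): "saw",    # the cutting tool
    ("better", "ADJ"): "good", # irregular comparative
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a (word, part-of-speech) pair.

    Unknown pairs fall back to the lowercased word unchanged.
    """
    key = (word.lower(), pos)
    return LEMMA_DICT.get(key, word.lower())


print(lemmatize("walked", "VERB"))  # walk
print(lemmatize("saw", "VERB"))     # see
print(lemmatize("saw", "NOUN"))     # saw
print(lemmatize("Better", "ADJ"))   # good
```

Keying the table on both the word and its part of speech is what lets the same surface form, "saw," resolve to two different lemmas.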
Applications of lemmatization
Lemmatization is commonly applied in the following areas:
- Artificial intelligence (AI).
- Big data analytics.
- Chatbots.
- Machine learning (ML).
- NLP.
- Search queries.
- Sentiment analysis.
In search, lemmatization lets end users query any inflectional form of a base word and still get relevant results. For example, if the user queries the plural form "routers," a search engine that applies lemmatization also returns relevant content that uses the singular form, "router."
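As a rough sketch of how this could work, the Python example below lemmatizes both the documents and the query before matching them. The `LEMMAS` table, the sample documents and the helper functions are hypothetical and exist only for this illustration; real search engines use much more elaborate analysis pipelines.

```python
from collections import defaultdict

# Hypothetical lemma table; a real system would use a full dictionary.
LEMMAS = {"routers": "router", "offices": "office"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace and map each token to its lemma."""
    return [LEMMAS.get(tok, tok) for tok in text.lower().split()]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted index from lemma to the IDs of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return IDs of documents that contain every lemma in the query."""
    hits = [index.get(tok, set()) for tok in tokenize(query)]
    return set.intersection(*hits) if hits else set()


docs = {
    1: "how to configure a router",
    2: "best routers for home offices",
    3: "baking bread at home",
}
index = build_index(docs)

print(sorted(search(index, "routers")))  # [1, 2]
print(sorted(search(index, "router")))   # [1, 2]
```

Because the index stores lemmas rather than surface forms, the singular and plural queries return the same documents.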
Lemmatization is an important part of natural language understanding and NLP, and also plays an important role in big data analytics and AI. For example, in big data analytics, lemmatization is used to normalize text documents.
Likewise, in NLP, lemmatization helps an AI or machine learning tool understand and converse with end users accurately. For example, sentiment analysis, which is used to identify the emotional tone behind a body of text, can use lemmatization to better determine meaning and emotional tone.
AI chatbots use lemmatization to help interpret user inputs. Specifically, lemmatization helps a chatbot recognize the form a word takes in context, which improves its understanding of whole sentences.
Lemmatization vs. stemming
In linguistics, lemmatization is closely related to stemming, as both aim to reduce an inflected word to a base form by removing the affixes added to it.
Stemming algorithms cut off the beginning or end of a word using a list of common prefixes and suffixes that might be part of an inflected word. This process is generally indiscriminate and can result in base forms of a word with incorrect spelling or meaning. Stemming operates without any contextual knowledge, meaning that it can't discern between similar words with different meanings.
For example, a stemmer reduces both "studies" and "studying" to "studi," which is not a real word, while a lemmatizer returns the base form "study" for both. For a regular word such as "walking," however, stemming and lemmatization both yield "walk." Although it is less accurate, stemming is easier to implement and runs faster. An example of stemming and lemmatization is shown as follows:
Stemming:
Study → Studi
Studying → Studi
Studies → Studi
Studied → Studi
Studier → Studier
Lemmatization:
Study → Study
Studying → Study
Studies → Study
Studied → Study
Studier → Studier
With stemming, most forms of the word "study" become "studi," which is not a real word. With lemmatization, most become the dictionary form "study."
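The contrast can be reproduced with a small Python sketch. The `naive_stem` function below is a hypothetical toy suffix stripper written for this example; it is not the Porter algorithm or any library's stemmer, although it happens to give the same "studi" results for these words. The `LEMMAS` table and `lemmatize` function are likewise assumptions standing in for a real dictionary-based lemmatizer.

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip one common suffix, then turn a final 'y' into 'i'.

    Works purely on spelling, with no dictionary or context.
    """
    w = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        # Only strip if at least three characters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


# Hypothetical lemma table standing in for a full dictionary.
LEMMAS = {"studying": "study", "studies": "study", "studied": "study"}


def lemmatize(word: str) -> str:
    """Dictionary lookup; unknown words are returned lowercased."""
    w = word.lower()
    return LEMMAS.get(w, w)


for word in ["study", "studying", "studies", "studied", "walking"]:
    print(f"{word:10} stem={naive_stem(word):8} lemma={lemmatize(word)}")
```

The stemmer never consults a dictionary, which is why it is cheap to run but can produce strings such as "studi" that are not valid words.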
Lemmatization is more complex than stemming, as it requires words to be categorized by part of speech as well as by inflected form. This can become quite complicated in languages with richer inflection than English, whose inflections are largely limited to singular and plural nouns, verb tense and agreement, and comparative or superlative forms of adjectives and adverbs.
Lemmatization advantages and disadvantages
Lemmatization offers the following benefits:
- Accuracy. Lemmatization is generally more accurate than stemming, as it maps each word to a valid dictionary form rather than a truncated string.
- Understanding text. Lemmatization helps NLP tools such as AI chatbots interpret full-sentence input from end users. It also helps search tools return relevant results for a query.
- Contextual understanding. Lemmatization can resolve each word based on how it is used in context.
But lemmatization does have some drawbacks when compared to stemming. Because it performs morphological analysis on each inflected word, lemmatization is slower and requires more computing resources than stemming.