
Stemming & Lemmatization in Python: Which One To Use?

Natural Language Processing (NLP) is a branch of Artificial Intelligence concerned with extracting meaningful features from human language. NLP systems are built by training models on huge datasets of textual data gathered from sources such as social media, web scraping, survey forms, and many other data collection techniques.


The first step after data gathering is cleaning the data and converting it into a machine-readable, numerical form that the machine can interpret. While the conversion process is a topic of its own, cleaning is the step that comes first. Within this cleaning task, inflection is an important concept that needs a clear understanding before moving on to stemming and lemmatization.

Inflection

We know textual data comprises sentences made up of words and other characters that may or may not impact our predictions. Commonly used words such as 'is', 'there', and 'and' are called stop words. These can be removed easily by forming a corpus of them, but what about different forms of the same word?
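As a quick sketch of the stop-word idea, a small hand-defined corpus can be used to filter a sentence (NLTK ships a fuller English list via nltk.corpus.stopwords; the small set below is just for illustration):

```python
# A minimal stop-word filter using a small hand-defined corpus.
# NLTK's nltk.corpus.stopwords provides a fuller list, but the idea
# is the same: drop words that carry little predictive signal.
stop_words = {"is", "there", "and", "the", "a", "an"}

sentence = "there is a book and a pen on the table"
tokens = sentence.split()
filtered = [w for w in tokens if w not in stop_words]

print(filtered)  # ['book', 'pen', 'on', 'table']
```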


You don’t want your machine to consider ‘study’ and ‘studying’ as different words, as the intent behind them is the same and both convey the same meaning. These variations of a word are known as inflections, and handling them is a common practice in NLP. This is the base idea behind both stemming and lemmatization, which take different approaches. Let’s discover the differences between them and look at which one is better to use.

Stemming

Stemming is a text normalization technique that focuses on reducing the ambiguity of words. It strips a word down to its stem by removing prefixes or suffixes, depending on the word under consideration, according to a defined set of rules.

The resulting words may or may not be actual, meaningful root words. The main purpose is to group similar words together so that further preprocessing can be optimized. For example, the words play, playing, and played all reduce to the stem “play”. This also helps reduce search time in search engines, as more focus is given to the key element.

Two failure cases need to be discussed regarding stemming: over stemming and under stemming. While removing prefixes and suffixes solves many cases, some words get stripped more than required, producing trash words with no meaning. When this happens drastically, it is known as over stemming. Under stemming is the reverse, where the stemming process removes too little and related word forms end up with different stems.

NLTK Stemming: Understand With This Program

Here is a program that helps you better understand NLTK stemming:

from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    root_word = ps.stem(w)  # every form reduces to "wait"
    print(root_word)

Stemming Pros

Evaluating the pros and cons of stemming and lemmatization in Python can help you compare the two and conclude which one is better. So, let’s start with the pros of stemming: 

  • Enhanced Model Performance: Stemming lowers the number of distinct words that an algorithm must process, which can enhance model performance. It can also speed up and improve the efficiency of the algorithm. 
  • Organizing Comparable Terms: Even though they have different forms, words with similar implications can be clustered together. When identifying relevant subjects or themes inside a document, as is the case with activities like document classification, this technique might be helpful. 
  • Easy to Compare and Comprehend: Since stemming often shrinks the vocabulary, texts are considerably simpler to compare, analyze, and comprehend. This is beneficial for projects like sentiment analysis, where the objective is to ascertain the sentiment of a document. 

Stemming Drawbacks

  • Overstemming: It occurs when a stemming algorithm reduces distinct words to the same word stem even though they are unrelated. For instance, the Porter stemmer reduces the words “universal,” “university,” and “universe” to the same stem, “univers.” 
  • Understemming: When related inflected words are reduced to different stems even though they should share the same one, this is known as understemming, or a false negative. 
  • Language Difficulties: It becomes more challenging to create stemmers as the spelling, morphology, and character encoding of the target language become more complex.
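Both failure modes can be seen directly with NLTK’s Porter stemmer (a short sketch, assuming NLTK is installed; no corpus downloads are needed for the stemmer itself):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Overstemming: three unrelated words collapse to the same stem.
print([ps.stem(w) for w in ["universal", "university", "universe"]])
# -> ['univers', 'univers', 'univers']

# Understemming: 'data' and 'datum' should share a stem but do not.
print(ps.stem("data"), ps.stem("datum"))
# -> data datum
```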

Lemmatization

Lemmatization is another approach for normalizing text by reducing words to their root form. It has the same motive of grouping words with similar intent into one group, but the difference is that the resulting words, called lemmas, are meaningful dictionary words.

Instead of being stripped by pre-defined rules, words are mapped through a dictionary of lemmas. The conversion takes more time because each word is first matched with its part of speech, which is itself a time-consuming process. 

This ensures that the root word has a literal meaning, which helps in deriving good results in analysis. It is useful when cleaner data is required for further analysis, even at the cost of extra processing time. One drawback of this technique is that, as it focuses more on the grammar of words, different languages require separate corpora, leading to more and more data handling. 


Lemmatization In Python: Use Cases

Lemmatization in Python reduces ambiguity in text. For example, the forms ‘bicycle’ and ‘bicycles’ are both reduced to the root word bicycle. In essence, it converts all words with the same meaning but distinct inflected forms back to their base form. 

It lessens the number of words in the provided text and aids in creating precise features for the machine-learning training system. The cleaner the data is, the smarter and more precise your machine-learning system will be.

Lemmatization Pros

  • Unlike stemming algorithms, lemmatization does more than simply clip words off. 
  • Words are examined depending on their POS to produce lemmas that take context into account. 
  • Lemmatization also creates terms that belong in dictionaries. 

Lemmatization Drawbacks

  • Lemmatization takes longer than stemming, because it determines each word’s meaning through morphological analysis and dictionary lookups. 

Stemming and Lemmatization In Python: Code To Distinguish Between Them

Stemming Code:

import nltk
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')  # tokenizer data, needed once for word_tokenize
porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))

Output:

Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri

Lemmatization Code:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemma dictionary, needed once
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Output:

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry

Evaluating The Output

When stemming ‘studies’ and ‘studying’, the output is the same (studi). The NLTK lemmatizer, however, returns a proper lemma for ‘studies’ (study) but leaves ‘studying’ unchanged, because without a part-of-speech tag it treats each word as a noun. Lemmatization is therefore the better choice when creating feature sets to train machine-learning models. 

Which One to Use?

Now comes the point of picking between the two. The choice is highly subjective, as the use case you are targeting plays a major role here. 

If you want to analyze a chunk of text and time is a constraint, you can opt for stemming: it performs the reduction in less time, but with a lower success rate, and the stems are produced algorithmically and may not have any meaning. 

Adopting lemmatization gives the added advantage of getting meaningful, accurate root words clubbed together from their different forms. If you can afford good computing resources and more time, it can be the better choice, and it should be adopted where precise analysis is wanted. For use cases such as search-engine queries, where a rough root word is enough to fetch the results the user wants, the faster technique can suffice. 

Python Implementation

The NLTK (Natural Language Toolkit) package is the standard Python implementation of NLP tasks. This library has all the required tools, such as stemmers, lemmatizers, stop-word removal, custom parse trees, and much more. It also ships corpus data from prominent sources within the package itself. 

The stemming technique has many implementations, but the most popular and oldest one is the Porter stemmer algorithm. The Snowball stemmer is also used in some projects. To understand the difference between stemming and lemmatization more clearly, look at the code below and its output:

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # needed once for the lemmatizer
word_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('flies'))
print(word_stemmer.stem('flies'))

Output:

fly
fli

The first output is from the lemmatizer and the second from the stemmer. You can see the difference that the lemmatizer gave the root word as the output while the stemmer just trimmed the word from the end. 
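The Snowball stemmer mentioned above refines the original Porter algorithm. A quick comparison (a sketch; ‘fairly’ is a commonly cited word where the two differ):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball is also called Porter2

# Snowball handles some suffixes (like -ly) that Porter leaves behind.
print(porter.stem("fairly"))    # -> fairli
print(snowball.stem("fairly"))  # -> fair
```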



Conclusion

NLP is growing every day, and new methods evolve with time. Most of them focus on efficiently extracting the right information from text data with minimum loss while eliminating noise. Both techniques are popularly used; what matters is that the analysis is carried out on clean data.

What are the two types of AI algorithms used to cluster documents?

Hierarchical clustering and non-hierarchical clustering are the two types of AI algorithms used to cluster documents. A hierarchical clustering algorithm divides and aggregates documents according to a set of rules, linking pairs of clusters into a hierarchy. While this technique is simple to read and comprehend, it may not be as effective as non-hierarchical clustering, and clustering can be tough when there are many flaws in the data. Non-hierarchical clustering entails merging and splitting existing clusters to create new ones; it is a comparatively quicker, more dependable, and more stable approach.

Is lemmatization preferred for sentiment analysis?

Lemmatization and stemming are both highly effective procedures. Lemmatization, however, always yields a dictionary word when reducing a term to its root form. When the meaning of a term isn't critical to the study, stemming is recommended; when it is vital for analysis, lemmatization is advised. As a result, if you had to pick one approach for sentiment analysis, lemmatization would be the one to go with.

How are stemming and lemmatization used for document clustering?

Document clustering, also known as text clustering, is a method of analyzing text documents by grouping them together. Its applications range from automated document organization to topic extraction and even speedy information retrieval. Stemming and lemmatization are used to reduce the number of tokens required to communicate the same information, hence improving the overall technique. After this preprocessing step, features are calculated by measuring the frequency of each token, and these features are then fed to the clustering algorithm.
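As a minimal sketch of that preprocessing step (assuming NLTK for the stemmer), stemming shrinks the vocabulary before token frequencies are counted:

```python
from collections import Counter
from nltk.stem import PorterStemmer

ps = PorterStemmer()
doc = "play played playing plays player cry cries cried"
tokens = doc.split()

# Stem every token, then count frequencies: the feature space shrinks
# because inflected forms collapse onto shared stems.
stem_counts = Counter(ps.stem(w) for w in tokens)

print(len(set(tokens)), "raw types ->", len(stem_counts), "stemmed types")
print(stem_counts)
```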
