Introduction
Let’s start at the beginning. Businesses and organizations collect more data than ever, and making sense of it is essential for informed decision-making. Text Analytics is a subfield of Natural Language Processing (NLP) that analyzes unstructured or semi-structured text data.
What is Text Analytics?
Text analytics is a set of techniques for extracting meaningful insights from text data, including sentiment analysis, named entity recognition, and topic modeling. The goal is to turn vast amounts of raw text into actionable information that can inform business decisions.
How to Perform Text Analytics
Text Analytics is typically performed by following these steps:
- Data Collection: Collect text data from various sources such as social media, customer feedback, and news articles.
- Data Preprocessing and Cleaning: Clean and preprocess the collected data, for example by removing stop words and punctuation and stemming or lemmatizing the remaining words (see the sketch after this list).
- Data Exploration and Visualization: Explore the data by creating word clouds, bar plots, and histograms to gain insights into the data.
- Model Selection and Training: Select a suitable machine learning model and train it on the preprocessed data.
- Model Evaluation and Improvement: Evaluate the model’s performance and make improvements if necessary.
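To make the preprocessing step concrete, here is a minimal sketch using NLTK; the sample sentence is just an illustration.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
text = "The customers loved the new features!"  # illustrative example
tokens = word_tokenize(text.lower())
# Keep alphabetic tokens that aren't stop words, then stem them
cleaned = [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]
print(cleaned)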
I wrote this article to guide you through text analytics and show you how to perform it using the Natural Language Toolkit (NLTK) in Python.

Why use Python NLTK for Text Analytics?
The Natural Language Toolkit (NLTK) is a popular open-source library for NLP in Python. It provides a wide range of text processing and analysis functionalities, including tokenization, stemming, and tagging. In this article, I’ll give a comprehensive overview of text analytics with NLTK, including sentiment analysis, text classification, and entity recognition.
Sentiment Analysis
Sentiment analysis is the process of determining the sentiment of a given text: positive, negative, or neutral. It can be used to gauge the overall sentiment of a movie review, for example, or to identify the sentiment of social media posts.
The NLTK library has a built-in sentiment analysis tool called “SentimentIntensityAnalyzer” (based on the VADER lexicon) that can score our text as positive, negative, or neutral. Its polarity_scores method returns the proportions of positive, negative, and neutral language in the text, along with a compound score between -1 and 1, where values close to -1 indicate negative sentiment, values close to 1 indicate positive sentiment, and values around 0 indicate neutral sentiment.
Here’s how you can use it:
import nltk
nltk.download('vader_lexicon')  # lexicon used by the analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
text = "This product is amazing!"
score = sentiment_analyzer.polarity_scores(text)  # neg/neu/pos proportions + compound score
print("Sentiment score: ", score)
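The result is a dictionary with neg, neu, and pos proportions plus the compound score. A common rule of thumb from the VADER authors is to treat a compound score of 0.05 or higher as positive, -0.05 or lower as negative, and anything in between as neutral.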
With sentiment analysis covered, let’s move on to the next technique.
Named Entity Recognition
Named entity recognition (NER) identifies and extracts named entities from text. Named entities include proper nouns, such as names of people, organizations, and locations.
The NLTK library provides a named entity recognition tool that can identify named entities in text. Here’s how you can use it:
import nltk
nltk.download('punkt')  # needed by word_tokenize
nltk.download('averaged_perceptron_tagger')  # needed by pos_tag
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk
text = "Barack Obama is the former President of the United States."
tokens = word_tokenize(text)  # split the sentence into tokens
tagged = pos_tag(tokens)  # assign part-of-speech tags
named_entities = ne_chunk(tagged)  # group tagged tokens into named-entity chunks
named_entities.draw()  # opens a window visualizing the tree
The ne_chunk function returns a tree structure representing the named entities in the text. As I’ve shown above, the tree can be visualized using the draw method.
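If you’d rather extract the entities programmatically than open the GUI window, you can walk the tree and collect the labeled subtrees. A minimal sketch, continuing from the code above:
# Labeled subtrees in the chunk tree are named entities
entities = []
for subtree in named_entities:
    if hasattr(subtree, 'label'):
        entity = " ".join(token for token, tag in subtree.leaves())
        entities.append((entity, subtree.label()))
print(entities)  # pairs of (entity text, label), with labels such as PERSON or GPE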
Topic Modeling
Topic modeling is the process of uncovering latent topics in a collection of documents. It is a way to discover hidden structure in large amounts of text data.
NLTK itself doesn’t ship an LDA implementation, but it works well alongside the gensim library: NLTK handles the preprocessing, while gensim’s LdaModel performs Latent Dirichlet Allocation (LDA). Here’s how you can combine them:
import nltk
nltk.download('reuters')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
texts = reuters.sents()  # already tokenized: each sentence is a list of words
# Preprocessing: lowercase, drop stop words and punctuation, lemmatize
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
texts = [[lemmatizer.lemmatize(word.lower()) for word in doc
          if word.isalpha() and word.lower() not in stop_words]
         for doc in texts]
# Create dictionary and bag-of-words corpus
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train LDA model
num_topics = 5
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
# Get topics
topics = model.show_topics(num_topics=num_topics, formatted=False)
for topic in topics:
    print(topic)
In this example, sentences from the Reuters corpus are used as documents for the LDA algorithm. First, each sentence is preprocessed by lowercasing, removing stop words and punctuation, and lemmatizing the words. Gensim’s Dictionary class maps words to IDs and converts each document into a bag-of-words vector, and the LdaModel class trains the LDA model on that corpus. Finally, the show_topics method displays the top words for each topic.
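Once the model is trained, you can also infer the topic mix of an unseen document by converting it to a bag-of-words vector with the same dictionary. A minimal sketch continuing from the code above; the sample sentence is just an illustration:
new_doc = "The central bank raised interest rates again this quarter."
tokens = [lemmatizer.lemmatize(w) for w in new_doc.lower().split()
          if w.isalpha() and w not in stop_words]
bow = dictionary.doc2bow(tokens)
print(model.get_document_topics(bow))  # (topic_id, probability) pairs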
In Summary
In this article I tried to provide an introduction to text analytics using the Python NLTK library. We covered sentiment analysis, named entity recognition, and topic modeling. By using these techniques, you can extract valuable insights from large amounts of unstructured text data. Whether you’re a beginner or an experienced data scientist, the NLTK library provides a comprehensive set of tools for text analytics, making it a great choice for any project.
And as always, thank you for taking the time to read this. If you have any comments, questions, or critiques, please reach out to me on our FREE ML Security Discord Server – HERE
