So what is NLP text tokenization anyway? Well, text tokenization is the process of breaking a piece of text down into smaller units, called tokens. Think about it like this: imagine you have a long sentence or a paragraph, and you want to understand it better. To do that, you might separate out the words, or even the letters, to study them more closely. In computer terms, these smaller pieces are called “tokens.” Tokenization helps computers read and understand text, just like you and I do.
Here’s an example: Let’s say we have a sentence, “I love the care bears.” To tokenize this sentence, we would break it up into smaller pieces like this: “I”, “love”, “the”, “care”, “bears”. Now, we have five sweet and gentle tokens that we can send into our dark AI abyss below.
These tokens are the building blocks of language and serve as the foundation for various NLP tasks that we may want to perform. So, in this tutorial, I’ll be discussing different tokenization techniques using a very popular Python library called NLTK (Natural Language Toolkit). I’ll also provide examples of tokenizing text with Python using NLTK, and cover advanced tokenization techniques if I feel up to it.
Why is NLP Text Tokenization Important?
Tokenization is crucial in NLP because it enables computers to make sense of human language by breaking it down into manageable chunks. This allows for more accurate and efficient NLP tasks, such as text summarization, sentiment analysis, and information extraction.
Word tokenization involves breaking text into individual words based on whitespace characters (spaces, tabs, and line breaks) or punctuation marks. That’s exactly what we saw in our care bears example above. RIP Care Bears.
Sentence tokenization focuses on breaking text into sentences, typically using punctuation marks, such as periods, question marks, and exclamation points, as well as capitalization cues. So depending on your objective, you may want to break your text down into either words or sentences.
Using NLTK (Natural Language Toolkit)
The NLTK (Natural Language Toolkit) is a popular Python library for NLP tasks, including text tokenization, and the only one I’ve worked with so far. The toolkit provides a pretty comprehensive suite of tools for processing and analyzing text data.
For testing, I’ll be using a text file, typically called a “corpus” in NLP terms. Specifically, I’ll be using the Senate Intelligence text file I created in a previous blog post here:
Or just keep following along with the code as it’s included.
NLP Text Tokenizing with NLTK
So to begin using NLTK in Python, you first need to install the library. This can be done using the following command:
pip install nltk
Word Tokenization Example
To perform word tokenization, I’m going to use the nltk.word_tokenize() function.
First I’ll download my text file to use.
wget https://www.dropbox.com/s/gsk1gieyihg3od0/threat_assessment.txt?dl=0 -O threat_assessment.txt
import nltk

nltk.download('punkt')

f = open('threat_assessment.txt', 'r')
text = f.read(1000)  # I only want to experiment with the first page or so.
#print (text)
word_tokens = nltk.word_tokenize(text)
print (word_tokens)
f.close()
You get output like this:
['Annual', 'Threat', 'Assessment', 'of', 'the', 'Intelligence', 'Community', 'for', 'the', 'Senate', 'Select', 'Committee', 'on', 'Intelligence', 'Dennis', 'C.', 'Blair', etc]
Sentence Tokenization Example
For sentence tokenization, I’ll use the nltk.sent_tokenize() function.
import nltk

nltk.download('punkt')

f = open('threat_assessment.txt', 'r')
text = f.read(10000)
sentence_tokens = nltk.sent_tokenize(text)
f.close()
print (sentence_tokens)
And this is what I got.
['Annual Threat Assessment of the \n\nIntelligence Community \n\nfor the Senate Select Committee on Intelligence \n\nDennis C. Blair \n\nDirector of National Intelligence \n\n12 February 2009 \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\x0cFebruary 2009 \n\nSENATE SELECT COMMITTEE ON \nINTELLIGENCE \n\nFEBRUARY 2009 \n\nINTELLIGENCE COMMUNITY \nANNUAL THREAT ASSESSMENT \n\nUNCLASSIFIED \nSTATEMENT FOR THE RECORD \n\nChairman Feinstein, Vice Chairman Bond, Members of the \nCommittee, thank you for the invitation to offer my assessment \nof threats to US national security.',
Advanced Tokenization Techniques
Custom Tokenization Using Regular Expressions
To create a custom tokenizer using regular expressions, use the nltk.RegexpTokenizer() class and provide a pattern:
import nltk

tokenizer = nltk.RegexpTokenizer(r'\w+')  # Mess around with the regex here.
tokenizer.tokenize(text)  # Reusing the text we read in earlier.
And messing with the regex gives us this.
['Annual', 'Threat', 'Assessment', 'of', 'the', 'Intelligence', 'Community', 'for', 'the', 'Senate', 'Select', 'Committee', 'on', 'Intelligence',
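The pattern you pass decides what counts as a token, so it's worth experimenting. As a sketch (on a made-up sample sentence, just for illustration), here's the simple \w+ pattern next to one that keeps contractions together:

```python
import nltk

sample = "It's a threat assessment from 2009, isn't it?"

# \w+ grabs runs of word characters, so contractions get split apart
# and punctuation is dropped entirely.
print(nltk.RegexpTokenizer(r'\w+').tokenize(sample))
# ['It', 's', 'a', 'threat', 'assessment', 'from', '2009', 'isn', 't', 'it']

# Allowing an optional internal apostrophe keeps "It's" and "isn't" whole.
print(nltk.RegexpTokenizer(r"\w+(?:'\w+)?").tokenize(sample))
# ["It's", 'a', 'threat', 'assessment', 'from', '2009', "isn't", 'it']
```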
Tokenizing Non-English Languages
So I threw this one in here as well, but I haven’t played with it a lot before. I didn’t have a great example of non-English text to show, but thought the code was at least worth something. To tokenize text in non-English languages (I’m looking at you, Russia!), you have to load the appropriate tokenizer for that language. First, import the nltk.data module, then load the language-specific tokenizer:
import nltk.data

other_language_tokenizer = nltk.data.load('tokenizers/punkt/PY3/[language of your choice].pickle')
And that’s that! I hope you’ve found this useful. Also, make sure to keep coming back to my blog, as I’m going to be putting out more text analytics and NLP posts here:
And as always, thank you for taking the time to read this. If you have any comments, questions, or critiques, please reach out to me on our FREE ML Security Discord Server – HERE