
Tokenization in NLP

Last Updated : 04 Jun, 2025

Tokenization is a fundamental step in Natural Language Processing (NLP). It involves dividing textual input into smaller units known as tokens, which can be words, characters, sub-words, or sentences. Tokenization makes text easier for models to interpret and process. Let's understand how tokenization works.

Figure: Representation of Tokenization

What is Tokenization in NLP?

Natural Language Processing (NLP) is a subfield of Artificial Intelligence, information engineering, and human-computer interaction that focuses on processing and analysing large amounts of natural language data efficiently. This is difficult because reading and understanding language is far more complex than it seems at first glance.

  • Tokenization is a foundational step in the NLP pipeline that shapes the entire workflow.
  • It involves dividing a string or piece of text into a list of smaller units known as tokens.
  • A tokenizer segments unstructured data and natural language text into distinct chunks of information, treating each chunk as a separate element.
  • Tokens: words or sub-words in the context of natural language processing. For example, a word is a token in a sentence, and a character is a token in a word.
  • Applications: many NLP tasks, including text processing, language modelling, and machine translation.

Types of Tokenization

Tokenization can be classified into several types based on how the text is segmented. Here are some types of tokenization:

1. Word Tokenization

Word tokenization is the most commonly used method where text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:

Input before tokenization: ["Machine Learning is fascinating"]

Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
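
A minimal sketch of the idea, using nothing more than Python's built-in str.split() on whitespace (the library tokenizers shown later in this article also handle punctuation):

Python
# Naive word tokenization: split on whitespace only
text = "Machine learning is fascinating"
text.split()

Output:

['Machine', 'learning', 'is', 'fascinating']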

2. Character Tokenization

In character tokenization, the text is split into a sequence of individual characters. This is beneficial for tasks that require detailed analysis, such as spelling correction, and for languages with unclear word boundaries. It is also useful for character-level language modelling.

Example

Input before tokenization: ["You are helpful"]

Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
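
A minimal sketch: Python's built-in list() already splits a string into its individual characters, which is exactly what character tokenization produces:

Python
# Character tokenization: every character (including spaces) becomes a token
text = "You are helpful"
list(text)

Output:

['Y', 'o', 'u', ' ', 'a', 'r', 'e', ' ', 'h', 'e', 'l', 'p', 'f', 'u', 'l']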

3. Sub-word Tokenization

This strikes a balance between word and character tokenization by breaking down text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.

Example

["Time", "table"]
["Rain", "coat"]
["Grace", "fully"]
["Run", "way"]

Sub-word tokenization helps to handle out-of-vocabulary words in NLP tasks and for languages that form words by combining smaller units.
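
The sketch below is only a toy illustration (not a production algorithm): it greedily matches the longest piece from a small hand-made vocabulary, which is roughly how WordPiece-style tokenizers segment a word at inference time.

Python
# Toy greedy longest-match sub-word splitter with a hand-made vocabulary
vocab = {"grace", "fully", "rain", "coat", "run", "way", "time", "table"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # take the longest vocabulary entry that matches at this position
        for end in range(len(word), start, -1):
            piece = word[start:end].lower()
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # no vocabulary entry matches: emit the single character
            tokens.append(word[start])
            start += 1
    return tokens

print(subword_tokenize("Gracefully", vocab))
print(subword_tokenize("Raincoat", vocab))

Output:

['grace', 'fully']
['rain', 'coat']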

4. Sentence Tokenization

Sentence tokenization splits a paragraph or a larger block of text into individual sentences, each treated as a token. This is useful for tasks requiring individual sentence analysis or processing.

Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images. "]

Output when tokenized by sentences ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]

5. N-gram Tokenization

N-gram tokenization splits text into contiguous sequences of n tokens; for example, bigrams (n = 2) are pairs of adjacent words.

Input before tokenization: ["Machine learning is powerful"]

Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
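
A hedged sketch using NLTK's ngrams helper together with word_tokenize (this assumes the NLTK punkt resources used by word_tokenize have already been downloaded, as in the implementation section below):

Python
from nltk import ngrams, word_tokenize

text = "Machine learning is powerful"
list(ngrams(word_tokenize(text), 2))  # bigrams: n = 2

Output:

[('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]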

Need for Tokenization

Tokenization is an essential step in text processing and natural language processing (NLP) for several reasons. Some of these are listed below:

  • Effective Text Processing: Reduces raw text to a compact sequence of tokens, making statistical and computational analysis easier and more efficient.
  • Feature Extraction: Tokens can be used as features in machine learning models, so text data can be represented numerically for algorithms to work with (a brief sketch follows this list).
  • Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.
  • Text Analysis: Used in tasks such as sentiment analysis and named entity recognition to determine the function and context of individual words in a sentence.
  • Vocabulary Management: Generates a list of distinct tokens, which helps manage a corpus's vocabulary.
  • Task-Specific Adaptation: Tokenization can be adapted to the needs of a particular NLP task, for example summarization or machine translation.
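
As a brief illustration of the feature-extraction point, here is a hedged sketch using scikit-learn's CountVectorizer (an assumption: a recent version of scikit-learn is installed; it tokenizes each document internally and turns the resulting tokens into count features):

Python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Machine learning is fascinating",
          "Tokenization helps machine learning"]

vectorizer = CountVectorizer()            # tokenizes and builds a vocabulary
features = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(features.toarray())

Output:

['fascinating' 'helps' 'is' 'learning' 'machine' 'tokenization']
[[1 0 1 1 1 0]
 [0 1 0 1 1 1]]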

Implementation for Tokenization

Sentence Tokenization using sent_tokenize

The code snippet below uses the sent_tokenize function from the NLTK library, which segments a given text into a list of sentences.

Python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # the Punkt sentence models are needed once for sent_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)

Output: 

['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']

How sent_tokenize works: The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which has already been trained and therefore knows which characters and punctuation mark the beginning and end of a sentence.

Sentence Tokenization using PunktSentenceTokenizer

The PunktSentenceTokenizer from the NLTK library can also be used directly. The Punkt tokenizer is a data-driven sentence tokenizer that comes with NLTK and is trained on a large corpus of text to identify sentence boundaries.

Python
import nltk.data

# Loading PunktSentenceTokenizer using the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
tokenizer.tokenize(text)

Output: 

['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']

Tokenizing sentences in a different language

Sentences in languages other than English can be tokenized by loading a pickle file for that language instead of the English one.

  • In the following code snippet, we use the NLTK library to tokenize Spanish text into sentences with the pre-trained Punkt tokenizer for Spanish.
  • The Punkt tokenizer is a data-driven, machine-learning-based tokenizer that identifies sentence boundaries.
Python
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

Output: 

['Hola amigo.', 
 'Estoy bien.']

Word Tokenization using word_tokenize

The code snippet uses the word_tokenize function from the NLTK library to tokenize a given text into individual words.

  • The word_tokenize function is helpful for breaking down a sentence or text into its constituent words.
  • Eases analysis or processing at the word level in natural language processing tasks.
Python
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)

Output: 

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

How word_tokenize works: The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class.

Word Tokenization Using TreebankWordTokenizer 

The code snippet uses the TreebankWordTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text into individual words.

Python
from nltk.tokenize import TreebankWordTokenizer

text = "Hello everyone. Welcome to GeeksforGeeks."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)

Output:

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers separate words using punctuation and spaces. As the outputs above show, they do not discard the punctuation, allowing the user to decide what to do with it during pre-processing.

Word Tokenization using WordPunctTokenizer

The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries. Each punctuation mark is treated as a separate token.

Python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

Output:

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

Word Tokenization using Regular Expression 

The code snippet uses the RegexpTokenizer from the Natural Language Toolkit (NLTK) to tokenize a given text based on a regular expression pattern.

Python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)

Output: 

['Let', 's', 'see', 'how', 'it', 's', 'working']

Using regular expressions allows for more fine-grained control over tokenization, and you can customize the pattern based on your specific requirements.
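
For example, a slightly different pattern (an illustrative sketch, not the only option) keeps contractions such as "it's" together instead of splitting them:

Python
from nltk.tokenize import RegexpTokenizer

# try contractions (word'word) first, then plain word characters
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
tokenizer.tokenize("Let's see how it's working.")

Output:

["Let's", 'see', 'how', "it's", 'working']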

More Techniques for Tokenization

We have discussed how to perform tokenization using the NLTK library. Tokenization can also be implemented with the following methods and libraries:

  • Spacy: spaCy is an NLP library that provides robust tokenization capabilities (see the sketch after this list).
  • BERT tokenizer: BERT uses a WordPiece tokenizer, a type of sub-word tokenizer, for tokenizing input text.
  • Byte-Pair Encoding: Byte Pair Encoding (BPE) is a data compression algorithm that has also found applications in natural language processing, specifically for tokenization. It is a sub-word tokenization technique that works by iteratively merging the most frequent pairs of consecutive bytes (or characters) in a given corpus.
  • Sentence Piece: SentencePiece is another sub-word tokenization algorithm commonly used for NLP tasks. It is designed to be language-agnostic and works by iteratively merging frequent sequences of characters or sub-words in a given corpus.
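
As an example of the first point, here is a minimal spaCy sketch (an assumption: the spacy package is installed; spacy.blank("en") creates a lightweight English pipeline containing only the tokenizer, so no trained model needs to be downloaded):

Python
import spacy

nlp = spacy.blank("en")   # tokenizer-only English pipeline
doc = nlp("Let's see how it's working.")
[token.text for token in doc]

Output:

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']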

Limitations of Tokenization

  • Tokenization alone cannot capture the meaning of a sentence, which can lead to ambiguity.
  • Languages such as Chinese and Japanese do not separate words with spaces, and languages such as Arabic have complex word forms; the absence of clear boundaries complicates tokenization.
  • It is hard to decide how to tokenize text that contains multi-part units such as email addresses, URLs and special symbols.
