Tokenization in Natural Language Processing (NLP) and Efficient GPU Acceleration
In natural language processing (NLP), machine learning models cannot directly interpret raw human language. Instead, textual input must first be transformed into a structured representation that algorithms can process. This transformation step is known as tokenization.
Tokenization represents the foundational stage of any NLP workflow. It converts plain text into smaller elements called tokens, which can then be processed by transformer architectures such as BERT or GPT. Depending on the selected strategy, tokens may consist of full words, subwords, individual characters, or punctuation symbols.
Example of a Tokenized Sentence
Example text: "Hello! I'm learning how to build a tokenizer in Python."
Tokenized sentence: ['hello', 'i', 'm', 'learning', 'how', 'to', 'build', 'a', 'tokenizer', 'in', 'python']
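One way to reproduce the token list above is a small regular-expression tokenizer. This is only a minimal sketch: the function name simple_tokenize and the choice to lowercase and keep only ASCII letters and digits are assumptions made for illustration, not a method the article prescribes.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    # Punctuation and apostrophes act as separators, so "I'm" becomes "i", "m".
    return re.findall(r"[a-z0-9]+", text.lower())

example = "Hello! I'm learning how to build a tokenizer in Python."
tokens = simple_tokenize(example)
print(tokens)
```

Production tokenizers usually keep punctuation and casing information; this sketch discards both to match the example.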
Traditional tokenization methods running on CPUs can become a performance bottleneck, particularly in large-scale systems or real-time inference environments. GPUs are typically optimized for vectorized computations and matrix operations, making string handling, regular expressions, and dictionary lookups less efficient. However, Hugging Face provides high-performance tokenizers implemented in Rust, which can operate efficiently alongside GPU workflows. This article explores different tokenizer types and demonstrates how tokenization can be accelerated using GPUs.
What Is a Tokenizer?
A tokenizer divides raw text into tokens (commonly words or subwords) and maps each token to a numerical identifier. These identifiers are the required inputs for transformer-based architectures like BERT, GPT, and RoBERTa.
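The mapping from tokens to numerical identifiers can be sketched with a plain dictionary. The helper names build_vocab and encode, and the special tokens chosen here, are hypothetical and used only for illustration; real tokenizers ship with a fixed, pretrained vocabulary.

```python
def build_vocab(tokens, specials=("[PAD]", "[UNK]")):
    """Assign an integer ID to each special token and each unique token, in first-seen order."""
    vocab = {}
    for tok in list(specials) + list(tokens):
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab, unk_token="[UNK]"):
    """Map tokens to IDs, falling back to the unknown-token ID for unseen tokens."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

train_tokens = ["tokenization", "is", "essential", "for", "nlp", "models"]
vocab = build_vocab(train_tokens)

# "fun" never appeared in the training tokens, so it maps to the [UNK] ID.
ids = encode(["tokenization", "is", "fun"], vocab)
print(ids)
```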
Types of Tokenizers
1. Word Tokenizers
Word tokenizers split text primarily based on whitespace and punctuation. They are straightforward and easy to understand but struggle when encountering out-of-vocabulary (OOV) terms.
Example (Using NLTK):
import nltk
nltk.download('punkt')  # newer NLTK releases may require 'punkt_tab' instead
from nltk.tokenize import word_tokenize
text = "Tokenization is essential for NLP models!"
tokens = word_tokenize(text)
print(tokens)
Output: ['Tokenization', 'is', 'essential', 'for', 'NLP', 'models', '!']
2. Subword Tokenizers
Subword tokenizers divide words into smaller, frequently occurring units. This approach is particularly effective for handling rare words or compound expressions, because a word missing from the vocabulary can still be assembled from known pieces.
a. Byte-Pair Encoding (BPE)
Byte-Pair Encoding iteratively merges the most frequent adjacent character or subword pairs within a corpus.
Example (Using Hugging Face tokenizers library):
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
# Trainer and training corpus
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = ["your_corpus.txt"] # Replace with a path to your text file
tokenizer.train(files, trainer)
# Encode text
output = tokenizer.encode("Tokenization is essential for NLP models!")
print(output.tokens)
b. WordPiece (used in BERT)
WordPiece operates similarly to BPE, but instead of merging the most frequent pair it greedily selects the merge that most increases the likelihood of the training corpus.
Example (Using Hugging Face transformers):
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Tokenization is essential for NLP models!")
print(tokens)
Output: ['token', '##ization', 'is', 'essential', 'for', 'nl', '##p', 'models', '!']
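The "##" pieces in the output above come from how a trained WordPiece vocabulary is applied to a word: BERT-style tokenizers repeatedly take the longest matching piece from the left. The sketch below illustrates that greedy longest-match-first lookup with a tiny hypothetical vocabulary (toy_vocab); it is not the library's implementation and does not cover how the vocabulary is trained.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "nl", "##p", "models", "[UNK]"}

print(wordpiece_encode("tokenization", toy_vocab))
print(wordpiece_encode("nlp", toy_vocab))
print(wordpiece_encode("gpu", toy_vocab))
```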
c. SentencePiece
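SentencePiece treats the input as a raw stream of Unicode characters and encodes whitespace as an ordinary symbol (▁), so it needs no language-specific pre-tokenization. This makes it particularly effective for multilingual applications.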
import sentencepiece as spm
# Train a SentencePiece model (one-time)
# spm.SentencePieceTrainer.train(input='your_corpus.txt', model_prefix='m', vocab_size=5000)
# Load and tokenize
sp = spm.SentencePieceProcessor(model_file='m.model')
tokens = sp.encode("Tokenization is essential for NLP models!", out_type=str)
print(tokens)
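A distinctive trait of SentencePiece is that whitespace is kept as a visible "▁" symbol, which makes detokenization lossless. The sketch below illustrates only that idea in plain Python; the helper names and the hand-written segmentation are hypothetical, and no trained model is involved.

```python
WS = "\u2581"  # the "▁" meta symbol that marks whitespace

def normalize(text):
    """Prefix the text with the marker and replace every space with it."""
    return WS + text.replace(" ", WS)

def detokenize(pieces):
    """Concatenate pieces, restore spaces, and drop the leading marker's space."""
    return "".join(pieces).replace(WS, " ")[1:]

text = "Hello world"
normalized = normalize(text)
print(normalized)

# Hypothetical segmentation a trained model might produce.
pieces = [WS + "Hello", WS + "wor", "ld"]
restored = detokenize(pieces)
print(restored)
```

Because the marker travels with the pieces, the original spacing can be recovered without any language-specific rules.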
3. Character-Level Tokenizers
Character-level tokenizers treat every individual character as a token.
text = "Token"
tokens = list(text)
print(tokens)
Output: ['T', 'o', 'k', 'e', 'n']
Tools That Support GPU Tokenization
1. Hugging Face Tokenizers (Fast Tokenizers)
Hugging Face offers PreTrainedTokenizerFast, powered by a Rust-based backend designed for efficient parallel processing. “Slow” tokenizers refer to Python implementations within the Transformers library, whereas “fast” tokenizers rely on the Rust-based tokenizers package.
The performance advantage becomes significant when processing large batches of text. For single sentences, the difference may be negligible, and the fast tokenizer can even be slightly slower. A key advantage of fast tokenizers is offset mapping, which precisely identifies the original character span corresponding to each token.
Although tokenization itself runs on the CPU, the resulting tensors can be transferred directly to the GPU.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
inputs = tokenizer(["Tokenize this on GPU"], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") for k, v in inputs.items()}
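To make the offset mapping mentioned above concrete, the sketch below computes character spans with a regular expression. It is a plain-Python illustration of the concept, not the Rust implementation; with a fast tokenizer in Transformers, the spans are requested by passing return_offsets_mapping=True. The function name tokenize_with_offsets is an assumption.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs such that text[start:end] == token."""
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Tokenize this, please!"
pairs = tokenize_with_offsets(text)
for token, (start, end) in pairs:
    # Each span points back to the exact characters the token came from.
    assert text[start:end] == token
    print(f"{token!r:12} -> ({start}, {end})")
```

Spans like these are what make it possible to highlight an entity or an answer in the original string after the model has worked on token IDs.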
2. Byte-Pair Encoding (BPE) Tokenization Explained
Byte-Pair Encoding is a subword segmentation algorithm that iteratively merges the most frequent adjacent token pairs until a target vocabulary size is reached.
Consider a corpus containing: “cat”, “cap”, “can”, “bat”, and “bats”. The initial vocabulary consists of individual characters: [“a”, “b”, “c”, “n”, “p”, “s”, “t”]. The algorithm repeatedly merges the most frequent neighboring symbols, such as (“a”, “t”) into “at”. This merging process continues, forming larger subwords like “cat” or “bat”, while maintaining flexibility for rare words.
Merging stops when the desired vocabulary size or merge count has been reached.
Python Implementation of BPE:
from collections import defaultdict
# Sample corpus with word frequencies
corpus = {
    "cat": 5,
    "cap": 3,
    "can": 2,
    "bat": 4,
    "bats": 2,
}
# Step 1: Represent each word as a tuple of characters
# (this simplified version adds no end-of-word marker)
def get_tokenized_corpus(corpus):
    return {tuple(word): freq for word, freq in corpus.items()}
# Step 2: Count the frequency of all adjacent symbol pairs
def get_pair_freqs(tokenized_corpus):
    pairs = defaultdict(int)
    for word, freq in tokenized_corpus.items():
        for i in range(len(word) - 1):
            pair = (word[i], word[i + 1])
            pairs[pair] += freq
    return pairs
# Step 3: Merge every occurrence of the given pair
def merge_pair(pair, tokenized_corpus):
    new_corpus = {}
    replacement = ''.join(pair)
    for word, freq in tokenized_corpus.items():
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]:
                new_word.append(replacement)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_corpus[tuple(new_word)] = freq
    return new_corpus
# Step 4: Apply BPE for a few merges
tokenized_corpus = get_tokenized_corpus(corpus)
vocab = set(char for word in tokenized_corpus for char in word)
print("Initial vocabulary:", sorted(vocab))
print("Initial corpus:", tokenized_corpus)
num_merges = 5
for i in range(num_merges):
    pair_freqs = get_pair_freqs(tokenized_corpus)
    if not pair_freqs:
        break  # every word is a single symbol, nothing left to merge
    most_frequent = max(pair_freqs, key=pair_freqs.get)
    print(f"\nMerge {i+1}: Merging {most_frequent} → {''.join(most_frequent)}")
    tokenized_corpus = merge_pair(most_frequent, tokenized_corpus)
    vocab.add(''.join(most_frequent))
    print("Updated corpus:", tokenized_corpus)
print("\nFinal vocabulary:", sorted(vocab))
Output:
Initial vocabulary: ['a', 'b', 'c', 'n', 'p', 's', 't']
Initial corpus: {('c', 'a', 't'): 5, ('c', 'a', 'p'): 3, ('c', 'a', 'n'): 2, ('b', 'a', 't'): 4, ('b', 'a', 't', 's'): 2}
Merge 1: Merging ('a', 't') → at
Updated corpus: {('c', 'at'): 5, ('c', 'a', 'p'): 3, ('c', 'a', 'n'): 2, ('b', 'at'): 4, ('b', 'at', 's'): 2}
Merge 2: Merging ('b', 'at') → bat
...
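Once merges have been learned, new words are segmented by replaying those merges in the order they were learned. The sketch below uses only the first two merges shown in the output above; the function name apply_merges is an assumption, and like the training code it uses no end-of-word marker.

```python
def apply_merges(word, merges):
    """Segment a word by applying learned BPE merges in training order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# The first two merges from the training run above.
merges = [("a", "t"), ("b", "at")]

print(apply_merges("bats", merges))  # ['bat', 's']
print(apply_merges("cat", merges))   # ['c', 'at']
print(apply_merges("rat", merges))   # ['r', 'at']  ("rat" was never in the corpus)
```

The last call shows why subword vocabularies cope with unseen words: "rat" still breaks down into known pieces instead of becoming an unknown token.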
3. NVIDIA RAPIDS cuDF GPU Subword Tokenizer
NVIDIA’s RAPIDS cuDF library provides GPU-accelerated subword tokenization. With a CPU-based tokenizer, every batch of token IDs has to be copied from host memory to the GPU, which adds latency. The cuDF subword tokenizer (cudf.str.subword_tokenize, used below through the SubwordTokenizer class) performs tokenization directly on the GPU, eliminating those transfers and significantly improving throughput. Key benefits include:
- Reported speedups of up to 483x over traditional CPU tokenizers, depending on hardware and workload
- All intermediate results remain in GPU memory
- No costly CPU–GPU data copying
- Direct integration with RAPIDS DataFrame pipelines
import cudf
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer
# Step 1: Hash the BERT vocabulary (only needs to be done once)
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
# Step 2: Initialize the tokenizer with the hashed vocab
cudf_tokenizer = SubwordTokenizer('voc_hash.txt', do_lower_case=True)
# Step 3: Create a cuDF Series with input text
str_series = cudf.Series(['This is the', 'best book'])
# Step 4: Tokenize using GPU
tokenizer_output = cudf_tokenizer(
str_series,
max_length=8,
max_num_rows=len(str_series),
padding='max_length',
return_tensors='pt', # Return PyTorch tensors
truncation=True
)
# Step 5: Access tokenized output (all in GPU memory)
print("Input IDs:\n", tokenizer_output['input_ids'])
print("Attention Mask:\n", tokenizer_output['attention_mask'])
print("Metadata:\n", tokenizer_output['metadata'])
Output:
Input IDs:
tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
        [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
       device='cuda:0', dtype=torch.int32)
Attention Mask:
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]],
       device='cuda:0', dtype=torch.int32)
Metadata:
tensor([[0, 1, 3],
        [1, 1, 2]], device='cuda:0', dtype=torch.int32)
cudf.str.subword_tokenize is especially useful for processing millions of text records or powering real-time, large-scale NLP workloads. It can also help eliminate tokenizer bottlenecks in production environments by serving as a high-performance alternative to slower spaCy or Hugging Face tokenization pipelines.
4. SentenceTransformers with GPU Support
SentenceTransformers automatically handles tokenization when encoding text. Once the input sentences are tokenized, the token IDs are passed through a pretrained transformer model, and a pooling strategy (typically mean pooling) turns the per-token outputs into a fixed-length sentence embedding. This approach is well-suited for semantic search, clustering, text classification, and sentence similarity tasks.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2').to("cuda")
embeddings = model.encode(["GPU tokenization"], device="cuda")
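The mean pooling step described above can be sketched in plain Python: token vectors are averaged, and padding positions (attention mask 0) are ignored. This is only an illustration of the arithmetic with made-up two-dimensional vectors; the library performs the same idea on GPU tensors, and the function name mean_pool is an assumption.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-padding row to avoid dividing by zero.
    return [t / max(count, 1) for t in total]

# Made-up embeddings: two real tokens followed by one padding position.
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 3.0]
```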
Best Practices for GPU Tokenization
- Use PreTrainedTokenizerFast for improved performance on large datasets.
- Move tensors to CUDA using .to("cuda") to avoid unnecessary data transfer overhead.
- Avoid very small batch sizes, as they underutilize GPU parallelism.
- Pre-tokenize and cache datasets during training workflows.
- Benchmark different batch sizes to determine optimal throughput and memory usage.
Applying these strategies ensures that tokenization does not slow down the NLP pipeline and that GPU resources are efficiently utilized, particularly in large-scale model training or latency-sensitive deployments such as chatbots.
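The advice to benchmark batch sizes can be turned into a small timing harness. The sketch below is a minimal example under stated assumptions: batched and benchmark are hypothetical helpers, and whitespace_tokenize_batch is a stand-in that should be replaced with the real tokenizer call being measured.

```python
import time

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def benchmark(tokenize_batch, texts, batch_sizes):
    """Return the wall-clock seconds needed to tokenize all texts per batch size."""
    timings = {}
    for size in batch_sizes:
        begin = time.perf_counter()
        for batch in batched(texts, size):
            tokenize_batch(batch)
        timings[size] = time.perf_counter() - begin
    return timings

def whitespace_tokenize_batch(batch):
    """Stand-in tokenizer, used only so the sketch runs without extra libraries."""
    return [text.split() for text in batch]

texts = ["Tokenize this on GPU"] * 10_000
timings = benchmark(whitespace_tokenize_batch, texts, batch_sizes=[1, 32, 256])
for size, seconds in timings.items():
    print(f"batch_size={size:>4}: {seconds:.4f} s")
```

Absolute numbers depend on the machine and the tokenizer, so the useful signal is how the timings change relative to each other as the batch size grows.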
Common Questions
Can tokenization run on a GPU?
Yes. Frameworks like RAPIDS enable fully GPU-accelerated tokenization, and Hugging Face tokenizers produce outputs that can be transferred to GPUs for accelerated downstream processing.
Which tokenizer supports GPU?
RAPIDS cuDF offers native GPU tokenization. Hugging Face provides highly optimized CPU tokenizers with GPU-compatible tensor outputs.
Is GPU tokenization faster?
For large datasets and batch workloads, GPU tokenization is typically faster due to parallel computation. For small inputs, the overhead of data transfers may reduce the advantage.
Do tokenizers require training?
Many pretrained tokenizers, such as BERT’s tokenizer, are readily available. Custom tokenizers can also be trained using Hugging Face’s tokenizers library.
How do I load a tokenizer?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Conclusion
Although tokenization might appear to be a minor preprocessing step, it can become a significant performance bottleneck in large-scale NLP systems, especially when limited to CPU execution. For large batch workloads, GPU-based tokenization can substantially accelerate preprocessing and improve end-to-end training and inference throughput.
Libraries such as Hugging Face’s PreTrainedTokenizerFast and RAPIDS’ SubwordTokenizer simplify scalable tokenization. Whether training deep learning models, deploying conversational AI systems, or analyzing massive text datasets, GPU-accelerated tokenization provides an efficient and scalable solution.


