2023-06-08
How to Count Tokens - Tokenization With Tiktoken
Counting tokens is a useful task in natural language processing (NLP) that lets us measure the length and complexity of a text. There are two important use cases for counting tokens:
- controlling the length of the prompt - models have a limit on the number of input tokens, so it is good to check that you don't exceed the limit for your model
- cost awareness - when you know how many tokens you pass as input, you know the cost of the prompt.
In this blog post, we will explore how to count the number of tokens in a given text using OpenAI's tokenizer, called `tiktoken`. Whether you're a seasoned Python developer or just getting started with NLP, this guide provides a step-by-step process to accurately determine the token count of your text.
Introduction to tiktoken
To begin with, we need to install the `tiktoken` library, a fast tokenizer developed by OpenAI. It offers efficient tokenization and supports the encodings used by OpenAI's models. You can find the library on GitHub.
Code Example
Let's dive into a code example that demonstrates how to count tokens using `tiktoken`:
```python
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


num_tokens_from_string("tiktoken is great!", "cl100k_base")
```
In the example above, we import the `tiktoken` library and define a function called `num_tokens_from_string`. This function takes a text string and an encoding name as input parameters and returns the number of tokens in the given text string.

To count the tokens, we first obtain the encoding using `tiktoken.get_encoding(encoding_name)`. The `encoding_name` argument specifies the type of encoding we want to use. In this case, we use the `cl100k_base` encoding, which is suitable for second-generation embedding models like `text-embedding-ada-002`.

Next, we encode the input string using `encoding.encode(string)` and calculate the number of tokens by taking the length of the encoded sequence. The result is the total number of tokens in the text string.
`tiktoken` supports three encodings used by OpenAI models:

| Encoding name | OpenAI models |
|---|---|
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | Codex models, `text-davinci-002`, `text-davinci-003` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |
OpenAI Cookbook Guide
For a more detailed explanation and additional examples, you can refer to the OpenAI Cookbook guide on how to count tokens with tiktoken. The guide provides comprehensive instructions on token counting and offers insights into various use cases.
Tokenization Sandbox
If you're looking to experiment with text tokenization, OpenAI provides a convenient web application called the Tokenization Sandbox. You can access it here. The sandbox allows you to input text and observe the resulting tokens, helping you better understand the tokenization process.
Text splitter module
A Python script for splitting text into parts with controlled (limited) length in tokens. The script uses the `tiktoken` library for encoding and decoding text:
https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39
Count tokens CLI tool
Check out this simple CLI tool that has one purpose - counting tokens in a text file:
izikeros/count_tokens: Count tokens in a text file.
Rule of thumb
On the tokenizer sandbox page, OpenAI provides a rule of thumb that helps estimate the approximate number of tokens in a given text:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
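This rule of thumb translates directly into a quick estimator that needs no tokenizer at all. A rough sketch (the function names are my own):

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate for common English text: ~4 characters per token."""
    return max(1, round(len(text) / 4))


def approx_tokens_from_words(n_words: int) -> int:
    """100 tokens ~= 75 words, so tokens ~= words / 0.75."""
    return round(n_words / 0.75)
```

These are only estimates; for exact counts, use `tiktoken` as shown earlier.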
References
To develop this guide, we drew inspiration from the token counting instructions provided by OpenAI. You can find additional information in the OpenAI documentation, where they discuss the limitations and risks associated with embeddings.
Token counting is essential when working with NLP, enabling us to analyze and process text effectively. By leveraging OpenAI's `tiktoken` library and following the guidelines outlined in this blog post, you'll be well-equipped to count tokens accurately and efficiently.
Tags:
tokenizers
token
tokenization
tiktoken
openai
NLP