LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of Characters
The LangChain RecursiveCharacterTextSplitter splits text on a predefined list of characters that are treated as potential division points. By default, chunk size is measured in characters, but the
from_tiktoken_encoder() method lets you measure it in tokens instead. This is especially useful because LLM context limits are expressed in tokens, not characters. Token-aware splitting can help in various natural language processing tasks, such as language modeling or text classification.
To use the RecursiveCharacterTextSplitter, follow these steps:
Import the required module:
from langchain.text_splitter import RecursiveCharacterTextSplitter
Set the desired chunk size (in tokens):
CHUNK_SIZE_TOKENS = 1_500
Instantiate the RecursiveCharacterTextSplitter using the
from_tiktoken_encoder method, providing the chunk size and overlap values:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=CHUNK_SIZE_TOKENS,
    chunk_overlap=0,
)
Once the text_splitter object is created, use the
create_documents method to split your text into documents. Pass the text to be split inside a list:
docs = text_splitter.create_documents([text])
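To make the mechanism behind these steps concrete, here is a minimal, stdlib-only sketch of the recursive splitting idea (not LangChain's actual implementation): try separators from coarse to fine and recurse until every piece fits the size budget. The `length_fn` parameter is a stand-in introduced for illustration; `from_tiktoken_encoder()` effectively replaces it with a tiktoken-based token counter, while this sketch defaults to plain character length.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", ""), length_fn=len):
    """Sketch of recursive splitting: not LangChain's code.

    Note: the real splitter also merges small pieces back together
    (honoring chunk_overlap); this sketch only splits.
    """
    if length_fn(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:] or ("",)
    if sep == "":
        # Last resort: hard cut when no separator fits the budget.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for part in text.split(sep):
        if length_fn(part) > chunk_size:
            # Piece still too big: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, rest, length_fn))
        elif part:
            chunks.append(part)
    return chunks


# Example: a 9-unit budget forces splitting down to word level.
chunks = recursive_split("one two three four five", 9)
print(chunks)  # ['one', 'two', 'three', 'four', 'five']
```

Swapping `length_fn=len` for a tokenizer-based counter is exactly what makes the chunk size token-denominated rather than character-denominated.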
For alternative solutions and further discussion, you can refer to the following GitHub issue: LangChain Issue #4678.