
SemanticChunker #

Bases: BaseTextChunker

Python class that splits text into semantically coherent chunks by embedding groups of sentences and breaking wherever the distance between consecutive groups exceeds a percentile threshold.

Credit to Greg Kamradt's notebook: 5 Levels Of Text Splitting

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `embed_model` | `BaseEmbedding` | Embedding model used for semantic chunking. |
| `buffer_size` | `int` | Number of sentences to group together. Default is `1`. |
| `breakpoint_threshold_amount` | `int` | Percentile threshold for detecting breakpoints between groups of sentences; the smaller the value, the more chunks are generated. Default is `95`. |
| `device` | `str` | Device to use for processing; currently `"cpu"` and `"cuda"` are supported. Default is `"cpu"`. |
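The interplay of `buffer_size` and `breakpoint_threshold_amount` can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: `semantic_chunks`, `toy_embed`, and `VOCAB` are hypothetical names, and the bag-of-words "embedding" merely stands in for a real embedding model.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    # Linear-interpolation percentile over sorted values.
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return s[f]
    return s[f] + (s[c] - s[f]) * (k - f)

def semantic_chunks(sentences, embed, buffer_size=1, breakpoint_threshold_amount=95):
    # 1. Group each sentence with `buffer_size` neighbors on each side.
    groups = [
        " ".join(sentences[max(0, i - buffer_size):i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    # 2. Embed each group and measure the distance between consecutive groups.
    vectors = [embed(g) for g in groups]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    # 3. Distances above the chosen percentile mark chunk breakpoints.
    threshold = percentile(distances, breakpoint_threshold_amount)
    chunks, start = [], 0
    for i, d in enumerate(distances):
        if d > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

# Toy bag-of-words "embedding" over a tiny fixed vocabulary (illustration only).
VOCAB = ["cat", "dog", "pet", "stock", "market"]

def toy_embed(text):
    words = text.lower().split()
    return [words.count(w) + 0.01 for w in VOCAB]  # small offset avoids zero vectors

sentences = [
    "the cat is a pet",
    "the dog is a pet",
    "the stock market rose",
    "the market fell today",
]
# buffer_size=0 keeps the toy example easy to trace by hand.
chunks = semantic_chunks(sentences, toy_embed, buffer_size=0)
```

With the default `breakpoint_threshold_amount=95`, only the single largest semantic jump (pets to markets) exceeds the threshold, so the four sentences collapse into two chunks; lowering the percentile would admit more breakpoints and produce more chunks.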

Example

```python
from beekeeper.core.text_chunker import SemanticChunker
from beekeeper.embedding.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)
```

chunk_text #

```python
chunk_text(text: str) -> list[str]
```

Split a single string of text into smaller chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Input text to split. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[str]` | List of text chunks. |
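Before any breakpoints can be detected, the input string must be cut into sentences. A naive splitter is sketched below; `simple_sentence_split` is a hypothetical helper for illustration, not the library's internal implementation.

```python
import re

def simple_sentence_split(text: str) -> list[str]:
    # Break on sentence-ending punctuation (., !, ?) followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

parts = simple_sentence_split("Cats purr. Dogs bark! Do fish sleep?")
```

The resulting sentences are what the chunker groups, embeds, and compares; a production splitter would also need to handle abbreviations, decimals, and quotations.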

chunk_documents #

```python
chunk_documents(documents: list[Document]) -> list[Document]
```

Split a list of documents into smaller document chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `documents` | `list[Document]` | List of `Document` objects to split. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[Document]` | List of chunked `Document` objects. |
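The contract of `chunk_documents` can be sketched as a flattening map: each text chunk becomes its own document, inheriting the source document's metadata. The `Document` dataclass below is a stand-in with `text` and `metadata` fields; beekeeper's real `Document` class may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Stand-in for beekeeper's Document; the real class may differ.
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_documents(documents, chunk_text):
    # Each text chunk becomes its own Document, with a copy of the
    # source document's metadata attached.
    return [
        Document(text=chunk, metadata=dict(doc.metadata))
        for doc in documents
        for chunk in chunk_text(doc.text)
    ]

docs = [Document(text="first part. second part.", metadata={"source": "notes.txt"})]
chunked = chunk_documents(
    docs, lambda t: [p.strip() + "." for p in t.split(".") if p.strip()]
)
```

Copying the metadata onto every chunk means downstream consumers (e.g. a retriever) can always trace a chunk back to its source document.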