Semantic Chunker

class SemanticChunker

Splits text into chunks based on semantic similarity, as computed by an embedding model.

Credit to Greg Kamradt’s notebook: 5 Levels Of Text Splitting.

Parameters:

embed_model (BaseEmbedding) – Embedding model used for semantic chunking.
buffer_size (int, optional) – Number of sentences to group together. Default is 1.
breakpoint_threshold_amount (int, optional) – Threshold percentage for detecting breakpoints between groups of sentences. The smaller this number, the more chunks are generated. Default is 95.
device (str, optional) – Device to use for processing. Currently supports “cpu” and “cuda”. Default is “cpu”.
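To make the roles of buffer_size and breakpoint_threshold_amount more concrete, here is a minimal sketch of percentile-based breakpoint detection in the spirit of the credited notebook. It is not this library's implementation. The function names, the bag-of-words toy_embed stand-in for a real embedding model, the regex sentence splitter, and the reading of buffer_size as "neighbors on each side" are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(text, buffer_size=1, breakpoint_threshold_amount=95):
    """Split text where the distance between neighboring sentence groups is unusually large."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    # Assumption: each sentence is grouped with `buffer_size` neighbors on each side.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(g) for g in groups]
    # distances[i] compares the group around sentence i with the group around sentence i + 1.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_threshold_amount)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # breakpoint: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "Cats purr. Cats sleep a lot. Cats chase mice. "
    "Stocks fell today. Stocks rose later. Bond yields moved."
)
for chunk in semantic_chunks(sample, breakpoint_threshold_amount=50):
    print(chunk)
```

In this sketch, lowering breakpoint_threshold_amount lowers the distance cutoff, so more gaps qualify as breakpoints and more chunks come out, which matches the behavior described for the parameter above.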

Example

from beekeeper.core.text_chunkers import SemanticChunker
from beekeeper.embeddings.huggingface import HuggingFaceEmbedding

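# Embedding model that drives the semantic splitting
embedding = HuggingFaceEmbedding()
# buffer_size, breakpoint_threshold_amount, and device keep their defaults
text_chunker = SemanticChunker(embed_model=embedding)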

from_documents(documents)

Split documents into chunks.

Parameters: documents (List[Document]) – List of Document objects to split.
Returns: List of chunked Document objects.
Return type: List[Document]
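Because from_documents returns one flat list, it can be useful to see how many chunks each source document produced. The helper below is a hypothetical sketch, not part of the library. It relies only on the documented call shape (a list in, a list out) and processes one document at a time so the counts can be attributed to their source.

```python
def chunks_per_document(chunker, documents):
    """Return the number of chunks produced for each input document.

    `chunker` is any object exposing from_documents(list) -> list,
    such as a SemanticChunker instance. The helper name is illustrative.
    """
    # Calling from_documents once per document keeps the counts separate.
    return [len(chunker.from_documents([doc])) for doc in documents]
```

With a configured chunker, chunks_per_document(text_chunker, documents) would return a list of integers in the same order as documents.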

from_text(text)

Split text into chunks.

Parameters: text (str) – Input text to split.
Returns: List of text chunks.
Return type: List[str]
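When tuning breakpoint_threshold_amount, a quick look at the size and opening words of each chunk often helps. The following is a small sketch of such an inspection helper. The name preview_chunks and the width parameter are hypothetical, and the helper assumes only that from_text returns a list of strings, as documented above.

```python
def preview_chunks(chunker, text, width=60):
    """Return one summary line per chunk: index, length, and a short preview.

    `chunker` is any object exposing from_text(str) -> list of str,
    such as a SemanticChunker instance.
    """
    lines = []
    for i, chunk in enumerate(chunker.from_text(text)):
        # Truncate long chunks so each summary stays on one line.
        preview = chunk if len(chunk) <= width else chunk[: width - 3] + "..."
        lines.append(f"[{i}] {len(chunk)} chars: {preview}")
    return lines
```

A typical call would be print("\n".join(preview_chunks(text_chunker, text))), repeated for a few threshold values to compare the resulting splits.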