Semantic Chunker

class SemanticChunker

Text chunker that splits text at semantic boundaries, using embeddings from the supplied model to detect where the content shifts between groups of sentences.

Credit to Greg Kamradt’s notebook: 5 Levels Of Text Splitting.

Parameters:
  • embed_model (BaseEmbedding) – Embedding model used for semantic chunking.

  • buffer_size (int, optional) – Number of sentences to group together. Default is 1.

  • breakpoint_threshold_amount (int, optional) – Threshold percentage for detecting breakpoints between groups of sentences. A smaller value produces more chunks. Default is 95.

  • device (str, optional) – Device to use for processing. Currently supports “cpu” and “cuda”. Default is “cpu”.
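
To make the roles of buffer_size and breakpoint_threshold_amount more concrete, the following is a minimal, dependency-free sketch of the percentile-based splitting idea from Greg Kamradt's notebook. It is not the beekeeper implementation. The names sketch_semantic_split, toy_embed, cosine_distance and percentile are hypothetical helpers introduced only for illustration, and the sketch assumes that buffer_size counts neighbouring sentences on each side and that the threshold is applied as a percentile of the cosine distances between adjacent sentence groups.

```python
import math
from typing import Callable, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: List[float], pct: float) -> float:
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def sketch_semantic_split(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],
    buffer_size: int = 1,
    breakpoint_threshold_amount: float = 95,
) -> List[str]:
    """Hypothetical sketch: split where adjacent groups are unusually far apart."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Assumption: each sentence is embedded together with buffer_size
    # neighbours on each side, which smooths out single-sentence noise.
    groups = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(group) for group in groups]

    # Distance between each group and the next one.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]

    # Assumption: a breakpoint is any distance above the chosen percentile.
    threshold = percentile(distances, breakpoint_threshold_amount)

    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start: i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: counts two topic keywords."""
    lowered = text.lower()
    return [float(lowered.count("cat")), float(lowered.count("stock"))]


sentences = [
    "The cat sleeps.",
    "The cat purrs.",
    "The cat eats.",
    "Stock prices rose.",
    "Stock markets closed.",
]
chunks = sketch_semantic_split(sentences, toy_embed)
for chunk in chunks:
    print(chunk)
```

With the toy embedding above, the largest distance falls between the third and fourth sentences, so the sketch yields one chunk about the cat and one about stocks. A real embedding model produces noisier distances, which is why the threshold is expressed relative to the distribution of distances rather than as a fixed value.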

Example

from beekeeper.core.text_chunkers import SemanticChunker
from beekeeper.embeddings.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)

from_documents(documents)

Split documents into chunks.

Parameters:

documents (List[Document]) – List of Document objects to split.

Returns:

List of chunked Document objects.

Return type:

List[Document]

from_text(text)

Split text into chunks.

Parameters:

text (str) – Input text to split.

Returns:

List of text chunks.

Return type:

List[str]