Semantic
SemanticChunker #
Bases: `BaseTextChunker`

Python class that splits text into chunks based on semantic similarity between sentences.

Credit to Greg Kamradt's notebook: 5 Levels Of Text Splitting.
Attributes:

| Name | Type | Description |
|---|---|---|
| `embed_model` | `BaseEmbedding` | Embedding model used for semantic chunking. |
| `buffer_size` | `int` | Number of sentences to group together when evaluating semantic similarity. Default is |
| `breakpoint_threshold_amount` | `int` | Threshold percentile for detecting breakpoints between groups of sentences. The smaller this number, the more chunks are generated. Default is |
| `device` | `str` | Device to use for processing. Currently supports `"cpu"` and `"cuda"`. Default is |
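The chunker's internals aren't shown here, but the standard semantic-chunking recipe from Kamradt's notebook explains what these attributes control: each sentence is grouped with its neighbors (`buffer_size`), consecutive groups are embedded and compared, and a chunk boundary is placed wherever the cosine distance exceeds a percentile cutoff (`breakpoint_threshold_amount`). A minimal sketch with a toy embedding, purely illustrative and not the library's actual implementation:

```python
import math

def embed(sentence):
    # Toy stand-in for a real embedding model: a 2-d "vector" of
    # crude word statistics. Purely illustrative.
    words = sentence.lower().split()
    return [len(words), sum(len(w) for w in words) / max(len(words), 1)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def group_sentences(sentences, buffer_size=1):
    # buffer_size controls how many neighboring sentences on each
    # side are concatenated with a sentence before it is embedded.
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[start:end]))
    return groups

def breakpoints(distances, threshold_amount=95):
    # A breakpoint is any gap whose distance exceeds the given
    # percentile of all gaps; lower thresholds yield more chunks.
    ranked = sorted(distances)
    idx = min(len(ranked) - 1, int(len(ranked) * threshold_amount / 100))
    cutoff = ranked[idx]
    return [i for i, d in enumerate(distances) if d > cutoff]
```

For example, with gap distances `[0.1, 0.1, 0.9]`, a threshold of 50 marks the third gap as a breakpoint, while a threshold of 95 marks none, which is why smaller thresholds produce more chunks.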
Example

```python
from beekeeper.core.text_chunker import SemanticChunker
from beekeeper.embedding.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)
```
chunk_text #

```python
chunk_text(text: str) -> list[str]
```

Split a single string of text into smaller chunks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text to split. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of text chunks. |
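End to end, `chunk_text` presumably splits the input into sentences, finds semantic breakpoints, and joins the sentences between breakpoints back into chunks. The two bookend steps can be sketched in plain Python; the function names here are illustrative, not the library's API:

```python
import re

def split_sentences(text):
    # Naive sentence splitter on terminal punctuation; the real
    # chunker likely uses a more robust tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def join_chunks(sentences, breakpoint_indices):
    # A breakpoint at index i ends a chunk after sentences[i];
    # the remainder after the last breakpoint forms the final chunk.
    chunks, start = [], 0
    for i in sorted(breakpoint_indices):
        chunks.append(" ".join(sentences[start:i + 1]))
        start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

With no breakpoints the whole text comes back as a single chunk, which matches the intuition that a uniformly on-topic passage should not be split.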
chunk_documents #

```python
chunk_documents(documents: list[Document]) -> list[Document]
```

Split a list of documents into smaller document chunks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `list[Document]` | List of documents to split. | required |

Returns:

| Type | Description |
|---|---|
| `list[Document]` | List of chunked document objects. |
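Conceptually, `chunk_documents` maps the text chunker over each document and wraps every resulting chunk in a new document. A minimal sketch of that shape, using a hypothetical `Document` stand-in (the field names are assumptions, not beekeeper's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Hypothetical stand-in for beekeeper's Document type.
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_documents(documents, chunk_text):
    # Applies a text-chunking function to each document's text and
    # wraps every chunk in a new Document that carries a copy of
    # the parent document's metadata.
    out = []
    for doc in documents:
        for chunk in chunk_text(doc.text):
            out.append(Document(text=chunk, metadata=dict(doc.metadata)))
    return out
```

Propagating the parent metadata onto each chunk is what lets downstream retrieval trace a chunk back to its source document.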