Semantic
SemanticChunker #
Bases: `BaseTextChunker`

Python class that splits text into chunks based on semantic similarity between sentences.

Credit to Greg Kamradt's notebook: 5 Levels Of Text Splitting.
Attributes:

| Name | Type | Description |
|---|---|---|
| `embed_model` | `BaseEmbedding` | Embedding model used for semantic chunking. |
| `buffer_size` | `int` | Number of sentences to group together when evaluating semantic similarity. Default is |
| `breakpoint_threshold_amount` | `int` | Threshold percentile for detecting breakpoints between groups of sentences. The smaller this number, the more chunks are generated. Default is |
| `device` | `str` | Device to use for processing. Currently supports `"cpu"` and `"cuda"`. Default is |
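The chunker's internals aren't shown here, but the standard semantic-chunking recipe from Kamradt's notebook explains what these attributes control: each sentence is grouped with its neighbors (`buffer_size`), consecutive groups are embedded and compared, and a chunk boundary is placed wherever the cosine distance exceeds a percentile cutoff (`breakpoint_threshold_amount`). A minimal sketch with a toy embedding, purely illustrative and not the library's actual implementation:

```python
import math

def embed(sentence):
    # Toy stand-in for a real embedding model: a 2-d "vector" of
    # crude word statistics. Purely illustrative.
    words = sentence.lower().split()
    return [len(words), sum(len(w) for w in words) / max(len(words), 1)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def group_sentences(sentences, buffer_size=1):
    # buffer_size controls how many neighboring sentences on each
    # side are concatenated with a sentence before it is embedded.
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[start:end]))
    return groups

def breakpoints(distances, threshold_amount=95):
    # A breakpoint is any gap whose distance exceeds the given
    # percentile of all gaps; lower thresholds yield more chunks.
    ranked = sorted(distances)
    idx = min(len(ranked) - 1, int(len(ranked) * threshold_amount / 100))
    cutoff = ranked[idx]
    return [i for i, d in enumerate(distances) if d > cutoff]
```

For example, with gap distances `[0.1, 0.1, 0.9]`, a threshold of 50 marks the third gap as a breakpoint, while a threshold of 95 marks none, which is why smaller thresholds produce more chunks.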
Example

```python
from beekeeper.core.text_chunker import SemanticChunker
from beekeeper.embedding.huggingface import HuggingFaceEmbedding

embedding = HuggingFaceEmbedding()
text_chunker = SemanticChunker(embed_model=embedding)
```
chunk_text #

```python
chunk_text(text: str) -> list[str]
```

Split a single string of text into smaller chunks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text to split. | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of text chunks. |
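End to end, `chunk_text` presumably splits the input into sentences, finds semantic breakpoints, and joins the sentences between breakpoints back into chunks. The two bookend steps can be sketched in plain Python; the function names here are illustrative, not the library's API:

```python
import re

def split_sentences(text):
    # Naive sentence splitter on terminal punctuation; the real
    # chunker likely uses a more robust tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def join_chunks(sentences, breakpoint_indices):
    # A breakpoint at index i ends a chunk after sentences[i];
    # the remainder after the last breakpoint forms the final chunk.
    chunks, start = [], 0
    for i in sorted(breakpoint_indices):
        chunks.append(" ".join(sentences[start:i + 1]))
        start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks
```

With no breakpoints the whole text comes back as a single chunk, which matches the intuition that a uniformly on-topic passage should not be split.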
chunk_documents #

```python
chunk_documents(documents: list[Document]) -> list[Document]
```

Split a list of documents into smaller document chunks.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `documents` | `list[Document]` | List of documents to split. | required |

Returns:

| Type | Description |
|---|---|
| `list[Document]` | List of chunked document objects. |
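Conceptually, `chunk_documents` maps the text chunker over each document and wraps every resulting chunk in a new document. A minimal sketch of that shape, using a hypothetical `Document` stand-in (the field names are assumptions, not beekeeper's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Hypothetical stand-in for beekeeper's Document type.
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_documents(documents, chunk_text):
    # Applies a text-chunking function to each document's text and
    # wraps every chunk in a new Document that carries a copy of
    # the parent document's metadata.
    out = []
    for doc in documents:
        for chunk in chunk_text(doc.text):
            out.append(Document(text=chunk, metadata=dict(doc.metadata)))
    return out
```

Propagating the parent metadata onto each chunk is what lets downstream retrieval trace a chunk back to its source document.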