# Documents

Sikka Agent's Document module provides a comprehensive system for handling, processing, and chunking documents from various sources. This module is the foundation for knowledge bases and retrieval systems.

## Overview

The Document module enables:
- Representing text content with associated metadata
- Reading documents from various sources (PDFs, websites, CSV files, etc.)
- Chunking documents into smaller pieces for effective processing
- Embedding documents for semantic search
- Standardizing document handling across the system
## Core Components

### Document

The foundational class for representing a document in Sikka Agent.

#### Parameters
Parameter | Type | Description | Default
---|---|---|---
`content` | `str` | The text content of the document | Required
`id` | `str` | Unique identifier for the document | `None`
`name` | `str` | Name or title of the document | `None`
`meta_data` | `Dict[str, Any]` | Additional metadata about the document | `{}`
`embedder` | `Embedder` | Embedder to use for creating vector embeddings | `None`
`embedding` | `List[float]` | Vector embedding of the document content | `None`
`usage` | `Dict[str, Any]` | Usage information from the embedding process | `None`
`reranking_score` | `float` | Score from the reranking process | `None`
#### Methods

Method | Description
---|---
`embed(embedder)` | Creates a vector embedding for the document content
`to_dict()` | Converts the document to a dictionary representation
`from_dict(document)` | Creates a `Document` object from a dictionary
`from_json(document)` | Creates a `Document` object from a JSON string
#### Code Example

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create a document
document = Document(
    content="Artificial intelligence is the simulation of human intelligence by machines.",
    id="doc_1",
    name="AI Definition",
    meta_data={"source": "textbook", "page": 42}
)

# Create an embedder
embedder = SentenceTransformerEmbedder(id="sentence-transformers/all-MiniLM-L6-v2")

# Embed the document
document.embed(embedder)

# Access the embedding
print(f"Embedding dimension: {len(document.embedding)}")

# Convert to dictionary
doc_dict = document.to_dict()
```
## Document Chunking

Sikka Agent provides several strategies for chunking documents into smaller pieces for more effective processing and retrieval.

### ChunkingStrategy

The base class for all chunking strategies.

#### Methods
Method | Description
---|---
`chunk(document)` | Splits a document into smaller chunks
`clean_text(text)` | Cleans text by normalizing whitespace
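To illustrate the contract these two methods imply, here is a minimal stand-in strategy in plain Python. This is a hypothetical sketch, not the Sikka Agent implementation: the class name, the word-based size limit, and the use of plain strings instead of `Document` objects are all assumptions for illustration.

```python
import re


class WhitespaceChunking:
    """Hypothetical stand-in for a ChunkingStrategy subclass."""

    def clean_text(self, text: str) -> str:
        # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
        return re.sub(r"\s+", " ", text).strip()

    def chunk(self, content: str, max_words: int = 50) -> list[str]:
        # Split cleaned text into chunks of at most max_words words each
        words = self.clean_text(content).split(" ")
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]


strategy = WhitespaceChunking()
chunks = strategy.chunk("one two  three\nfour five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Real strategies accept and return `Document` objects (preserving metadata on each chunk), but the clean-then-split flow is the essential shape.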
### FixedSizeChunking

Splits documents into chunks of a fixed size with optional overlap.

#### Parameters

Parameter | Type | Description | Default
---|---|---|---
`chunk_size` | `int` | Maximum size of each chunk in characters | `5000`
`overlap` | `int` | Number of characters to overlap between chunks | `0`
#### Code Example

```python
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking

# Create a document
document = Document(
    content="This is a long document that needs to be split into smaller chunks for processing...",
    name="Long Document"
)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=1000, overlap=100)

# Split document into chunks
chunks = chunking.chunk(document)

print(f"Original document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} size: {len(chunk.content)} characters")
```
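The overlap arithmetic is worth seeing concretely: each window starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters. A plain-Python sketch of that stepping (not the library's implementation, which operates on `Document` objects):

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so consecutive chunks share `overlap` characters.
    # Assumes overlap < chunk_size, otherwise the window never advances.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"  # 10 characters
chunks = fixed_size_chunks(text, chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Note how `'abcd'` and `'defg'` share the one-character overlap `'d'`; a production implementation would also decide what to do with the short trailing remainder.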
### Other Chunking Strategies

Sikka Agent provides additional chunking strategies:

Strategy | Description
---|---
`AgenticChunking` | Uses LLMs to intelligently chunk documents
`DocumentChunking` | Chunks based on document structure
`RecursiveChunking` | Recursively splits documents into smaller pieces
`SemanticChunking` | Chunks based on semantic meaning
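To illustrate the idea behind recursive splitting (a sketch of the general technique, not Sikka Agent's `RecursiveChunking` code): split on the coarsest separator first, then recurse with finer separators into any piece that is still too large.

```python
def recursive_chunk(text: str, separators: list[str], max_size: int) -> list[str]:
    # Base case: the piece fits, or no finer separators remain to try
    if len(text) <= max_size or not separators:
        return [text]
    coarsest, *finer = separators
    chunks = []
    for piece in text.split(coarsest):
        # Recurse with the remaining (finer) separators
        chunks.extend(recursive_chunk(piece, finer, max_size))
    return chunks


doc = "Intro paragraph.\n\nSecond paragraph, sentence one. Sentence two."
chunks = recursive_chunk(doc, separators=["\n\n", ". "], max_size=30)
print(chunks)
```

Ordering separators from coarse (paragraph breaks) to fine (sentence breaks) keeps related text together whenever it fits under the size limit.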
## Document Readers

Sikka Agent includes readers for various document types that convert source materials into Document objects.

### Reader

The base class for all document readers.

#### Parameters

Parameter | Type | Description | Default
---|---|---|---
`chunk` | `bool` | Whether to chunk documents after reading | `True`
`chunk_size` | `int` | Size of chunks in characters | `3000`
`separators` | `List[str]` | Text separators for chunking | `["\n", "\n\n", ...]`
`chunking_strategy` | `ChunkingStrategy` | Strategy for chunking documents | `FixedSizeChunking()`
#### Methods

Method | Description
---|---
`read(obj)` | Reads a document from the source
`async_read(obj)` | Asynchronously reads a document
`chunk_document(document)` | Chunks a document using the chunking strategy
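The read-then-chunk flow these methods describe can be sketched in plain Python. This is a hypothetical illustration only: the class name is invented, and documents are modeled as dicts rather than the library's `Document` objects.

```python
class SimpleTextReader:
    """Hypothetical reader: reads a string source and optionally chunks it."""

    def __init__(self, chunk: bool = True, chunk_size: int = 3000):
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, text: str, name: str = "untitled") -> list[dict]:
        # Documents are plain dicts here; the real Reader returns Document objects
        document = {"name": name, "content": text}
        if self.chunk:
            return self.chunk_document(document)
        return [document]

    def chunk_document(self, document: dict) -> list[dict]:
        # Split into fixed-size pieces, naming each chunk by its start offset
        content = document["content"]
        return [
            {"name": f"{document['name']}_{i}",
             "content": content[i:i + self.chunk_size]}
            for i in range(0, len(content), self.chunk_size)
        ]


reader = SimpleTextReader(chunk=True, chunk_size=10)
docs = reader.read("x" * 25, name="sample")
print(len(docs))  # 3
```

Concrete readers differ only in how `read` extracts text from their source format (PDF pages, CSV rows, web pages); the chunking step is shared via the base class.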
### PDFReader

Reads documents from PDF files.

#### Methods

Method | Description
---|---
`read(pdf)` | Reads a PDF file and returns a list of `Document` objects
`async_read(pdf)` | Asynchronously reads a PDF file
#### Code Example

```python
from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path

# Create PDF reader
reader = PDFReader(chunk=True, chunk_size=2000)

# Read PDF file
pdf_path = Path("path/to/document.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF")
```
### PDFImageReader

Extends PDFReader to also extract and process text from images in PDFs using OCR.

#### Code Example

```python
from sikkaagent.document.reader.pdf_reader import PDFImageReader
from pathlib import Path

# Create PDF image reader
reader = PDFImageReader(chunk=True, chunk_size=2000)

# Read PDF file with images
pdf_path = Path("path/to/document_with_images.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF with images")
```
### Other Readers

Sikka Agent provides readers for various document types:

Reader | Description
---|---
`ArxivReader` | Reads papers from arXiv
`CSVReader` | Reads CSV files
`CSVUrlReader` | Reads CSV files from URLs
`DocxReader` | Reads Word documents
`JSONReader` | Reads JSON files
`PDFUrlReader` | Reads PDF files from URLs
`PDFUrlImageReader` | Reads PDF files with images from URLs
`TextReader` | Reads plain text files
`URLReader` | Reads content from URLs
`WebsiteReader` | Reads and crawls websites
`YouTubeReader` | Reads transcripts from YouTube videos
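When ingesting mixed sources, a common pattern is to select a reader by file extension. The sketch below maps suffixes to reader names from the table above; the mapping itself and the `reader_for` helper are illustrative assumptions, not part of the library.

```python
from pathlib import Path

# Hypothetical extension-to-reader mapping; extend it with the readers you use
READER_BY_SUFFIX = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".docx": "DocxReader",
    ".json": "JSONReader",
    ".txt": "TextReader",
}


def reader_for(path: str) -> str:
    # Normalize the suffix so "report.PDF" and "report.pdf" behave the same
    suffix = Path(path).suffix.lower()
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No reader registered for {suffix!r} files")


print(reader_for("reports/q3.pdf"))  # PDFReader
```

In practice you would store reader instances (or factories) as the dictionary values rather than names, and call `read()` on the selected one.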
## Integration with Other Modules

The Document module integrates with other Sikka Agent components:

### With Knowledge Bases

```python
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="my_knowledge", embedder=embedder)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=500, overlap=50)

# Create knowledge base with custom chunking
knowledge = AgentKnowledge(
    vector_db=vector_db,
    chunking_strategy=chunking
)

# Create and load documents
documents = [
    Document(content="Document 1 content..."),
    Document(content="Document 2 content...")
]
knowledge.load_documents(documents)
```
### With Retrievers

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.vector_retriever import HybridRetriever
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create storage
storage = Qdrant(collection_name="retriever_docs", embedder=embedder)

# Create retriever
retriever = HybridRetriever(
    collection_name="my_collection",
    embedding_model=embedder,
    storage=storage
)

# Create document
document = Document(
    content="This is a sample document for retrieval testing.",
    meta_data={"source": "test"}
)

# Add document to retriever
retriever.add_documents([document])

# Search for similar documents
results = retriever.search("sample retrieval", top_k=5)
```
## Advanced Usage

### Asynchronous Document Processing

Use asynchronous methods for better performance with large document sets:

```python
import asyncio
from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path


async def process_pdfs(pdf_paths):
    reader = PDFReader(chunk=True)
    # Process PDFs in parallel
    tasks = [reader.async_read(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks)
    # Flatten the per-file lists into one list of documents
    all_documents = [doc for docs in results for doc in docs]
    return all_documents


# Run the async function
pdf_paths = [Path("doc1.pdf"), Path("doc2.pdf"), Path("doc3.pdf")]
documents = asyncio.run(process_pdfs(pdf_paths))
```
### Custom Document Embedding

Embed documents with custom embedding models:

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.openai import OpenAIEmbedder

# Create document
document = Document(
    content="This is a document that needs to be embedded.",
    name="Sample Document"
)

# Create OpenAI embedder
embedder = OpenAIEmbedder(
    id="text-embedding-3-small",
    dimensions=1536,
    api_key="your-api-key"
)

# Embed the document
document.embed(embedder)

# Access embedding and usage information
print(f"Embedding dimension: {len(document.embedding)}")
print(f"Token usage: {document.usage}")
```
## Best Practices

- Document Size: Keep individual documents reasonably sized (under 100KB of text) for efficient processing
- Chunking Strategy: Choose the appropriate chunking strategy based on your content:
  - Use `FixedSizeChunking` for general-purpose text
  - Use `SemanticChunking` for preserving meaning in complex documents
  - Use `DocumentChunking` for structured documents with clear sections
- Metadata: Include relevant metadata with documents to enable filtering and organization:
  - Source information (URL, file path, etc.)
  - Creation/modification dates
  - Authors or owners
  - Categories or tags
- Embedding Models: Select embedding models appropriate for your domain and language
- Asynchronous Processing: Use async methods when processing large numbers of documents
- Error Handling: Implement robust error handling for document processing, especially for external sources
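The metadata recommendation above pays off at query time. A minimal sketch of tag-based filtering over plain dicts (a vector store or knowledge base would normally do this for you; the field names here are examples only):

```python
# Documents modeled as dicts with a meta_data field, mirroring Document's shape
documents = [
    {"content": "Q3 revenue summary",
     "meta_data": {"source": "report.pdf", "tags": ["finance"]}},
    {"content": "Onboarding guide",
     "meta_data": {"source": "wiki", "tags": ["hr"]}},
]


def filter_by_tag(docs: list[dict], tag: str) -> list[dict]:
    # Keep only documents whose metadata tags include the requested tag
    return [d for d in docs if tag in d["meta_data"].get("tags", [])]


finance_docs = filter_by_tag(documents, "finance")
print(len(finance_docs))  # 1
```

Attaching tags and source fields when documents are created costs little, and makes this kind of scoped retrieval possible later without re-ingesting anything.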