# Documents

Sikka Agent's Document module provides a comprehensive system for handling, processing, and chunking documents from various sources. This module is the foundation for knowledge bases and retrieval systems.

## Overview

The Document module enables:
- Representing text content with associated metadata
- Reading documents from various sources (PDFs, websites, CSV files, etc.)
- Chunking documents into smaller pieces for effective processing
- Embedding documents for semantic search
- Standardizing document handling across the system
## Core Components

### Document

The foundational class for representing a document in Sikka Agent.

#### Parameters
Parameter | Type | Description | Default
---|---|---|---
`content` | `str` | The text content of the document | Required
`id` | `str` | Unique identifier for the document | `None`
`name` | `str` | Name or title of the document | `None`
`meta_data` | `Dict[str, Any]` | Additional metadata about the document | `{}`
`embedder` | `Embedder` | Embedder to use for creating vector embeddings | `None`
`embedding` | `List[float]` | Vector embedding of the document content | `None`
`usage` | `Dict[str, Any]` | Usage information from the embedding process | `None`
`reranking_score` | `float` | Score from the reranking process | `None`
#### Methods

Method | Description
---|---
`embed(embedder)` | Creates a vector embedding for the document content
`to_dict()` | Converts the document to a dictionary representation
`from_dict(document)` | Creates a `Document` object from a dictionary
`from_json(document)` | Creates a `Document` object from a JSON string
#### Code Example

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create a document
document = Document(
    content="Artificial intelligence is the simulation of human intelligence by machines.",
    id="doc_1",
    name="AI Definition",
    meta_data={"source": "textbook", "page": 42}
)

# Create an embedder
embedder = SentenceTransformerEmbedder(id="sentence-transformers/all-MiniLM-L6-v2")

# Embed the document
document.embed(embedder)

# Access the embedding
print(f"Embedding dimension: {len(document.embedding)}")

# Convert to dictionary
doc_dict = document.to_dict()
```
## Document Chunking

Sikka Agent provides several strategies for chunking documents into smaller pieces for more effective processing and retrieval.

### ChunkingStrategy

The base class for all chunking strategies.

#### Methods
Method | Description
---|---
`chunk(document)` | Splits a document into smaller chunks
`clean_text(text)` | Cleans text by normalizing whitespace
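To illustrate the contract these two methods imply, here is a minimal stand-in strategy in plain Python. This is a hypothetical sketch, not the Sikka Agent implementation: the class name, the word-based size limit, and the use of plain strings instead of `Document` objects are all assumptions for illustration.

```python
import re


class WhitespaceChunking:
    """Hypothetical stand-in for a ChunkingStrategy subclass."""

    def clean_text(self, text: str) -> str:
        # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces
        return re.sub(r"\s+", " ", text).strip()

    def chunk(self, content: str, max_words: int = 50) -> list[str]:
        # Split cleaned text into chunks of at most max_words words each
        words = self.clean_text(content).split(" ")
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]


strategy = WhitespaceChunking()
chunks = strategy.chunk("one two  three\nfour five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Real strategies accept and return `Document` objects (preserving metadata on each chunk), but the clean-then-split flow is the essential shape.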
### FixedSizeChunking

Splits documents into chunks of a fixed size with optional overlap.

#### Parameters

Parameter | Type | Description | Default
---|---|---|---
`chunk_size` | `int` | Maximum size of each chunk in characters | `5000`
`overlap` | `int` | Number of characters to overlap between chunks | `0`
#### Code Example

```python
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking

# Create a document
document = Document(
    content="This is a long document that needs to be split into smaller chunks for processing...",
    name="Long Document"
)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=1000, overlap=100)

# Split document into chunks
chunks = chunking.chunk(document)

print(f"Original document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} size: {len(chunk.content)} characters")
```
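The overlap arithmetic is worth seeing concretely: each window starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters. A plain-Python sketch of that stepping (not the library's implementation, which operates on `Document` objects):

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous
    # one, so consecutive chunks share `overlap` characters.
    # Assumes overlap < chunk_size, otherwise the window never advances.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"  # 10 characters
chunks = fixed_size_chunks(text, chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Note how `'abcd'` and `'defg'` share the one-character overlap `'d'`; a production implementation would also decide what to do with the short trailing remainder.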
### Other Chunking Strategies

Sikka Agent provides additional chunking strategies:

Strategy | Description
---|---
`AgenticChunking` | Uses LLMs to intelligently chunk documents
`DocumentChunking` | Chunks based on document structure
`RecursiveChunking` | Recursively splits documents into smaller pieces
`SemanticChunking` | Chunks based on semantic meaning
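To illustrate the idea behind recursive splitting (a sketch of the general technique, not Sikka Agent's `RecursiveChunking` code): split on the coarsest separator first, then recurse with finer separators into any piece that is still too large.

```python
def recursive_chunk(text: str, separators: list[str], max_size: int) -> list[str]:
    # Base case: the piece fits, or no finer separators remain to try
    if len(text) <= max_size or not separators:
        return [text]
    coarsest, *finer = separators
    chunks = []
    for piece in text.split(coarsest):
        # Recurse with the remaining (finer) separators
        chunks.extend(recursive_chunk(piece, finer, max_size))
    return chunks


doc = "Intro paragraph.\n\nSecond paragraph, sentence one. Sentence two."
chunks = recursive_chunk(doc, separators=["\n\n", ". "], max_size=30)
print(chunks)
```

Ordering separators from coarse (paragraph breaks) to fine (sentence breaks) keeps related text together whenever it fits under the size limit.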
## Document Readers

Sikka Agent includes readers for various document types that convert source materials into Document objects.

### Reader

The base class for all document readers.

#### Parameters

Parameter | Type | Description | Default
---|---|---|---
`chunk` | `bool` | Whether to chunk documents after reading | `True`
`chunk_size` | `int` | Size of chunks in characters | `3000`
`separators` | `List[str]` | Text separators for chunking | `["\n", "\n\n", ...]`
`chunking_strategy` | `ChunkingStrategy` | Strategy for chunking documents | `FixedSizeChunking()`
#### Methods

Method | Description
---|---
`read(obj)` | Reads a document from the source
`async_read(obj)` | Asynchronously reads a document
`chunk_document(document)` | Chunks a document using the chunking strategy
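The read-then-chunk flow these methods describe can be sketched in plain Python. This is a hypothetical illustration only: the class name is invented, and documents are modeled as dicts rather than the library's `Document` objects.

```python
class SimpleTextReader:
    """Hypothetical reader: reads a string source and optionally chunks it."""

    def __init__(self, chunk: bool = True, chunk_size: int = 3000):
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, text: str, name: str = "untitled") -> list[dict]:
        # Documents are plain dicts here; the real Reader returns Document objects
        document = {"name": name, "content": text}
        if self.chunk:
            return self.chunk_document(document)
        return [document]

    def chunk_document(self, document: dict) -> list[dict]:
        # Split into fixed-size pieces, naming each chunk by its start offset
        content = document["content"]
        return [
            {"name": f"{document['name']}_{i}",
             "content": content[i:i + self.chunk_size]}
            for i in range(0, len(content), self.chunk_size)
        ]


reader = SimpleTextReader(chunk=True, chunk_size=10)
docs = reader.read("x" * 25, name="sample")
print(len(docs))  # 3
```

Concrete readers differ only in how `read` extracts text from their source format (PDF pages, CSV rows, web pages); the chunking step is shared via the base class.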
### PDFReader

Reads documents from PDF files.

#### Methods

Method | Description
---|---
`read(pdf)` | Reads a PDF file and returns a list of `Document` objects
`async_read(pdf)` | Asynchronously reads a PDF file
#### Code Example

```python
from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path

# Create PDF reader
reader = PDFReader(chunk=True, chunk_size=2000)

# Read PDF file
pdf_path = Path("path/to/document.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF")
```
### PDFImageReader

Extends PDFReader to also extract and process text from images in PDFs using OCR.

#### Code Example

```python
from sikkaagent.document.reader.pdf_reader import PDFImageReader
from pathlib import Path

# Create PDF image reader
reader = PDFImageReader(chunk=True, chunk_size=2000)

# Read PDF file with images
pdf_path = Path("path/to/document_with_images.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF with images")
```
### Other Readers

Sikka Agent provides readers for various document types:

Reader | Description
---|---
`ArxivReader` | Reads papers from arXiv
`CSVReader` | Reads CSV files
`CSVUrlReader` | Reads CSV files from URLs
`DocxReader` | Reads Word documents
`JSONReader` | Reads JSON files
`PDFUrlReader` | Reads PDF files from URLs
`PDFUrlImageReader` | Reads PDF files with images from URLs
`TextReader` | Reads plain text files
`URLReader` | Reads content from URLs
`WebsiteReader` | Reads and crawls websites
`YouTubeReader` | Reads transcripts from YouTube videos
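When ingesting mixed sources, a common pattern is to select a reader by file extension. The sketch below maps suffixes to reader names from the table above; the mapping itself and the `reader_for` helper are illustrative assumptions, not part of the library.

```python
from pathlib import Path

# Hypothetical extension-to-reader mapping; extend it with the readers you use
READER_BY_SUFFIX = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
    ".docx": "DocxReader",
    ".json": "JSONReader",
    ".txt": "TextReader",
}


def reader_for(path: str) -> str:
    # Normalize the suffix so "report.PDF" and "report.pdf" behave the same
    suffix = Path(path).suffix.lower()
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No reader registered for {suffix!r} files")


print(reader_for("reports/q3.pdf"))  # PDFReader
```

In practice you would store reader instances (or factories) as the dictionary values rather than names, and call `read()` on the selected one.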
## Integration with Other Modules

The Document module integrates with other Sikka Agent components:

### With Knowledge Bases

```python
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="my_knowledge", embedder=embedder)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=500, overlap=50)

# Create knowledge base with custom chunking
knowledge = AgentKnowledge(
    vector_db=vector_db,
    chunking_strategy=chunking
)

# Create and load documents
documents = [
    Document(content="Document 1 content..."),
    Document(content="Document 2 content...")
]
knowledge.load_documents(documents)
```
### With Retrievers

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.vector_retriever import HybridRetriever
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create storage
storage = Qdrant(collection_name="retriever_docs", embedder=embedder)

# Create retriever
retriever = HybridRetriever(
    collection_name="my_collection",
    embedding_model=embedder,
    storage=storage
)

# Create document
document = Document(
    content="This is a sample document for retrieval testing.",
    meta_data={"source": "test"}
)

# Add document to retriever
retriever.add_documents([document])

# Search for similar documents
results = retriever.search("sample retrieval", top_k=5)
```
## Advanced Usage

### Asynchronous Document Processing

Use asynchronous methods for better performance with large document sets:

```python
import asyncio
from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path


async def process_pdfs(pdf_paths):
    reader = PDFReader(chunk=True)
    # Process PDFs in parallel
    tasks = [reader.async_read(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks)
    # Flatten the per-file lists into one list of documents
    all_documents = [doc for docs in results for doc in docs]
    return all_documents


# Run the async function
pdf_paths = [Path("doc1.pdf"), Path("doc2.pdf"), Path("doc3.pdf")]
documents = asyncio.run(process_pdfs(pdf_paths))
```
### Custom Document Embedding

Embed documents with custom embedding models:

```python
from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.openai import OpenAIEmbedder

# Create document
document = Document(
    content="This is a document that needs to be embedded.",
    name="Sample Document"
)

# Create OpenAI embedder
embedder = OpenAIEmbedder(
    id="text-embedding-3-small",
    dimensions=1536,
    api_key="your-api-key"
)

# Embed the document
document.embed(embedder)

# Access embedding and usage information
print(f"Embedding dimension: {len(document.embedding)}")
print(f"Token usage: {document.usage}")
```
## Best Practices

- Document Size: Keep individual documents reasonably sized (under 100KB of text) for efficient processing
- Chunking Strategy: Choose the appropriate chunking strategy based on your content:
  - Use `FixedSizeChunking` for general-purpose text
  - Use `SemanticChunking` for preserving meaning in complex documents
  - Use `DocumentChunking` for structured documents with clear sections
- Metadata: Include relevant metadata with documents to enable filtering and organization:
  - Source information (URL, file path, etc.)
  - Creation/modification dates
  - Authors or owners
  - Categories or tags
- Embedding Models: Select embedding models appropriate for your domain and language
- Asynchronous Processing: Use async methods when processing large numbers of documents
- Error Handling: Implement robust error handling for document processing, especially for external sources
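The metadata recommendation above pays off at query time. A minimal sketch of tag-based filtering over plain dicts (a vector store or knowledge base would normally do this for you; the field names here are examples only):

```python
# Documents modeled as dicts with a meta_data field, mirroring Document's shape
documents = [
    {"content": "Q3 revenue summary",
     "meta_data": {"source": "report.pdf", "tags": ["finance"]}},
    {"content": "Onboarding guide",
     "meta_data": {"source": "wiki", "tags": ["hr"]}},
]


def filter_by_tag(docs: list[dict], tag: str) -> list[dict]:
    # Keep only documents whose metadata tags include the requested tag
    return [d for d in docs if tag in d["meta_data"].get("tags", [])]


finance_docs = filter_by_tag(documents, "finance")
print(len(finance_docs))  # 1
```

Attaching tags and source fields when documents are created costs little, and makes this kind of scoped retrieval possible later without re-ingesting anything.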