Documents

Sikka Agent's Document module provides a comprehensive system for handling, processing, and chunking documents from various sources. This module is the foundation for knowledge bases and retrieval systems.

Overview

The Document module enables:

  • Representing text content with associated metadata
  • Reading documents from various sources (PDFs, websites, CSV files, etc.)
  • Chunking documents into smaller pieces for effective processing
  • Embedding documents for semantic search
  • Standardizing document handling across the system

Core Components

Document

The foundational class for representing a document in Sikka Agent.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| content | str | The text content of the document | Required |
| id | str | Unique identifier for the document | None |
| name | str | Name or title of the document | None |
| meta_data | Dict[str, Any] | Additional metadata about the document | {} |
| embedder | Embedder | Embedder to use for creating vector embeddings | None |
| embedding | List[float] | Vector embedding of the document content | None |
| usage | Dict[str, Any] | Usage information from the embedding process | None |
| reranking_score | float | Score from the reranking process | None |

Methods

| Method | Description |
| --- | --- |
| embed(embedder) | Creates a vector embedding for the document content |
| to_dict() | Converts the document to a dictionary representation |
| from_dict(document) | Creates a Document object from a dictionary |
| from_json(document) | Creates a Document object from a JSON string |

Code Example

from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create a document
document = Document(
    content="Artificial intelligence is the simulation of human intelligence by machines.",
    id="doc_1",
    name="AI Definition",
    meta_data={"source": "textbook", "page": 42}
)

# Create an embedder
embedder = SentenceTransformerEmbedder(id="sentence-transformers/all-MiniLM-L6-v2")

# Embed the document
document.embed(embedder)

# Access the embedding
print(f"Embedding dimension: {len(document.embedding)}")

# Convert to dictionary
doc_dict = document.to_dict()
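
The serialization methods round-trip a document through a dictionary or JSON. A minimal sketch, assuming from_dict and from_json are classmethods that accept the output of to_dict (and its JSON serialization, respectively):

import json

from sikkaagent.document import Document

document = Document(
    content="Artificial intelligence is the simulation of human intelligence by machines.",
    id="doc_1",
    name="AI Definition"
)

# Dictionary round-trip
restored = Document.from_dict(document.to_dict())

# JSON round-trip (assumes from_json parses the same keys that to_dict emits)
restored_from_json = Document.from_json(json.dumps(document.to_dict()))

print(restored.id, restored_from_json.name)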

Document Chunking

Sikka Agent provides several strategies for chunking documents into smaller pieces for more effective processing and retrieval.

ChunkingStrategy

The base class for all chunking strategies.

Methods

| Method | Description |
| --- | --- |
| chunk(document) | Splits a document into smaller chunks |
| clean_text(text) | Cleans text by normalizing whitespace |
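
Custom strategies can subclass ChunkingStrategy and implement chunk. A minimal sketch, assuming the base class lives in sikkaagent.document.chunking.strategy (the module path is an assumption mirroring the fixed module below) and that subclasses only need to implement chunk; the paragraph-splitting logic is purely illustrative:

from typing import List

from sikkaagent.document import Document
from sikkaagent.document.chunking.strategy import ChunkingStrategy  # assumed module path

class ParagraphChunking(ChunkingStrategy):
    """Hypothetical strategy: split a document on blank lines."""

    def chunk(self, document: Document) -> List[Document]:
        # Treat blank lines as paragraph boundaries
        paragraphs = [p.strip() for p in document.content.split("\n\n") if p.strip()]
        return [
            Document(content=p, name=document.name, meta_data=document.meta_data)
            for p in paragraphs
        ]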

FixedSizeChunking

Splits documents into chunks of a fixed size with optional overlap.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| chunk_size | int | Maximum size of each chunk in characters | 5000 |
| overlap | int | Number of characters to overlap between chunks | 0 |

Code Example

from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking

# Create a document
document = Document(
    content="This is a long document that needs to be split into smaller chunks for processing...",
    name="Long Document"
)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=1000, overlap=100)

# Split document into chunks
chunks = chunking.chunk(document)

print(f"Original document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} size: {len(chunk.content)} characters")

Other Chunking Strategies

Sikka Agent provides additional chunking strategies:

| Strategy | Description |
| --- | --- |
| AgenticChunking | Uses LLMs to intelligently chunk documents |
| DocumentChunking | Chunks based on document structure |
| RecursiveChunking | Recursively splits documents into smaller pieces |
| SemanticChunking | Chunks based on semantic meaning |
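
All strategies share the chunk(document) interface, so they are interchangeable. A sketch using RecursiveChunking, assuming a module path of sikkaagent.document.chunking.recursive (mirroring the fixed module above) and a no-argument constructor:

from sikkaagent.document import Document
from sikkaagent.document.chunking.recursive import RecursiveChunking  # assumed module path

document = Document(content="A long document with nested sections...", name="Long Document")

# Recursively splits the document, re-splitting chunks that are still too large
chunking = RecursiveChunking()
chunks = chunking.chunk(document)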

Document Readers

Sikka Agent includes readers for various document types that convert source materials into Document objects.

Reader

The base class for all document readers.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| chunk | bool | Whether to chunk documents after reading | True |
| chunk_size | int | Size of chunks in characters | 3000 |
| separators | List[str] | Text separators used for chunking | ["\n", "\n\n", ...] |
| chunking_strategy | ChunkingStrategy | Strategy for chunking documents | FixedSizeChunking() |

Methods

| Method | Description |
| --- | --- |
| read(obj) | Reads documents from the source |
| async_read(obj) | Asynchronously reads documents from the source |
| chunk_document(document) | Chunks a document using the configured chunking strategy |
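
Because chunking_strategy is a constructor parameter, any strategy can be swapped in when creating a concrete reader. A sketch using PDFReader, assuming it accepts the base Reader parameters:

from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.document.reader.pdf_reader import PDFReader

# Replace the default FixedSizeChunking() with a custom configuration
reader = PDFReader(
    chunk=True,
    chunking_strategy=FixedSizeChunking(chunk_size=1500, overlap=150)
)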

PDFReader

Reads documents from PDF files.

Methods

| Method | Description |
| --- | --- |
| read(pdf) | Reads a PDF file and returns a list of Document objects |
| async_read(pdf) | Asynchronously reads a PDF file |

Code Example

from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path

# Create PDF reader
reader = PDFReader(chunk=True, chunk_size=2000)

# Read PDF file
pdf_path = Path("path/to/document.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF")

PDFImageReader

Extends PDFReader to also extract and process text from images in PDFs using OCR.

Code Example

from sikkaagent.document.reader.pdf_reader import PDFImageReader
from pathlib import Path

# Create PDF image reader
reader = PDFImageReader(chunk=True, chunk_size=2000)

# Read PDF file with images
pdf_path = Path("path/to/document_with_images.pdf")
documents = reader.read(pdf_path)

print(f"Extracted {len(documents)} document chunks from PDF with images")

Other Readers

Sikka Agent provides readers for various document types:

| Reader | Description |
| --- | --- |
| ArxivReader | Reads papers from arXiv |
| CSVReader | Reads CSV files |
| CSVUrlReader | Reads CSV files from URLs |
| DocxReader | Reads Word documents |
| JSONReader | Reads JSON files |
| PDFUrlReader | Reads PDF files from URLs |
| PDFUrlImageReader | Reads PDF files with images from URLs |
| TextReader | Reads plain text files |
| URLReader | Reads content from URLs |
| WebsiteReader | Reads and crawls websites |
| YouTubeReader | Reads transcripts from YouTube videos |
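
These readers expose the same read/async_read interface as PDFReader. A sketch using WebsiteReader, assuming a module path of website_reader (mirroring the pdf_reader naming above) and that read accepts a URL string:

from sikkaagent.document.reader.website_reader import WebsiteReader  # assumed module path

# Crawl a site and convert its pages into Document objects
reader = WebsiteReader(chunk=True, chunk_size=2000)
documents = reader.read("https://example.com/docs")

print(f"Extracted {len(documents)} document chunks from the website")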

Integration with Other Modules

The Document module integrates with other Sikka Agent components:

With Knowledge Bases

from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="my_knowledge", embedder=embedder)

# Create chunking strategy
chunking = FixedSizeChunking(chunk_size=500, overlap=50)

# Create knowledge base with custom chunking
knowledge = AgentKnowledge(
    vector_db=vector_db,
    chunking_strategy=chunking
)

# Create and load documents
documents = [
    Document(content="Document 1 content..."),
    Document(content="Document 2 content...")
]
knowledge.load_documents(documents)

With Retrievers

from sikkaagent.document import Document
from sikkaagent.retrievers.vector_retriever import HybridRetriever
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create storage
storage = Qdrant(collection_name="retriever_docs", embedder=embedder)

# Create retriever
retriever = HybridRetriever(
    collection_name="my_collection",
    embedding_model=embedder,
    storage=storage
)

# Create document
document = Document(
    content="This is a sample document for retrieval testing.",
    meta_data={"source": "test"}
)

# Add document to retriever
retriever.add_documents([document])

# Search for similar documents
results = retriever.search("sample retrieval", top_k=5)

Advanced Usage

Asynchronous Document Processing

Use asynchronous methods for better performance with large document sets:

import asyncio
from sikkaagent.document.reader.pdf_reader import PDFReader
from pathlib import Path

async def process_pdfs(pdf_paths):
    reader = PDFReader(chunk=True)

    # Process PDFs in parallel
    tasks = [reader.async_read(path) for path in pdf_paths]
    results = await asyncio.gather(*tasks)

    # Flatten results
    all_documents = [doc for docs in results for doc in docs]
    return all_documents

# Run the async function
pdf_paths = [Path("doc1.pdf"), Path("doc2.pdf"), Path("doc3.pdf")]
documents = asyncio.run(process_pdfs(pdf_paths))

Custom Document Embedding

Embed documents with custom embedding models:

from sikkaagent.document import Document
from sikkaagent.retrievers.embedder.openai import OpenAIEmbedder

# Create document
document = Document(
    content="This is a document that needs to be embedded.",
    name="Sample Document"
)

# Create OpenAI embedder
embedder = OpenAIEmbedder(
    id="text-embedding-3-small",
    dimensions=1536,
    api_key="your-api-key"
)

# Embed the document
document.embed(embedder)

# Access embedding and usage information
print(f"Embedding dimension: {len(document.embedding)}")
print(f"Token usage: {document.usage}")

Best Practices

  • Document Size: Keep individual documents reasonably sized (under 100 KB of text) for efficient processing
  • Chunking Strategy: Choose the appropriate chunking strategy for your content:
      • Use FixedSizeChunking for general-purpose text
      • Use SemanticChunking to preserve meaning in complex documents
      • Use DocumentChunking for structured documents with clear sections
  • Metadata: Include relevant metadata with documents to enable filtering and organization:
      • Source information (URL, file path, etc.)
      • Creation/modification dates
      • Authors or owners
      • Categories or tags
  • Embedding Models: Select embedding models appropriate for your domain and language
  • Asynchronous Processing: Use async methods when processing large numbers of documents
  • Error Handling: Implement robust error handling for document processing, especially for external sources; see the sketch after this list
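
For the error-handling point above, a minimal sketch wrapping a reader call; the exception handling is illustrative, so narrow it to the errors your sources actually raise:

from pathlib import Path

from sikkaagent.document.reader.pdf_reader import PDFReader

reader = PDFReader(chunk=True)

documents = []
for path in [Path("doc1.pdf"), Path("missing.pdf")]:
    try:
        documents.extend(reader.read(path))
    except FileNotFoundError:
        print(f"Skipping missing file: {path}")
    except Exception as exc:
        # Replace with the specific exceptions your readers raise
        print(f"Failed to process {path}: {exc}")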