Knowledge

Sikka Agent's Knowledge module provides a flexible system for creating, managing, and querying knowledge bases from various data sources. This enables agents to access and reference external information during conversations.

Overview

The Knowledge module allows agents to ground their responses in factual information by:

  • Loading documents from various sources (PDFs, websites, CSV files, etc.)
  • Chunking documents into manageable pieces
  • Storing document embeddings in vector databases
  • Retrieving relevant information based on semantic similarity
  • Integrating retrieved information into agent responses

This capability is essential for implementing Retrieval Augmented Generation (RAG) patterns that enhance agent responses with specific, accurate information.
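
The pipeline above can be sketched end to end in plain Python. This is a toy illustration of the chunk–store–retrieve flow, not Sikka Agent's implementation: a word-overlap score stands in for real embeddings, and a plain list stands in for the vector database.

```python
# Toy RAG pipeline: chunk documents, store the chunks, retrieve by similarity.
# Illustrative only -- a real deployment uses a trained embedder and a vector
# database instead of the word-overlap score below.

def chunk(text, size=8, overlap=2):
    """Split text into overlapping word chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def similarity(query, chunk_text):
    """Toy relevance score: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / len(q) if q else 0.0

def search(store, query, num_documents=2):
    """Return the top-scoring chunks for a query."""
    ranked = sorted(store, key=lambda ch: similarity(query, ch), reverse=True)
    return ranked[:num_documents]

# "Load" two documents into the store as chunks.
store = []
for doc in [
    "Machine learning is a subset of AI focused on data and algorithms.",
    "Photosynthesis converts light into chemical energy in plants.",
]:
    store.extend(chunk(doc))

# Retrieve the most relevant chunk for a question.
results = search(store, "What is machine learning?", num_documents=1)
```

In the real module, the embedder and vector database make the similarity step semantic rather than lexical, but the overall flow is the same.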

Core Components

AgentKnowledge

The foundational class for all knowledge base implementations in Sikka Agent.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| reader | Reader | Document reader for specific file types | None |
| vector_db | VectorDb | Vector database for storing embeddings | None |
| num_documents | int | Number of documents to return on search | 5 |
| optimize_on | int | Document count threshold for optimization | 1000 |
| chunking_strategy | ChunkingStrategy | Strategy for chunking documents | FixedSizeChunking() |

Methods

| Method | Description |
| --- | --- |
| search(query, num_documents, filters) | Retrieves relevant documents matching a query |
| load(recreate, upsert, skip_existing, filters) | Loads documents into the vector database |
| load_documents(documents, upsert, skip_existing, filters) | Loads a list of documents into the vector database |
| load_document(document, upsert, skip_existing, filters) | Loads a single document into the vector database |
| load_text(text, upsert, skip_existing, filters) | Loads text content into the vector database |
| exists() | Checks if the knowledge base exists |
| delete() | Deletes the knowledge base |

Code Example

from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder(id="sentence-transformers/all-MiniLM-L6-v2")

# Create vector database
vector_db = Qdrant(
    collection_name="my_knowledge_base",
    embedder=embedder
)

# Create chunking strategy
chunking_strategy = FixedSizeChunking(chunk_size=500, chunk_overlap=50)

# Create knowledge base
knowledge = AgentKnowledge(
    vector_db=vector_db,
    chunking_strategy=chunking_strategy,
    num_documents=5
)

# Load documents
documents = [
    Document(content="Artificial intelligence is the simulation of human intelligence by machines."),
    Document(content="Machine learning is a subset of AI focused on data and algorithms.")
]
knowledge.load_documents(documents)

# Search for relevant documents
results = knowledge.search("What is machine learning?")
for doc in results:
    print(f"Content: {doc.content}")

Knowledge Base Types

Sikka Agent provides specialized knowledge base implementations for different data sources:

PDFKnowledgeBase

Loads documents from PDF files.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| path | str or Path | Path to PDF file or directory | Required |
| exclude_files | List[str] | Files to exclude | [] |
| reader | PDFReader | PDF reader implementation | PDFReader() |

Code Example

from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="pdf_knowledge", embedder=embedder)

# Create PDF knowledge base
pdf_knowledge = PDFKnowledgeBase(
    path="path/to/pdfs",
    vector_db=vector_db,
    exclude_files=["irrelevant.pdf"]
)

# Load documents into vector database
pdf_knowledge.load()

# Search for relevant information
results = pdf_knowledge.search("What is quantum computing?")

WebsiteKnowledgeBase

Loads documents from websites by crawling pages.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| urls | List[str] | List of website URLs to crawl | [] |
| reader | WebsiteReader | Website reader implementation | None |
| max_depth | int | Maximum crawl depth | 3 |
| max_links | int | Maximum links to follow | 10 |

Code Example

from sikkaagent.knowledge.website import WebsiteKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="website_knowledge", embedder=embedder)

# Create website knowledge base
website_knowledge = WebsiteKnowledgeBase(
    urls=["https://example.com"],
    vector_db=vector_db,
    max_depth=2,
    max_links=20
)

# Load documents into vector database
website_knowledge.load()

# Search for relevant information
results = website_knowledge.search("What services does the company offer?")

CSVKnowledgeBase

Loads documents from CSV files.

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| path | str or Path | Path to CSV file or directory | Required |
| exclude_files | List[str] | Files to exclude | [] |
| reader | CSVReader | CSV reader implementation | CSVReader() |

Code Example

from sikkaagent.knowledge.csv import CSVKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="csv_knowledge", embedder=embedder)

# Create CSV knowledge base
csv_knowledge = CSVKnowledgeBase(
    path="path/to/csvs",
    vector_db=vector_db
)

# Load documents into vector database
csv_knowledge.load()

# Search for relevant information
results = csv_knowledge.search("What are the sales figures for Q1?")

Other Knowledge Base Types

Sikka Agent provides additional knowledge base implementations:

| Knowledge Base | Description |
| --- | --- |
| ArxivKnowledgeBase | Loads documents from arXiv papers |
| CombinedKnowledgeBase | Combines multiple knowledge bases |
| CSVUrlKnowledgeBase | Loads documents from CSV files via URLs |
| DocumentKnowledgeBase | Uses pre-created Document objects |
| DocxKnowledgeBase | Loads documents from Word files |
| JSONKnowledgeBase | Loads documents from JSON files |
| PDFUrlKnowledgeBase | Loads documents from PDF files via URLs |
| TextKnowledgeBase | Loads documents from text files |
| UrlKnowledge | Loads documents from generic URLs |
| WikipediaKnowledgeBase | Loads documents from Wikipedia articles |
| YouTubeKnowledgeBase | Loads documents from YouTube video transcripts |

Integration with Agents

Knowledge bases can be integrated with agents to enhance their responses with relevant information:

from sikkaagent.agents import ChatAgent
from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.models import ModelConfigure
from sikkaagent.utils.enums import ModelPlatformType
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="agent_knowledge", embedder=embedder)

# Create PDF knowledge base
knowledge = PDFKnowledgeBase(
    path="path/to/company_docs",
    vector_db=vector_db
)

# Load documents
knowledge.load()

# Create agent with knowledge
agent = ChatAgent(
    model=ModelConfigure(
        model="llama3.1:8b",
        model_platform=ModelPlatformType.OLLAMA
    ),
    knowledge=knowledge,
    add_references=True,  # Include references in responses
    num_references=3,     # Number of references to include
    system_prompt="You are a helpful assistant with access to company documentation."
)

# Agent will now use the knowledge base
response = agent.step("What is our refund policy?")

Advanced Usage

Combining Multiple Knowledge Sources

Use CombinedKnowledgeBase to integrate information from different sources:

from sikkaagent.knowledge.combined import CombinedKnowledgeBase
from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.knowledge.website import WebsiteKnowledgeBase

# Create individual knowledge bases
pdf_knowledge = PDFKnowledgeBase(
    path="path/to/pdfs",
    vector_db=vector_db_1
)

website_knowledge = WebsiteKnowledgeBase(
    urls=["https://example.com"],
    vector_db=vector_db_2
)

# Create combined knowledge base
combined_knowledge = CombinedKnowledgeBase(
    sources=[pdf_knowledge, website_knowledge],
    vector_db=main_vector_db
)

# Load all sources into the combined knowledge base
combined_knowledge.load()

Filtering Search Results

Filter search results based on metadata:

# Search with filters
results = knowledge.search(
    query="What is our pricing?",
    filters={"category": "pricing", "year": 2023}
)
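
Filters of this kind require matching metadata to be attached when documents are loaded. The sketch below illustrates the assumed equality-match semantics in plain Python; the actual matching is performed inside the vector database, and the metadata field name here is hypothetical.

```python
# Conceptual sketch of metadata filtering: a document matches when every
# key in the filter equals the corresponding value in its metadata.

def matches(metadata, filters):
    """True when all filter keys are present in metadata with equal values."""
    return all(metadata.get(k) == v for k, v in filters.items())

# Documents with attached metadata (field name "meta" is hypothetical).
docs = [
    {"content": "Standard plan: $10/month.",
     "meta": {"category": "pricing", "year": 2023}},
    {"content": "2022 price list (archived).",
     "meta": {"category": "pricing", "year": 2022}},
    {"content": "Refunds within 30 days.",
     "meta": {"category": "policy", "year": 2023}},
]

# Only the current pricing document survives both filter conditions.
hits = [d for d in docs if matches(d["meta"], {"category": "pricing", "year": 2023})]
```

Structuring metadata consistently at load time is what makes this kind of narrowing possible at query time.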

Asynchronous Operations

Use asynchronous methods for better performance in async applications:

import asyncio

async def load_and_search():
    # Load documents asynchronously
    await knowledge.aload()

    # Search asynchronously
    results = await knowledge.async_search("What is machine learning?")
    return results

# Run the async function
results = asyncio.run(load_and_search())

Best Practices

  • Chunking Strategy: Choose an appropriate chunking strategy for your content:
      • Use smaller chunks (300-500 tokens) for precise retrieval
      • Use larger chunks (800-1000 tokens) for more context
      • Adjust overlap (50-100 tokens) to prevent information loss at boundaries
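
The overlap recommendation can be seen concretely. In the toy sketch below (word counts standing in for tokens), a phrase that straddles a chunk boundary is only retrievable intact when chunks overlap:

```python
# Why overlap matters: a phrase that straddles a chunk boundary still appears
# whole in some chunk when chunks overlap; without overlap it is split across
# two chunks and can no longer be matched as a unit.

def chunk(words, size, overlap):
    """Split a word list into chunks of `size` words, stepping by size - overlap."""
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

words = ("the quarterly report shows that net revenue increased by "
         "twelve percent compared to last year").split()

no_overlap = chunk(words, size=8, overlap=0)
with_overlap = chunk(words, size=8, overlap=3)

# The key phrase spans the boundary between the first and second chunks.
phrase = "revenue increased by twelve percent"
found_without = any(phrase in c for c in no_overlap)   # split at the boundary
found_with = any(phrase in c for c in with_overlap)    # intact in a middle chunk
```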

  • Vector Database Selection: Choose the appropriate vector database based on your needs:
      • Qdrant for general-purpose use
      • ChromaDB for simplicity and ease of use
      • PineconeDB for production deployments with large datasets

  • Document Processing:
      • Clean and preprocess documents before loading
      • Remove irrelevant content like headers, footers, and boilerplate text
      • Structure metadata to enable effective filtering

  • Performance Optimization:
      • Use skip_existing=True to avoid reprocessing documents
      • Set appropriate num_documents to balance relevance and context
      • Use upsert=True for incremental updates to knowledge bases
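
The effect of these flags can be sketched conceptually. The snippet below illustrates the intended semantics, not the library's actual code: documents are keyed by a content hash, skip_existing skips keys already present, and upsert overwrites them.

```python
import hashlib

# Conceptual sketch of skip_existing vs. upsert during load. Documents are
# keyed by a hash of their content; a real knowledge base keys on the vector
# database's document IDs.

def load(store, documents, upsert=False, skip_existing=True):
    """Load documents into `store`, returning how many were actually written."""
    loaded = 0
    for doc in documents:
        key = hashlib.sha256(doc.encode()).hexdigest()
        if key in store and skip_existing and not upsert:
            continue  # already present: skip re-embedding and re-inserting
        store[key] = doc  # insert new, or overwrite when upsert=True
        loaded += 1
    return loaded

store = {}
first = load(store, ["doc A", "doc B"])    # both documents are new
second = load(store, ["doc A", "doc C"])   # "doc A" is skipped, "doc C" is added
```

Skipping existing documents avoids the cost of re-embedding unchanged content, while upsert allows refreshing entries whose source has changed.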

  • Integration with Agents:
      • Use add_references=True to include source information in responses
      • Craft system prompts that instruct the agent how to use retrieved information
      • Consider using search_knowledge=True for agentic RAG that allows the agent to decide when to search