# Knowledge
Sikka Agent's Knowledge module provides a flexible system for creating, managing, and querying knowledge bases from various data sources. This enables agents to access and reference external information during conversations.
## Overview
The Knowledge module allows agents to ground their responses in factual information by:
- Loading documents from various sources (PDFs, websites, CSV files, etc.)
- Chunking documents into manageable pieces
- Storing document embeddings in vector databases
- Retrieving relevant information based on semantic similarity
- Integrating retrieved information into agent responses
This capability is essential for implementing Retrieval Augmented Generation (RAG) patterns that enhance agent responses with specific, accurate information.
## Core Components

### AgentKnowledge

The foundational class for all knowledge base implementations in Sikka Agent.

#### Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| `reader` | `Reader` | Document reader for specific file types | `None` |
| `vector_db` | `VectorDb` | Vector database for storing embeddings | `None` |
| `num_documents` | `int` | Number of documents to return on search | `5` |
| `optimize_on` | `int` | Document count threshold for optimization | `1000` |
| `chunking_strategy` | `ChunkingStrategy` | Strategy for chunking documents | `FixedSizeChunking()` |
#### Methods

| Method | Description |
|---|---|
| `search(query, num_documents, filters)` | Retrieves relevant documents matching a query |
| `load(recreate, upsert, skip_existing, filters)` | Loads documents into the vector database |
| `load_documents(documents, upsert, skip_existing, filters)` | Loads a list of documents into the vector database |
| `load_document(document, upsert, skip_existing, filters)` | Loads a single document into the vector database |
| `load_text(text, upsert, skip_existing, filters)` | Loads text content into the vector database |
| `exists()` | Checks if the knowledge base exists |
| `delete()` | Deletes the knowledge base |
#### Code Example

```python
from sikkaagent.knowledge import AgentKnowledge
from sikkaagent.document import Document
from sikkaagent.document.chunking.fixed import FixedSizeChunking
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder(id="sentence-transformers/all-MiniLM-L6-v2")

# Create vector database
vector_db = Qdrant(
    collection_name="my_knowledge_base",
    embedder=embedder
)

# Create chunking strategy
chunking_strategy = FixedSizeChunking(chunk_size=500, chunk_overlap=50)

# Create knowledge base
knowledge = AgentKnowledge(
    vector_db=vector_db,
    chunking_strategy=chunking_strategy,
    num_documents=5
)

# Load documents
documents = [
    Document(content="Artificial intelligence is the simulation of human intelligence by machines."),
    Document(content="Machine learning is a subset of AI focused on data and algorithms.")
]
knowledge.load_documents(documents)

# Search for relevant documents
results = knowledge.search("What is machine learning?")
for doc in results:
    print(f"Content: {doc.content}")
```
## Knowledge Base Types

Sikka Agent provides specialized knowledge base implementations for different data sources:

### PDFKnowledgeBase

Loads documents from PDF files.

#### Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `path` | `str` or `Path` | Path to PDF file or directory | Required |
| `exclude_files` | `List[str]` | Files to exclude | `[]` |
| `reader` | `PDFReader` | PDF reader implementation | `PDFReader()` |

#### Code Example
```python
from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="pdf_knowledge", embedder=embedder)

# Create PDF knowledge base
pdf_knowledge = PDFKnowledgeBase(
    path="path/to/pdfs",
    vector_db=vector_db,
    exclude_files=["irrelevant.pdf"]
)

# Load documents into vector database
pdf_knowledge.load()

# Search for relevant information
results = pdf_knowledge.search("What is quantum computing?")
```
### WebsiteKnowledgeBase

Loads documents from websites by crawling pages.

#### Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `urls` | `List[str]` | List of website URLs to crawl | `[]` |
| `reader` | `WebsiteReader` | Website reader implementation | `None` |
| `max_depth` | `int` | Maximum crawl depth | `3` |
| `max_links` | `int` | Maximum links to follow | `10` |

#### Code Example
```python
from sikkaagent.knowledge.website import WebsiteKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="website_knowledge", embedder=embedder)

# Create website knowledge base
website_knowledge = WebsiteKnowledgeBase(
    urls=["https://example.com"],
    vector_db=vector_db,
    max_depth=2,
    max_links=20
)

# Load documents into vector database
website_knowledge.load()

# Search for relevant information
results = website_knowledge.search("What services does the company offer?")
```
### CSVKnowledgeBase

Loads documents from CSV files.

#### Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `path` | `str` or `Path` | Path to CSV file or directory | Required |
| `exclude_files` | `List[str]` | Files to exclude | `[]` |
| `reader` | `CSVReader` | CSV reader implementation | `CSVReader()` |

#### Code Example
```python
from sikkaagent.knowledge.csv import CSVKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="csv_knowledge", embedder=embedder)

# Create CSV knowledge base (path is required: a CSV file or a directory of CSVs)
csv_knowledge = CSVKnowledgeBase(
    path="path/to/csvs",
    vector_db=vector_db
)

# Load documents into vector database
csv_knowledge.load()

# Search for relevant information
results = csv_knowledge.search("What are the sales figures for Q1?")
```
### Other Knowledge Base Types

Sikka Agent provides additional knowledge base implementations:

| Knowledge Base | Description |
|---|---|
| `ArxivKnowledgeBase` | Loads documents from arXiv papers |
| `CombinedKnowledgeBase` | Combines multiple knowledge bases |
| `CSVUrlKnowledgeBase` | Loads documents from CSV files via URLs |
| `DocumentKnowledgeBase` | Uses pre-created Document objects |
| `DocxKnowledgeBase` | Loads documents from Word files |
| `JSONKnowledgeBase` | Loads documents from JSON files |
| `PDFUrlKnowledgeBase` | Loads documents from PDF files via URLs |
| `TextKnowledgeBase` | Loads documents from text files |
| `UrlKnowledge` | Loads documents from generic URLs |
| `WikipediaKnowledgeBase` | Loads documents from Wikipedia articles |
| `YouTubeKnowledgeBase` | Loads documents from YouTube video transcripts |
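These classes follow the same pattern as the examples above. As a minimal sketch, a `TextKnowledgeBase` might be wired up like the PDF example; the import path (`sikkaagent.knowledge.text`) and constructor arguments shown here are assumptions based on the PDF and CSV implementations and may differ per class.

```python
# A sketch using TextKnowledgeBase; the import path and constructor
# arguments are assumed to mirror the PDF/CSV examples above.
from sikkaagent.knowledge.text import TextKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

vector_db = Qdrant(
    collection_name="text_knowledge",
    embedder=SentenceTransformerEmbedder(),
)

text_knowledge = TextKnowledgeBase(
    path="path/to/notes",  # a text file or a directory of text files
    vector_db=vector_db,
)

text_knowledge.load()
results = text_knowledge.search("What did the meeting notes say about onboarding?")
```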
## Integration with Agents

Knowledge bases can be integrated with agents to enhance their responses with relevant information:
```python
from sikkaagent.agents import ChatAgent
from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.models import ModelConfigure
from sikkaagent.utils.enums import ModelPlatformType
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

# Create embedder
embedder = SentenceTransformerEmbedder()

# Create vector database
vector_db = Qdrant(collection_name="agent_knowledge", embedder=embedder)

# Create PDF knowledge base
knowledge = PDFKnowledgeBase(
    path="path/to/company_docs",
    vector_db=vector_db
)

# Load documents
knowledge.load()

# Create agent with knowledge
agent = ChatAgent(
    model=ModelConfigure(
        model="llama3.1:8b",
        model_platform=ModelPlatformType.OLLAMA
    ),
    knowledge=knowledge,
    add_references=True,  # Include references in responses
    num_references=3,     # Number of references to include
    system_prompt="You are a helpful assistant with access to company documentation."
)

# The agent will now ground its answers in the knowledge base
response = agent.step("What is our refund policy?")
```
## Advanced Usage

### Combining Multiple Knowledge Sources

Use `CombinedKnowledgeBase` to integrate information from different sources:
```python
from sikkaagent.knowledge.combined import CombinedKnowledgeBase
from sikkaagent.knowledge.pdf import PDFKnowledgeBase
from sikkaagent.knowledge.website import WebsiteKnowledgeBase
from sikkaagent.storages.vectordb.qdrant.qdrant import Qdrant
from sikkaagent.retrievers.embedder.sentence_transformer import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder()

# Each source and the combined base get their own collection
vector_db_1 = Qdrant(collection_name="pdf_source", embedder=embedder)
vector_db_2 = Qdrant(collection_name="website_source", embedder=embedder)
main_vector_db = Qdrant(collection_name="combined_knowledge", embedder=embedder)

# Create individual knowledge bases
pdf_knowledge = PDFKnowledgeBase(
    path="path/to/pdfs",
    vector_db=vector_db_1
)
website_knowledge = WebsiteKnowledgeBase(
    urls=["https://example.com"],
    vector_db=vector_db_2
)

# Create combined knowledge base
combined_knowledge = CombinedKnowledgeBase(
    sources=[pdf_knowledge, website_knowledge],
    vector_db=main_vector_db
)

# Load all sources into the combined knowledge base
combined_knowledge.load()
```
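Once loaded, the combined base is queried like any other knowledge base, so a single search draws on chunks from every source:

```python
# One query now spans both the PDF and website content
results = combined_knowledge.search("What does the documentation say about pricing?")
```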
### Filtering Search Results

Filter search results based on metadata:

```python
# Search with filters
results = knowledge.search(
    query="What is our pricing?",
    filters={"category": "pricing", "year": 2023}
)
```
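For filters to match anything, the corresponding metadata has to be attached when documents are loaded. The `load_documents` signature in the Methods table accepts a `filters` argument; the following is a minimal sketch, assuming that argument tags the loaded documents with that metadata.

```python
# A sketch of loading documents with filterable metadata, assuming the
# `filters` argument of load_documents() tags each document as it is stored.
from sikkaagent.document import Document

pricing_docs = [
    Document(content="Standard plan: $29/month. Enterprise plan: custom pricing."),
]
knowledge.load_documents(
    pricing_docs,
    upsert=True,
    filters={"category": "pricing", "year": 2023},
)

# Later searches can narrow results to that slice of the knowledge base
results = knowledge.search("What is our pricing?", filters={"category": "pricing"})
```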
### Asynchronous Operations

Use asynchronous methods for better performance in async applications:

```python
import asyncio

async def load_and_search():
    # Load documents asynchronously
    await knowledge.aload()
    # Search asynchronously
    results = await knowledge.async_search("What is machine learning?")
    return results

# Run the async function
results = asyncio.run(load_and_search())
```
## Best Practices

- **Chunking Strategy**: Choose an appropriate chunking strategy for your content:
  - Use smaller chunks (300-500 tokens) for precise retrieval
  - Use larger chunks (800-1000 tokens) for more context
  - Adjust overlap (50-100 tokens) to prevent information loss at boundaries
- **Vector Database Selection**: Choose the appropriate vector database based on your needs:
  - Qdrant for general-purpose use
  - ChromaDB for simplicity and ease of use
  - PineconeDB for production deployments with large datasets
- **Document Processing**:
  - Clean and preprocess documents before loading
  - Remove irrelevant content like headers, footers, and boilerplate text
  - Structure metadata to enable effective filtering
- **Performance Optimization** (see the sketch after this list):
  - Use `skip_existing=True` to avoid reprocessing documents
  - Set an appropriate `num_documents` to balance relevance and context
  - Use `upsert=True` for incremental updates to knowledge bases
- **Integration with Agents**:
  - Use `add_references=True` to include source information in responses
  - Craft system prompts that instruct the agent how to use retrieved information
  - Consider using `search_knowledge=True` for agentic RAG that allows the agent to decide when to search
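The performance flags above map directly onto the `load(recreate, upsert, skip_existing, filters)` signature from the Methods table. This is a minimal sketch; the exact interplay of `upsert` and `skip_existing` is inferred from their descriptions.

```python
# A sketch of incremental loading using the flags from the Methods table.
# Behavior is inferred from the parameter descriptions above.

# First run: build the collection from scratch.
knowledge.load(recreate=True)

# Subsequent runs: only embed new or changed documents.
knowledge.load(
    recreate=False,
    upsert=True,         # update records that already exist
    skip_existing=True,  # skip documents that are already stored
)

# Keep retrieval focused: fewer, more relevant chunks per query.
results = knowledge.search("What is our refund policy?", num_documents=3)
```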