Khoj: Build Your AI Second Brain for Document Research
Transform how you interact with your knowledge. Khoj turns your scattered documents into an intelligent, conversational AI assistant that works everywhere you do.
Introduction: The Information Overload Crisis
You're drowning in documents. PDFs pile up unread. Notes scatter across Notion, Obsidian, and random Markdown files. Critical insights hide in Word docs you'll never find again. Every developer, researcher, and knowledge worker faces this same nightmare: you have the information, but you can't access it when you need it.
What if you could chat with your entire knowledge base like you chat with ChatGPT? What if your documents became a living, breathing AI assistant that answers questions, conducts research, and even anticipates your needs? That's exactly what Khoj delivers.
This revolutionary open-source tool doesn't just search your files—it understands them. It connects to any LLM, from local models like Llama 3 to cloud giants like GPT-4. It reads PDFs, images, markdown, Notion pages, and more. It lives on your desktop, phone, browser, or even WhatsApp. And it's completely self-hostable.
In this deep dive, you'll discover how Khoj works, why it's trending in the AI community, and how to deploy your own AI second brain today. We'll walk through real code examples, explore powerful use cases, and reveal pro tips for maximizing its potential. Ready to reclaim your knowledge? Let's begin.
What is Khoj? The AI Second Brain Revolution
Khoj (Hindi for "search") is an open-source personal AI application that functions as your digital second brain. Created by khoj-ai, this tool fundamentally reimagines how you interact with your personal and professional knowledge base.
At its core, Khoj is a document-aware AI assistant that ingests, indexes, and understands your files. But it's far more than a simple search tool. It's a full-fledged AI agent platform that can perform autonomous research, generate content, answer questions based on your documents, and even browse the web for current information.
The project has exploded in popularity because it solves a critical pain point: privacy-preserving AI document analysis. While tools like ChatGPT require uploading sensitive docs to external servers, Khoj can run entirely offline with local LLMs. This makes it irresistible for developers handling proprietary code, researchers with confidential data, and privacy-conscious users.
Khoj's architecture is modular and extensible. It supports multiple vector databases for semantic search, integrates with various LLM providers through a unified interface, and offers APIs for custom integrations. The system scales smoothly from a Raspberry Pi running Mistral to enterprise deployments handling terabytes of documents.
What makes Khoj truly special is its agent framework. You can create specialized AI assistants with custom personalities, knowledge bases, and tool access. Imagine a "Legal Advisor" agent that only knows your contract templates, or a "Code Reviewer" agent that understands your codebase intimately.
Key Features That Make Khoj Indispensable
Universal LLM Compatibility
Khoj doesn't lock you into a single AI provider. It supports every major LLM through a unified interface:
- Local models: Llama 3, Qwen, Gemma, Mistral, DeepSeek
- Cloud providers: OpenAI GPT, Anthropic Claude, Google Gemini
- Open-source endpoints: Ollama, vLLM, text-generation-webui
This flexibility means you can start with free local models and upgrade to premium cloud AI as needed. The model abstraction layer handles prompt formatting, token limits, and API quirks automatically.
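To illustrate what such an abstraction layer buys you, here is a minimal provider-agnostic sketch. The class and function names are hypothetical, not Khoj's actual internals; a real adapter would call the provider's HTTP API instead of the stub used here:

```python
from typing import Protocol


class ChatModel(Protocol):
    """Minimal provider-agnostic chat interface (illustrative)."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...


class LocalOllamaModel:
    """Hypothetical adapter for a model served locally by Ollama."""
    def __init__(self, model_name: str = "llama3.1:8b"):
        self.model_name = model_name

    def complete(self, prompt: str, max_tokens: int) -> str:
        # A real adapter would POST to Ollama's HTTP API;
        # stubbed here so the sketch stays self-contained.
        return f"[{self.model_name}] response to: {prompt[:40]}"


def ask(model: ChatModel, question: str) -> str:
    # Callers depend only on the interface, so swapping a local
    # model for a cloud provider requires no caller changes.
    return model.complete(question, max_tokens=512)


print(ask(LocalOllamaModel(), "Summarize my notes on vector search"))
```

Adding a cloud provider then means writing one more adapter class, not touching any calling code.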
Advanced Document Ingestion Pipeline
Khoj's ingestion engine is format-agnostic and intelligent. It processes:
- Text documents: PDF, Markdown, Word (.docx), org-mode, plain text
- Structured data: Notion pages, HTML, JSON, CSV
- Media files: Images (with OCR), audio transcripts
- Code: Python, JavaScript, and other source files
The pipeline extracts text, generates embeddings using state-of-the-art models, and stores them in a vector database for lightning-fast semantic search. It even handles incremental updates, only reprocessing changed files.
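The change-detection step behind incremental updates can be sketched with a simple content-hash manifest. This is a standalone illustration of the technique, not Khoj's actual implementation:

```python
import hashlib
import json
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash used to detect changed files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(doc_dir: Path, manifest_path: Path) -> list[Path]:
    """Return files that are new or modified since the last index run."""
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())

    changed = []
    for path in sorted(doc_dir.rglob("*.md")):
        digest = file_hash(path)
        if manifest.get(str(path)) != digest:
            # New or modified since the last run: queue for re-embedding.
            changed.append(path)
            manifest[str(path)] = digest

    manifest_path.write_text(json.dumps(manifest))
    return changed
```

On the first run every file is "changed"; subsequent runs return only files whose content hash differs from the stored manifest, so embeddings are regenerated only where needed.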
Multi-Platform Access Interface
Access your second brain anywhere, anytime:
- Web app: Full-featured browser interface
- Desktop: Native applications for Windows, macOS, Linux
- Mobile: Progressive Web App (PWA) for iOS/Android
- Editors: Obsidian plugin and Emacs integration
- Messaging: WhatsApp and Discord bots
- API: RESTful and WebSocket endpoints for custom apps
Intelligent Agent Framework
Create autonomous AI agents with:
- Custom personas: Define behavior, tone, and expertise
- Knowledge base scoping: Restrict agents to specific document collections
- Tool integration: Web search, image generation, calculator, code execution
- Scheduled automation: Trigger agents at intervals or on events
Research-Grade Semantic Search
Khoj's search goes beyond keywords. It uses vector similarity to find conceptually related content, even when phrasing differs completely. The hybrid search combines dense vector retrieval with traditional BM25 for optimal results.
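To make the hybrid idea concrete, here is a toy sketch that blends BM25 lexical scores with precomputed dense-vector similarity scores. It is illustrative only; Khoj's actual retrieval pipeline differs in detail:

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Classic BM25 lexical score for each document."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * (tf[term] * (k1 + 1)) / denom
        scores.append(score)
    return scores


def hybrid_rank(query: str, docs: list[str], dense_scores: list[float], alpha: float = 0.5) -> list[int]:
    """Blend normalized lexical and dense scores; returns doc indices, best first."""
    lex = bm25_scores(query, docs)

    def norm(xs):
        hi = max(xs) or 1.0  # avoid division by zero when all scores are 0
        return [x / hi for x in xs]

    combined = [alpha * d + (1 - alpha) * l
                for d, l in zip(norm(dense_scores), norm(lex))]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

The lexical leg rewards exact term matches; the dense leg (here supplied as precomputed scores) rewards conceptual similarity even when the phrasing differs. Blending the two tends to beat either alone.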
Automation and Notifications
Schedule agents to:
- Generate daily research summaries
- Monitor document changes and alert on relevant updates
- Create personalized newsletters from your knowledge base
- Trigger workflows based on search patterns
Real-World Use Cases: Where Khoj Shines
1. Academic Research Accelerator
A PhD student collects 500+ papers on machine learning. Instead of manually reading each, they deploy Khoj. The "Research Assistant" agent ingests all PDFs, arXiv links, and lecture notes.
When writing a literature review, they simply ask: "What are the latest advances in few-shot learning?" Khoj searches the papers, synthesizes findings, and cites sources with page numbers. The research mode (/research command) autonomously explores connections between papers, identifies seminal works, and generates a comprehensive summary.
Result: Research time drops from weeks to hours. No more manual annotation. No more forgotten insights.
2. Enterprise Knowledge Management
A 50-person startup struggles with knowledge silos. Engineering docs live in GitHub, product specs in Notion, meeting notes in Google Docs, and customer feedback in Slack.
They self-host Khoj on their internal server. The "Company Oracle" agent indexes everything. Sales asks: "What's our SLA for enterprise customers?" Engineering queries: "How do we handle OAuth in our API?" Product managers request: "Show me all feature requests about reporting."
Result: Support ticket resolution time drops 60%. Onboarding new engineers takes days instead of weeks. The company stops reinventing the wheel.
3. Developer Documentation Companion
A software architect maintains a complex microservices system. Documentation sprawls across Confluence, README files, and API specs. They integrate Khoj into their Emacs workflow.
While coding, they highlight a function and trigger Khoj search. Instantly, they see relevant architecture decisions, related services, and deprecation warnings from internal docs. The "Code Guru" agent explains legacy code by cross-referencing commit messages and design documents.
Result: Context switching minimized. Architecture decisions stay aligned. Technical debt becomes visible and manageable.
4. Legal and Compliance Analyst
A compliance officer must track regulatory changes across hundreds of PDFs. They create a "Regulatory Watchdog" agent that monitors new documents and highlights relevant changes.
When GDPR updates arrive, they ask: "How does this affect our data retention policy?" Khoj analyzes the new regulation, compares it to existing internal policies, and flags discrepancies. The agent automatically generates a compliance checklist.
Result: Audit preparation time cut by 70%. Risk of non-compliance drops dramatically. The officer focuses on strategy, not document hunting.
5. Content Creator Research Hub
A technical writer produces tutorials on emerging technologies. They use Khoj to manage bookmarks, screenshot annotations, and transcript notes from conference talks.
The "Content Researcher" agent suggests article topics based on trending searches in their knowledge base. It finds gaps in their coverage and recommends new angles. When writing, it auto-suggests citations and fact-checks claims against source materials.
Result: Content production doubles. Quality improves with better sourcing. Writer's block disappears when you can converse with your research.
Step-by-Step Installation & Setup Guide
Method 1: Docker Deployment (Recommended)
The fastest way to get Khoj running is with Docker. This method handles all dependencies automatically.
# Step 1: Pull the official image
docker pull ghcr.io/khoj-ai/khoj:latest
# Step 2: Create a directory for your data
mkdir -p ~/khoj-data
# Step 3: Run the container with volume mounts
docker run -d \
--name khoj \
-p 42110:42110 \
-v ~/khoj-data:/app/.khoj \
-v ~/Documents:/data \
-e KHOJ_ADMIN_EMAIL=admin@example.com \
-e KHOJ_ADMIN_PASSWORD=secure_password \
ghcr.io/khoj-ai/khoj:latest
Explanation: This command maps port 42110, persists Khoj's database to ~/khoj-data, and makes your Documents folder available for indexing. The admin credentials are set via environment variables.
Method 2: pip Installation
For Python users who prefer native installation:
# Step 1: Create a virtual environment
python -m venv khoj-env
source khoj-env/bin/activate # On Windows: khoj-env\Scripts\activate
# Step 2: Install Khoj
pip install khoj-assistant
# Step 3: Initialize configuration
khoj --configure
# Step 4: Start the server
khoj --host 0.0.0.0 --port 42110
Method 3: Development Setup
For contributors and tinkerers:
# Clone the repository
git clone https://github.com/khoj-ai/khoj.git
cd khoj
# Install dependencies
pip install -e .
# Set up pre-commit hooks
pre-commit install
# Run tests to verify installation
pytest tests/
Initial Configuration
After installation, access the web interface at http://localhost:42110. The setup wizard will guide you through:
- Choosing your LLM: Select from local models (requires Ollama) or cloud providers (requires API keys)
- Indexing documents: Point Khoj to your document directories
- Creating your first agent: Define a persona and knowledge scope
- Setting up automations: Configure scheduled tasks
Pro Tip: For local LLMs, install Ollama first:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.1:8b
# Khoj will auto-detect Ollama
Real Code Examples from Khoj
Example 1: Configuring Your First Agent
Agents are the heart of Khoj. Here's how to define a custom agent via the configuration file:
# ~/.khoj/khoj.yml
agents:
  - name: "Code-Reviewer"
    personality: |
      You are a senior software architect with 20 years of experience.
      You review code for security, performance, and maintainability.
      Be concise but thorough. Always suggest improvements.
    chat-model: "gpt-4"
    docs:
      - "/home/user/code-docs"
      - "/home/user/architecture-decisions"
    tools:
      - "web-search"
      - "code-execution"
    enable-entry-filter: true
Explanation: This YAML configuration creates a "Code-Reviewer" agent scoped to specific documentation directories. It uses GPT-4 for analysis and can search the web and execute code snippets. The enable-entry-filter restricts it to only the specified knowledge base.
Example 2: API Search Query
Programmatically search your documents using Khoj's REST API:
import requests

# Configure API endpoint and authentication
KHOJ_URL = "http://localhost:42110"
API_KEY = "your_api_key_here"

# Prepare the search query
query = {
    "q": "How do we handle authentication in the payment service?",
    "agent": "Code-Reviewer",
    "n": 5,        # Return top 5 results
    "r": True,     # Enable research mode for deeper analysis
    "stream": False
}

# Send request to Khoj
response = requests.post(
    f"{KHOJ_URL}/api/chat",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=query  # requests serializes the body and sets Content-Type
)

# Parse the response
results = response.json()
for result in results.get("results", []):
    print(f"Source: {result['file']}")
    print(f"Relevance: {result['score']:.2f}")
    print(f"Content: {result['content'][:200]}...")
    print("-" * 50)
Explanation: This Python script demonstrates how to integrate Khoj into your applications. The API returns semantically relevant document chunks with relevance scores. The r=True parameter enables research mode for comprehensive answers.
Example 3: Automating Daily Research Summaries
Set up a scheduled agent to monitor new documents and generate reports:
// automation-config.js
{
  "automations": [
    {
      "name": "Daily-Research-Digest",
      "schedule": "0 8 * * *",  // Run daily at 8 AM
      "agent": "Research-Assistant",
      "prompt": `
        Analyze all documents added in the last 24 hours.
        Identify key insights about machine learning.
        Summarize in 3 bullet points.
        Suggest 2 potential research directions.
      `,
      "output": {
        "type": "email",
        "to": "researcher@university.edu",
        "subject": "Daily ML Research Digest"
      }
    }
  ]
}
Explanation: This automation configuration uses cron syntax to schedule daily research summaries. The agent analyzes new documents, extracts ML-related insights, and emails a concise digest. This turns Khoj into a proactive research assistant.
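If the cron syntax is unfamiliar, a minimal matcher shows how a schedule like 0 8 * * * is evaluated against a timestamp. This is a simplified sketch supporting only * and plain numbers (no ranges or steps):

```python
from datetime import datetime


def cron_matches(expr: str, when: datetime) -> bool:
    """Check whether a 5-field cron expression (minute, hour, day-of-month,
    month, day-of-week) matches a timestamp. Supports only '*' and numbers."""
    fields = expr.split()
    values = [
        when.minute,
        when.hour,
        when.day,
        when.month,
        (when.weekday() + 1) % 7,  # cron counts Sunday as 0
    ]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))


print(cron_matches("0 8 * * *", datetime(2024, 5, 1, 8, 0)))  # True: 8:00 AM
print(cron_matches("0 8 * * *", datetime(2024, 5, 1, 9, 0)))  # False: 9:00 AM
```

A scheduler then just evaluates each automation's expression once per minute and fires the agent on a match.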
Example 4: Document Ingestion Script
Batch index documents from multiple sources:
#!/bin/bash
# bulk-index.sh - Index documents from various sources

KHOJ_API="http://localhost:42110/api/v1/index"
API_KEY="your_api_key"

# Index local documents
curl -X POST "$KHOJ_API" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "local",
    "path": "/home/user/Documents/research",
    "recursive": true,
    "file-types": ["pdf", "md", "txt"]
  }'

# Sync Notion workspace
curl -X POST "$KHOJ_API" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source": "notion",
    "token": "secret_notion_token",
    "databases": ["Research Notes", "Meeting Minutes"]
  }'

# Trigger re-indexing
curl -X POST "$KHOJ_API/update" \
  -H "Authorization: Bearer $API_KEY"

echo "Indexing initiated. Check Khoj logs for progress."
Explanation: This bash script demonstrates bulk indexing from multiple sources. It indexes local files recursively and syncs Notion databases. The final update call triggers the embedding generation process.
Advanced Usage & Best Practices
Optimize Performance for Large Document Collections
When indexing 10,000+ documents, follow these strategies:
- Batch processing: Index in chunks of 500 files to avoid memory issues
- Incremental updates: Use the --incremental flag to only process new or changed files
- Embedding model selection: For speed, use sentence-transformers/all-MiniLM-L6-v2; for quality, use text-embedding-ada-002
- Database optimization: Regularly run khoj --vacuum to reclaim space and rebuild indexes
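The batch-processing advice boils down to iterating over fixed-size slices of the file list rather than feeding every file to the indexer at once. A minimal sketch (the indexing call itself is left out):

```python
from typing import Iterator


def batched(paths: list[str], size: int = 500) -> Iterator[list[str]]:
    """Yield successive fixed-size batches so memory use stays bounded."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Hypothetical usage: submit each batch to the indexer separately.
all_docs = [f"doc_{i}.pdf" for i in range(1200)]
for batch in batched(all_docs):
    # index_batch(batch)  # placeholder for the actual indexing call
    print(len(batch))  # prints 500, 500, 200
```

Each batch completes (and frees its memory) before the next one starts, which is what keeps a 10,000-document run from exhausting RAM.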
Security Best Practices for Self-Hosting
- API key rotation: Set KHOJ_API_KEY_TTL=30 to force monthly key rotation
- Network isolation: Run Khoj in a Docker network with restricted outbound access
- Document permissions: Use Linux ACLs to ensure Khoj only reads authorized files
- Audit logging: Enable KHOJ_AUDIT_LOG=true to track all queries and data access
Custom Tool Integration
Extend Khoj with custom tools by implementing the Tool protocol:
# custom_tool.py
from khoj.routers.api import Tool

class JiraSearch(Tool):
    def __init__(self):
        self.name = "jira-search"
        self.description = "Search Jira tickets"

    def execute(self, query: str) -> str:
        # Implementation to search the Jira API goes here
        results = ""  # placeholder for the formatted Jira results
        return results

# Register in your config
# tools:
#   - "custom_tool.JiraSearch"
Multi-User Enterprise Setup
For team deployments:
# docker-compose.yml
version: '3.8'
services:
  khoj:
    image: ghcr.io/khoj-ai/khoj:latest
    environment:
      KHOJ_DATABASE_URL: "postgresql://user:pass@db:5432/khoj"
      KHOJ_REDIS_URL: "redis://cache:6379"
    volumes:
      - team-data:/app/.khoj
  db:
    image: postgres:15
    volumes:
      - postgres-data:/var/lib/postgresql/data
  cache:
    image: redis:7-alpine

volumes:
  team-data:
  postgres-data:
This setup separates the database and cache, enabling horizontal scaling for large teams.
Comparison: Khoj vs. Alternatives
| Feature | Khoj | Obsidian AI | ChatGPT + Plugins | privateGPT |
|---|---|---|---|---|
| Self-hosting | ✅ Yes, fully open-source | ❌ No | ❌ No | ✅ Yes |
| Local LLMs | ✅ Extensive support | ⚠️ Limited | ❌ No | ✅ Yes |
| Document types | ✅ 15+ formats | ⚠️ Only markdown | ⚠️ Via plugins | ⚠️ PDF only |
| Multi-platform | ✅ Web, mobile, editors, chat | ⚠️ Editor only | ✅ Web only | ⚠️ Web only |
| Custom agents | ✅ Advanced framework | ❌ No | ⚠️ Basic GPTs | ❌ No |
| Automation | ✅ Scheduled tasks | ❌ No | ⚠️ Via API | ❌ No |
| Semantic search | ✅ Hybrid vector + BM25 | ⚠️ Basic | ❌ No | ✅ Vector only |
| Cost | 🆓 Free/self-hosted | 💰 Paid subscription | 💰 Paid API | 🆓 Free |
| Enterprise ready | ✅ Hybrid cloud/on-prem | ❌ No | ✅ Cloud only | ⚠️ Self-manage |
Why Khoj Wins: Unlike single-purpose tools, Khoj is a unified platform combining document intelligence, agent autonomy, and deployment flexibility. While Obsidian AI excels at note-taking and privateGPT offers simple local search, Khoj delivers production-grade features for serious knowledge work.
Frequently Asked Questions
Q: How much RAM do I need to run Khoj locally?
A: For basic use with local models, 8GB RAM is the minimum. With Ollama running a 7B model, expect 6-8GB of usage. For larger 70B models, you'll need 32GB+ RAM. Cloud LLMs reduce local requirements to just 2-4GB.
Q: Can Khoj read scanned PDFs and images?
A: Yes. Khoj includes OCR capabilities via Tesseract. It automatically extracts text from images, scanned documents, and even charts. For best results, ensure your scans are 300 DPI or higher.
Q: How does Khoj handle document updates?
A: Khoj tracks file hashes and modification times. When you re-index, it only processes changed files. You can enable auto-indexing with KHOJ_AUTO_INDEX=true to watch directories for changes.
Q: Is my data sent to external services?
A: Only if you configure it. With local LLMs and self-hosting, all data stays on your machine. When using cloud LLMs, only the text snippets relevant to your query are sent. Khoj never uploads your entire document collection.
Q: Can I share agents with my team?
A: Absolutely. Export agent configurations as YAML files and commit them to Git. Team members can import them via the web interface or API. Enterprise plans offer a shared agent registry.
Q: What makes Khoj different from a simple RAG pipeline?
A: Khoj is a complete application layer on top of RAG. It includes user management, multi-modal support, automation, and an agent framework. You get a production-ready system, not just a prototype.
Q: How do I contribute to Khoj development?
A: Visit the GitHub repository and check out the good-first-issue labels. The project welcomes contributions in Python, TypeScript, and documentation. Join the Discord community for guidance.
Conclusion: Your Knowledge, Amplified
Khoj isn't just another AI tool—it's a paradigm shift in personal knowledge management. By transforming static documents into an interactive, intelligent system, it solves the fundamental problem of modern information work: you can't use what you can't find.
The combination of local LLM support, custom agents, and multi-platform access makes Khoj uniquely powerful. Whether you're a researcher drowning in papers, a developer navigating complex codebases, or a team leader fighting knowledge silos, Khoj delivers tangible ROI from day one.
What excites me most is the automation layer. This isn't just search—it's a proactive assistant that works while you sleep, generating insights and keeping you informed. The research mode (/research) is particularly groundbreaking, autonomously connecting dots across your knowledge base.
The best part? You can start free on their cloud app or self-host in minutes. The open-source nature means you're never locked in, and the active community ensures continuous improvement.
Don't let your knowledge go to waste. Deploy Khoj today and experience what it means to have a true AI second brain. Your future self will thank you.
🚀 Get started with Khoj now - Star the repo, join the Discord, and transform how you think about your documents.