Textrawl
Personal Knowledge MCP Server
A semantic search system that transforms scattered documents into queryable knowledge accessible directly from Claude. Built as an MCP server to bridge personal knowledge bases with AI conversations.
The Problem
Knowledge workers accumulate thousands of documents across PDFs, Word docs, emails, and notes. When you need specific information, you're stuck with filename-based search or manual scanning. Traditional search fails because it can't understand context or meaning.
- Can't search across different file formats (PDF, DOCX, email, HTML)
- Keyword search misses relevant content phrased differently
- No way to access this knowledge from AI tools like Claude
- Email threads and Google Takeout archives are effectively unsearchable
Constraints
- Must work offline-first for privacy-sensitive documents
- Must integrate seamlessly with Claude via MCP
- Must handle large document collections (10,000+ files)
- Processing speed is critical for user experience
- Can't rely on commercial vector DBs for cost/privacy reasons
The Approach
Built a hybrid search system combining semantic vector search with traditional full-text search, using reciprocal rank fusion to merge results. Chose PostgreSQL with pgvector over specialized vector databases for flexibility and SQL familiarity.
Why PostgreSQL + pgvector vs. dedicated vector DB?
Needed SQL's flexibility for complex filtering, joins, and metadata queries. pgvector provides vector similarity search while keeping everything in one database. Simpler ops than managing separate systems.
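To illustrate the kind of query this enables, here is a minimal sketch that builds a parameterized SQL statement combining a metadata filter, a join, and a pgvector similarity ordering. The table and column names (`chunks`, `documents`, `embedding`, `source_type`) are assumptions made for illustration, not the project's actual schema. `<=>` is pgvector's cosine-distance operator.

```python
from typing import Sequence


def build_filtered_vector_query(
    query_embedding: Sequence[float],
    source_type: str,
    limit: int = 10,
) -> tuple[str, tuple]:
    """Return a parameterized SQL string and its parameters.

    Schema names here are hypothetical. The placeholders use the %s
    style accepted by common PostgreSQL drivers such as psycopg.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    sql = """
        SELECT c.id, d.title, c.content,
               c.embedding <=> %s::vector AS distance
        FROM chunks AS c
        JOIN documents AS d ON d.id = c.document_id
        WHERE d.source_type = %s
        ORDER BY c.embedding <=> %s::vector
        LIMIT %s
    """
    # The vector literal appears twice: once in SELECT, once in ORDER BY.
    return sql, (vector_literal, source_type, vector_literal, limit)
```

The function only builds the statement. Executing it would require a database connection and a table with a pgvector column, both of which are outside this sketch.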
Why hybrid search instead of pure semantic?
Vector search alone misses exact matches. Full-text alone misses semantic similarity. Reciprocal rank fusion combines both, surfacing results that rank high in either system.
Why MCP server architecture?
MCP (Model Context Protocol) lets Claude Desktop, Cursor, and other tools access the knowledge base through a standard interface. Better than one-off integrations.
Key Tradeoffs
PostgreSQL over specialized vector DBs (Pinecone, Weaviate)
Sacrificed some vector search performance for SQL flexibility and simpler infrastructure. Right call because filtering and metadata queries are critical.
Chunking with overlap vs. simple splits
Increased storage by 30% but dramatically improved retrieval quality. Context preservation at chunk boundaries matters more than space.
OpenAI embeddings vs. local models
Accepted added latency and API cost in exchange for superior embedding quality. Worth it because search quality is the entire value proposition.
Implementation Highlights
Multi-format ingestion pipeline
Handles PDF (text extraction with fallback OCR), Word docs, Markdown, and plain text. Converts emails (mbox, EML) and HTML to a searchable format. Parses Google Takeout JSON exports.
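As one illustration of the email path, the sketch below extracts searchable fields from a single EML message using only Python's standard `email` package. The function name `eml_to_text` and the returned field names are assumptions for this example. The actual pipeline may extract more (attachments, threading headers) or use a different library.

```python
from email import message_from_string, policy


def eml_to_text(raw: str) -> dict:
    """Extract subject, sender, and plain-text body from one EML message."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_string(raw, policy=policy.default)

    # Prefer the text/plain part; fall back to an empty body if none exists.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    return {
        "subject": str(msg["subject"] or ""),
        "sender": str(msg["from"] or ""),
        "text": body.strip(),
    }


raw_message = "From: alice@example.com\nSubject: Q3 budget\n\nNumbers attached.\n"
print(eml_to_text(raw_message))
```

An mbox archive could be handled the same way by iterating over its messages with the standard `mailbox` module and passing each one through a similar extractor.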
Smart chunking with context preservation
Splits documents into 500-token chunks with 50-token overlap. Preserves document metadata and position for context. Maintains sentence boundaries to avoid mid-sentence splits.
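A minimal sketch of sentence-aware chunking with overlap follows. It makes two simplifying assumptions that are not from the project: whitespace-separated words stand in for real tokenizer tokens, and sentences are split with a naive regular expression. The function name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Pack whole sentences into chunks, repeating trailing sentences as overlap."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    current_len = 0          # approximate token count of `current`

    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, up to `overlap` tokens,
            # so context at the boundary appears in both chunks.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        # A sentence is never split, so a very long one may exceed max_tokens.
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_text(sample, max_tokens=6, overlap=3):
    print(chunk)
```

With these small limits, each chunk holds two sentences and shares one sentence with its neighbor, which is the boundary-preserving behavior the overlap is meant to provide.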
Reciprocal rank fusion algorithm
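Merges the ranked result lists from pgvector similarity search and PostgreSQL full-text search. RRF works on rank positions rather than raw scores: score(d) = Σ 1/(k + rank_i(d)), summed over each result list i that contains document d, with k=60. Surfaces results strong in either dimension.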
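The fusion step can be sketched in a few lines. This version assumes each search backend returns an ordered list of document IDs, best match first. The function name and the sample IDs are illustrative only.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]    # hypothetical pgvector ranking
fulltext_hits = ["b", "d", "a"]  # hypothetical full-text ranking
for doc_id, score in reciprocal_rank_fusion([vector_hits, fulltext_hits]):
    print(doc_id, round(score, 5))
```

Documents that appear in both lists ("a" and "b") accumulate two terms and rise to the top, while a document found by only one backend still receives a score and stays in the results.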
MCP integration
Implements Model Context Protocol for Claude Desktop integration. Exposes search, document retrieval, and metadata tools. Handles async operations for large result sets.
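To show roughly what a tool invocation looks like on the wire, here is a standard-library sketch that handles a JSON-RPC 2.0 `tools/call` request and returns a text result. It is not the project's server code: a real implementation would more likely use an MCP SDK, and the tool name `search_knowledge`, its arguments, and the in-memory corpus are all hypothetical.

```python
import json


def search_knowledge(query: str, limit: int = 5) -> list[dict]:
    """Hypothetical tool body; a real one would query the database."""
    corpus = [{"title": "Q3 budget notes", "snippet": "Budget approved in September."}]
    return [d for d in corpus if query.lower() in d["snippet"].lower()][:limit]


# Registry mapping tool names to callables (names are illustrative).
TOOLS = {"search_knowledge": search_knowledge}


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC request string and return the response string."""
    request = json.loads(raw)
    request_id = request.get("id")

    def error(code: int, message: str) -> str:
        body = {"jsonrpc": "2.0", "id": request_id,
                "error": {"code": code, "message": message}}
        return json.dumps(body)

    if request.get("method") != "tools/call":
        return error(-32601, "Method not found")

    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return error(-32602, "Unknown tool")

    result = tool(**params.get("arguments", {}))
    # Tool results are returned as a list of content items.
    content = [{"type": "text", "text": json.dumps(result)}]
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "result": {"content": content, "isError": False}})


call = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {"name": "search_knowledge", "arguments": {"query": "budget"}}}
print(handle_request(json.dumps(call)))
```

The sketch omits initialization, tool listing, and transport (stdio or HTTP), which an SDK would normally take care of.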
Outcomes
What I'd Do Differently
- Start with chunking strategy experimentation earlier. Spent too long on infrastructure before validating retrieval quality.
- Build telemetry from day one. Added metrics late, which delayed performance optimization.
- Implement batch embedding API calls sooner. Initial one-at-a-time approach was 10x slower.
- Design the metadata schema with evolution in mind. Early schema changes required painful migrations.