Knowledge Management • 2024

Textrawl

Personal Knowledge MCP Server

A semantic search system that transforms scattered documents into queryable knowledge accessible directly from Claude. Built as an MCP server to bridge personal knowledge bases with AI conversations.

Role: Creator & Lead Developer

Tech Stack

TypeScript, PostgreSQL, pgvector, +3 more


The Problem

Knowledge workers accumulate thousands of documents across PDFs, Word docs, emails, and notes. When you need specific information, you're stuck with filename-based search or manual scanning. Traditional search fails because it can't understand context or meaning.

Can't search across different file formats (PDF, DOCX, email, HTML)
Keyword search misses relevant content phrased differently
No way to access this knowledge from AI tools like Claude
Email threads and Google Takeout archives are effectively unsearchable

Constraints

Must work offline-first for privacy-sensitive documents
Must integrate seamlessly with Claude via MCP (Model Context Protocol)
Must handle large document collections (10,000+ files)
Processing speed is critical for user experience
Can't rely on commercial vector DBs for cost/privacy reasons

The Approach

Built a hybrid search system combining semantic vector search with traditional full-text search, using reciprocal rank fusion to merge results. Chose PostgreSQL with pgvector over specialized vector databases for flexibility and SQL familiarity.

Why PostgreSQL + pgvector vs. dedicated vector DB?

The system needs SQL's flexibility for complex filtering, joins, and metadata queries. pgvector provides vector similarity search while keeping everything in one database, which is simpler to operate than managing separate systems.
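
To make the single-database argument concrete, here is a minimal sketch of a query that applies a metadata filter and orders by vector distance in one statement. The table and column names (`chunks`, `documents`, `embedding`, `source_type`) and the `buildFilteredVectorQuery` helper are assumptions for illustration, not Textrawl's actual schema. `<=>` is pgvector's cosine-distance operator.

```typescript
// Hypothetical schema (an assumption, not the project's real one):
//   documents(id, source_type, ...)
//   chunks(id, document_id, content, embedding vector)
interface SqlQuery {
  text: string;
  values: unknown[];
}

// pgvector accepts a text literal such as '[0.1,0.2]' cast to the vector type.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Builds a parameterized query: metadata filter plus nearest-neighbour
// ordering by cosine distance, all inside PostgreSQL.
function buildFilteredVectorQuery(
  embedding: number[],
  sourceType: string,
  limit: number
): SqlQuery {
  const text = `
    SELECT c.id, c.content, d.source_type,
           c.embedding <=> $1::vector AS distance
    FROM chunks c
    JOIN documents d ON d.id = c.document_id
    WHERE d.source_type = $2
    ORDER BY c.embedding <=> $1::vector
    LIMIT $3`;
  return { text, values: [toVectorLiteral(embedding), sourceType, limit] };
}
```

The returned `text` and `values` pair matches the shape most PostgreSQL clients accept for parameterized queries, so the filter and the similarity ranking travel in a single round trip.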

Why hybrid search instead of pure semantic?

Vector search alone misses exact matches. Full-text alone misses semantic similarity. Reciprocal rank fusion combines both, surfacing results that rank high in either system.

Why MCP server architecture?

MCP (Model Context Protocol) lets Claude Desktop, Cursor, and other tools access the knowledge base through a standard interface. Better than one-off integrations.

Key Tradeoffs

PostgreSQL over specialized vector DBs (Pinecone, Weaviate)

Sacrificed some vector search performance for SQL flexibility and simpler infrastructure. Right call because filtering and metadata queries are critical.

Chunking with overlap vs. simple splits

Increased storage by 30% but dramatically improved retrieval quality. Context preservation at chunk boundaries matters more than space.

OpenAI embeddings vs. local models

Accepted added latency and API cost in exchange for better embedding quality. Worth it because search quality is the entire value proposition.

Implementation Highlights

Multi-format ingestion pipeline

Handles PDF (text extraction with OCR fallback), Word documents, Markdown, and plain text. Converts emails (mbox, EML) and HTML into searchable text. Parses Google Takeout JSON exports.
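
As a rough illustration of the first stage of such a pipeline, the sketch below routes files to a format by extension. The `SourceFormat` names, the extension table, and `detectFormat` are hypothetical; in particular, treating every `.json` file as a Takeout export is a simplifying assumption, and the actual extractors (PDF text, OCR, email parsing) are out of scope here.

```typescript
type SourceFormat =
  | "pdf"
  | "docx"
  | "markdown"
  | "text"
  | "html"
  | "email"
  | "takeout-json";

// Hypothetical extension table covering the formats listed above.
const EXTENSION_TO_FORMAT = new Map<string, SourceFormat>([
  ["pdf", "pdf"],
  ["docx", "docx"],
  ["md", "markdown"],
  ["txt", "text"],
  ["html", "html"],
  ["htm", "html"],
  ["eml", "email"],
  ["mbox", "email"],
  ["json", "takeout-json"], // assumption: JSON input is a Takeout export
]);

// Returns the format for a filename, or undefined when it is unsupported.
function detectFormat(filename: string): SourceFormat | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0 || dot === filename.length - 1) return undefined;
  return EXTENSION_TO_FORMAT.get(filename.slice(dot + 1).toLowerCase());
}

// Splits a file list into supported files (with their format) and the rest.
function planIngestion(filenames: string[]): {
  supported: Array<{ filename: string; format: SourceFormat }>;
  unsupported: string[];
} {
  const supported: Array<{ filename: string; format: SourceFormat }> = [];
  const unsupported: string[] = [];
  for (const filename of filenames) {
    const format = detectFormat(filename);
    if (format) supported.push({ filename, format });
    else unsupported.push(filename);
  }
  return { supported, unsupported };
}
```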

Smart chunking with context preservation

Splits documents into 500-token chunks with 50-token overlap. Preserves document metadata and position for context. Maintains sentence boundaries to avoid mid-sentence splits.
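
The chunking described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: tokens are approximated by whitespace-separated words (a real tokenizer would count differently), sentences are detected naively by trailing `.`, `!`, or `?`, and all names (`chunkText`, `splitSentences`, `Chunk`) are hypothetical.

```typescript
interface Chunk {
  index: number; // position of the chunk within the document
  text: string;
  tokenCount: number;
}

// Stand-in tokenizer: one token per whitespace-separated word.
function countTokens(text: string): number {
  return text.split(/\s+/).filter((w) => w.length > 0).length;
}

// Naive sentence splitter: a sentence ends at a word ending in ., !, or ?.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let words: string[] = [];
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    words.push(word);
    if (/[.!?]$/.test(word)) {
      sentences.push(words.join(" "));
      words = [];
    }
  }
  if (words.length > 0) sentences.push(words.join(" "));
  return sentences;
}

// Packs whole sentences into chunks of at most maxTokens, carrying up to
// overlapTokens of trailing sentences into the next chunk. A single sentence
// longer than maxTokens becomes its own oversized chunk.
function chunkText(text: string, maxTokens = 500, overlapTokens = 50): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  const flush = (): void => {
    chunks.push({ index: chunks.length, text: current.join(" "), tokenCount: currentTokens });
  };

  for (const sentence of splitSentences(text)) {
    const n = countTokens(sentence);
    if (current.length > 0 && currentTokens + n > maxTokens) {
      flush();
      // Collect trailing sentences that fit inside the overlap budget.
      let carried: string[] = [];
      let carriedTokens = 0;
      for (const previous of current.slice().reverse()) {
        const t = countTokens(previous);
        if (carriedTokens + t > overlapTokens) break;
        carried.unshift(previous);
        carriedTokens += t;
      }
      // Drop the overlap if it would push the next chunk over the limit.
      if (carriedTokens + n > maxTokens) {
        carried = [];
        carriedTokens = 0;
      }
      current = carried;
      currentTokens = carriedTokens;
    }
    current.push(sentence);
    currentTokens += n;
  }
  if (current.length > 0) flush();
  return chunks;
}
```

Because overlap is carried in whole sentences, the effective overlap can be smaller than the configured budget; that keeps every chunk starting and ending on a sentence boundary.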

Reciprocal rank fusion algorithm

Combines each result's rank in the pgvector similarity ordering with its rank in PostgreSQL full-text search. RRF formula: score(d) = Σ 1/(k + rank_i(d)), summed over the result lists i, where k = 60 and ranks start at 1. Surfaces results that are strong in either dimension.
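
The fusion step is small enough to sketch directly. The function below takes ranked lists of result IDs (best first) and applies the RRF formula with k = 60; the function name, the use of string IDs, and the assumption that an ID appears at most once per list are illustrative choices, not details from the project.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) for every ID
// it contains, where rank is the 1-based position in that list.
function reciprocalRankFusion(rankedLists: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Example: IDs ordered by vector similarity and by full-text rank.
const vectorRanked = ["a", "b", "c"];
const textRanked = ["c", "a", "d"];
const fused = reciprocalRankFusion([vectorRanked, textRanked]);
// "a" and "c" appear in both lists, so they outrank "b" and "d".
```

Only ranks enter the formula, so the raw cosine distances and full-text scores never need to be normalized against each other.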

MCP integration

Implements the Model Context Protocol for Claude Desktop integration. Exposes search, document retrieval, and metadata tools. Handles async operations for large result sets.
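
To show roughly what exposing a tool over MCP involves, here is a hand-rolled sketch of the two JSON-RPC methods a client uses for tools, `tools/list` and `tools/call`. It is deliberately not the official SDK and omits the transport and the `initialize` handshake; the `search_knowledge` tool name, its schema, and its placeholder handler are assumptions for illustration.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: Record<string, unknown>;
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string;
  result?: unknown;
  error?: { code: number; message: string };
}

interface Tool {
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema for the arguments
  handler: (args: Record<string, unknown>) => Promise<string>;
}

// Hypothetical tool registry; the handler stands in for a real search call.
const tools = new Map<string, Tool>([
  [
    "search_knowledge",
    {
      description: "Hybrid search over the personal knowledge base",
      inputSchema: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
      handler: async (args) => `Results for: ${String(args.query)}`,
    },
  ],
]);

async function handleRequest(req: JsonRpcRequest): Promise<JsonRpcResponse> {
  const base = { jsonrpc: "2.0" as const, id: req.id };

  if (req.method === "tools/list") {
    const list = Array.from(tools, ([name, t]) => ({
      name,
      description: t.description,
      inputSchema: t.inputSchema,
    }));
    return { ...base, result: { tools: list } };
  }

  if (req.method === "tools/call") {
    const name = typeof req.params?.name === "string" ? req.params.name : "";
    const tool = tools.get(name);
    if (!tool) {
      return { ...base, error: { code: -32602, message: `Unknown tool: ${name}` } };
    }
    const raw = req.params?.arguments;
    const args = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
    try {
      const text = await tool.handler(args);
      return { ...base, result: { content: [{ type: "text", text }] } };
    } catch (err) {
      // Tool failures are reported inside the result, flagged with isError.
      return { ...base, result: { content: [{ type: "text", text: String(err) }], isError: true } };
    }
  }

  return { ...base, error: { code: -32601, message: "Method not found" } };
}
```

Keeping handlers asynchronous lets a slow search or a large result set resolve without blocking other requests on the same connection.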

Outcomes

Supported formats

PDF, DOCX, MD, TXT, HTML, email, Google Takeout

What I'd Do Differently

Start with chunking strategy experimentation earlier. Spent too long on infrastructure before validating retrieval quality.
Build telemetry from day one. Added metrics late, which delayed performance optimization.
Implement batch embedding API calls sooner. Initial one-at-a-time approach was 10x slower.
Design the metadata schema with evolution in mind. Early schema changes required painful migrations.
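
The batching lesson above can be sketched as a small helper that groups chunk texts and sends each group in one request instead of one request per chunk. The `embedBatch` callback is a hypothetical stand-in for whatever embedding client is in use, and the default batch size of 100 is an arbitrary assumption rather than a measured optimum.

```typescript
// Hypothetical embedding call: takes many texts, returns one vector per text.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Splits items into consecutive groups of at most batchSize.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embeds all texts with one call per batch, preserving input order.
async function embedAll(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 100
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    const result = await embedBatch(batch);
    if (result.length !== batch.length) {
      throw new Error("embedding count does not match batch size");
    }
    vectors.push(...result);
  }
  return vectors;
}
```

With this shape, 10,000 chunks at a batch size of 100 need 100 requests rather than 10,000, which is where most of the speedup from batching comes from.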