logpare
Semantic Log Compression for LLMs
A tool that compresses repetitive log files by 60-90% while preserving critical information, designed specifically for fitting logs into LLM context windows. Implements the Drain algorithm for log template mining.
The Problem
Modern applications generate massive log files. When debugging with AI tools like Claude, you hit context window limits almost immediately. A typical 10,000-line log file might be 90% repetitive patterns and 10% actual signal, but you can't tell which 10% matters.
- LLM context windows fill with repetitive log patterns
- Can't share full logs with AI debugging assistants
- Signal-to-noise ratio in logs is terrible (roughly 1:10 or worse)
- Traditional compression (gzip) doesn't help: an LLM can't read or analyze compressed bytes
Constraints
- Must be fast: at least 10,000 lines/second, or developers won't use it
- Can't lose important details (stack traces, error IDs, etc.)
- Must work as an MCP server for Claude integration
- Zero configuration: should work on any log format
- Must preserve temporal patterns (error bursts, etc.)
The Approach
Implemented the Drain algorithm for log parsing, which mines log templates by recognizing patterns in log structure. The tool groups logs by severity and frequency, then shows representative examples plus counts. This is lossy compression that preserves what matters.
Why Drain algorithm vs. regex-based parsing?
Drain automatically discovers log templates without configuration. Regex requires knowing log formats upfront. Drain handles unknown/mixed formats.
Why lossy compression vs. lossless?
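Lossless compression keeps all data, but the output is either unreadable to an LLM (compressed bytes) or just as long once decompressed, so it doesn't reduce context size. Lossy compression (templates + counts + examples) preserves patterns while dramatically reducing size.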
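The "templates + counts + examples" output can be illustrated with a minimal sketch. It uses simple regex masking rather than Drain to form templates, and the masking rules and the function names to_template and summarize are assumptions made for illustration, not logpare's actual code.

```python
import re

# Hypothetical masking rules: replace values that vary between otherwise
# identical lines. Order matters: timestamps are masked before bare numbers.
_MASKS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TS>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Mask variable parts of a line so similar lines share one template."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


def summarize(lines):
    """Return [(template, count, first_example)] in first-seen order."""
    groups = {}  # template -> [count, first example line]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key = to_template(line)
        if key in groups:
            groups[key][0] += 1
        else:
            groups[key] = [1, line]
    return [(tpl, count, example) for tpl, (count, example) in groups.items()]
```

Each output row keeps one real line as an example, so specific values stay visible for at least one occurrence even though the template itself drops them.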
Why group by severity?
Errors matter more than info logs. Severity grouping surfaces critical issues first. Better signal for debugging than chronological order.
Key Tradeoffs
Templates lose specific values (IDs, timestamps, paths)
Dramatic size reduction (60-90% typical, up to 99.8% on repetitive logs). For debugging, patterns matter more than every individual value. Can still show representative examples.
Async processing for large files (>1MB)
Prevents blocking, but adds complexity. Worth it because large logs are the entire use case. Synchronous processing would block the caller for the whole run on multi-GB files.
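One way to express the size-based split is sketched below, assuming "1MB" means 10**6 bytes and using a worker thread for large files. The names process and count_lines are hypothetical, and count_lines only stands in for the real compression work.

```python
import asyncio
import os

ASYNC_THRESHOLD_BYTES = 1_000_000  # assumption: "1MB" read as 10**6 bytes


def count_lines(path: str) -> int:
    """Hypothetical stand-in for the real (blocking) compression work."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


async def process(path: str) -> tuple[str, int]:
    """Run small files inline; push large ones to a worker thread.

    Returns which path was taken plus the result, so callers (and tests)
    can see the decision that was made.
    """
    if os.path.getsize(path) > ASYNC_THRESHOLD_BYTES:
        # asyncio.to_thread keeps the event loop responsive while the
        # blocking work runs in a separate thread.
        return "worker", await asyncio.to_thread(count_lines, path)
    return "inline", count_lines(path)
```

A caller would run it with asyncio.run(process("app.log")), where the path is a placeholder. The extra branch is the complexity cost mentioned above: two code paths that must produce the same result.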
In-memory processing vs. streaming
Faster and simpler code, but memory-bound. Right call for typical use case (logs under 100MB). Can add streaming later if needed.
Implementation Highlights
Drain algorithm implementation
Parse tree structure for efficient template matching. Depth-first search with similarity threshold. Automatically adapts to log format without configuration.
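A simplified sketch of the idea follows. Drain as originally described routes each line through a fixed-depth tree (by token count, then by leading tokens) and then picks the closest log group in the leaf by token-wise similarity. The version here collapses the tree to a single routing step and is not logpare's implementation; the class name DrainSketch, the 0.5 threshold, and the digit-based routing rule are all assumptions for illustration.

```python
from dataclasses import dataclass

WILDCARD = "<*>"


@dataclass
class Cluster:
    template: list[str]
    count: int = 1


def _has_digit(token: str) -> bool:
    return any(ch.isdigit() for ch in token)


class DrainSketch:
    """Simplified Drain-style miner: one routing step, then similarity match."""

    def __init__(self, sim_threshold: float = 0.5):
        self.sim_threshold = sim_threshold
        # (token count, routing key of first token) -> clusters in that leaf
        self.leaves: dict[tuple[int, str], list[Cluster]] = {}

    def add(self, line: str) -> Cluster:
        tokens = line.split()
        if not tokens:
            raise ValueError("empty log line")
        # Tokens containing digits are likely variables, so they are routed
        # through a shared wildcard key instead of creating a branch each.
        first = WILDCARD if _has_digit(tokens[0]) else tokens[0]
        leaf = self.leaves.setdefault((len(tokens), first), [])

        # Find the cluster whose template agrees with the most positions.
        best, best_sim = None, -1.0
        for cluster in leaf:
            same = sum(1 for a, b in zip(cluster.template, tokens) if a == b)
            sim = same / len(tokens)
            if sim > best_sim:
                best, best_sim = cluster, sim

        if best is not None and best_sim >= self.sim_threshold:
            # Merge: positions that differ become wildcards.
            best.template = [
                a if a == b else WILDCARD for a, b in zip(best.template, tokens)
            ]
            best.count += 1
            return best

        cluster = Cluster(template=list(tokens))
        leaf.append(cluster)
        return cluster

    def templates(self) -> list[tuple[str, int]]:
        """All (template string, count) pairs in insertion order."""
        return [
            (" ".join(c.template), c.count)
            for leaf in self.leaves.values()
            for c in leaf
        ]
```

Feeding it three "Connected to <ip> port <n>" lines with different addresses and ports yields a single template, "Connected to <*> port <*>", with a count of 3. A lower threshold merges more aggressively, and a higher one produces more, narrower templates.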
Severity-based grouping
Extracts severity from common formats (ERROR, WARN, INFO). Groups templates by severity for prioritized output. Shows error patterns first, info patterns last.
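Severity extraction and ordering can be sketched as follows. The regex, the treatment of WARNING as WARN, and the UNKNOWN bucket placed last are assumptions for illustration; real logs carry levels in more shapes than this covers.

```python
import re

# Output order: most urgent first. UNKNOWN (no level found) goes last.
SEVERITY_ORDER = ["ERROR", "WARN", "INFO", "UNKNOWN"]

# Matches an upper-case level as a whole word anywhere in the line.
_LEVEL_RE = re.compile(r"\b(ERROR|WARN(?:ING)?|INFO)\b")


def severity_of(line: str) -> str:
    """Return the first recognized level in the line, or 'UNKNOWN'."""
    match = _LEVEL_RE.search(line)
    if not match:
        return "UNKNOWN"
    level = match.group(1)
    return "WARN" if level == "WARNING" else level


def group_by_severity(lines):
    """Return [(level, lines)] ordered by SEVERITY_ORDER, skipping empty levels."""
    groups = {level: [] for level in SEVERITY_ORDER}
    for line in lines:
        groups[severity_of(line)].append(line)
    return [(level, items) for level, items in groups.items() if items]
```

Within each level the original line order is kept, so bursts of the same error still appear together.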
Smart format detection
Detects timestamps, log levels, and common patterns. Works with JSON logs, syslog, application logs. No format configuration required.
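A rough sketch of line-level format detection is shown below: try JSON first, then a classic syslog prefix, then a leading ISO-style timestamp, and let a sample of lines vote. The patterns, the category names, and the 50-line sample size are assumptions for illustration and cover only a few common shapes.

```python
import json
import re
from collections import Counter

# Classic syslog prefix, e.g. "Jan  5 14:03:22 host ..." with optional "<34>".
_SYSLOG_RE = re.compile(r"^(?:<\d{1,3}>)?[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s")
# Leading ISO-style timestamp, e.g. "2024-01-01T10:00:00" or "2024-01-01 10:00:00".
_ISO_TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def detect_line_format(line: str) -> str:
    """Classify one line as 'json', 'syslog', 'timestamped', or 'unknown'."""
    stripped = line.strip()
    if stripped.startswith("{"):
        try:
            if isinstance(json.loads(stripped), dict):
                return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but is not; fall through to other checks
    if _SYSLOG_RE.match(stripped):
        return "syslog"
    if _ISO_TS_RE.match(stripped):
        return "timestamped"
    return "unknown"


def detect_format(lines, sample_size: int = 50) -> str:
    """Majority vote over the first sample_size non-blank lines."""
    votes = Counter(
        detect_line_format(line) for line in lines[:sample_size] if line.strip()
    )
    return votes.most_common(1)[0][0] if votes else "unknown"
```

Voting over a sample rather than trusting the first line keeps a stray banner or stack-trace line from deciding the format for the whole file.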
MCP server integration
Exposes compression as an MCP tool for Claude Desktop. Handles large files (>1MB) asynchronously. Provides diagnostic prompts for error analysis.
Outcomes
What I'd Do Differently
- Add more configuration options for power users. Zero-config is great for most, but advanced users want tuning.
- Build better streaming support from the start. In-memory works for most cases but hits limits on multi-GB logs.
- Implement template merging. The current implementation sometimes creates too many near-duplicate templates that could be merged into one.
- Add interactive mode for template review. Would help users validate that compression preserves what they need.