Developer Tool • 2024

logpare

Semantic Log Compression for LLMs

A tool that compresses repetitive log files by 60-90% while preserving critical information, designed specifically for fitting logs into LLM context windows. Implements the Drain algorithm for log template mining.

Role: Creator & Lead Developer

Tech Stack

TypeScript, Drain Algorithm, MCP Server, +1 more

Website GitHub npm

The Problem

Modern applications generate massive log files. When debugging with AI tools like Claude, you hit context window limits immediately. A 10,000-line log file might be 90% repetitive patterns and 10% actual signal, but you can't tell in advance which 10% matters.

LLM context windows fill with repetitive log patterns
Can't share full logs with AI debugging assistants
Signal-to-noise ratio in logs is terrible (roughly 1:10 or worse)
Traditional compression (gzip) doesn't help: an LLM still can't analyze compressed logs

Constraints

Must be fast: at least 10,000 lines/second, or developers won't use it
Can't lose important details (stack traces, error IDs, etc.)
Must work as an MCP server for Claude integration
Zero configuration: should work on any log format
Must preserve temporal patterns (error bursts, etc.)

The Approach

Implemented the Drain algorithm for log parsing, which mines log templates by recognizing patterns in log structure. The tool groups templates by severity and frequency, then shows representative examples plus counts. The result is lossy compression that preserves what matters.

Why Drain algorithm vs. regex-based parsing?

Drain automatically discovers log templates without configuration. Regex requires knowing log formats upfront. Drain handles unknown/mixed formats.

Why lossy compression vs. lossless?

Lossless compression keeps all the data, but an LLM still has to read the decompressed text, so it doesn't reduce context size. Lossy compression (templates + counts + examples) preserves patterns while dramatically reducing size.
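To make "templates + counts + examples" concrete, here is a minimal sketch of what a compressed summary could look like. The `TemplateSummary` shape, the `renderSummaries` helper, and the `[xN]` output format are assumptions for illustration, not logpare's actual types or output.

```typescript
// Hypothetical shape for one mined template (not logpare's real types).
interface TemplateSummary {
  template: string;   // log line with variable parts replaced by <*>
  count: number;      // how many raw lines matched this template
  examples: string[]; // a few raw lines kept verbatim
}

// Render summaries as compact text, most frequent template first.
function renderSummaries(summaries: TemplateSummary[], maxExamples = 1): string {
  return [...summaries]
    .sort((a, b) => b.count - a.count)
    .map((s) => {
      const header = `[x${s.count}] ${s.template}`;
      const examples = s.examples
        .slice(0, maxExamples)
        .map((e) => `    e.g. ${e}`);
      return [header, ...examples].join("\n");
    })
    .join("\n");
}

const rendered = renderSummaries([
  {
    template: "Connection to <*> timed out",
    count: 37,
    examples: ["Connection to db-2 timed out"],
  },
  {
    template: "GET <*> 200 in <*>ms",
    count: 9400,
    examples: ["GET /api/users 200 in 12ms"],
  },
]);
console.log(rendered);
```

Thousands of raw lines collapse to a handful of summary lines, and the kept examples give the LLM one concrete instance of each pattern.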

Why group by severity?

Errors matter more than info logs. Severity grouping surfaces critical issues first. Better signal for debugging than chronological order.

Key Tradeoffs

Templates lose specific values (IDs, timestamps, paths)

Dramatic size reduction (60-90% typical, up to 99.8% on repetitive logs). For debugging, patterns matter more than every individual value. Can still show representative examples.
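For scale, a 99.8% reduction is roughly a 500:1 ratio. The sketch below shows the arithmetic, assuming size is measured in lines (the figures above don't specify whether they count lines, bytes, or tokens); `reductionPercent` is a hypothetical helper.

```typescript
// Percentage by which the compressed output is smaller than the input.
function reductionPercent(originalSize: number, compressedSize: number): number {
  if (originalSize <= 0) {
    throw new RangeError("originalSize must be positive");
  }
  return (1 - compressedSize / originalSize) * 100;
}

// 10,000 input lines summarized in 20 output lines.
console.log(reductionPercent(10000, 20).toFixed(1));
// 10,000 input lines summarized in 2,500 output lines.
console.log(reductionPercent(10000, 2500).toFixed(1));
```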

Async processing for large files (>1MB)

Prevents blocking, but adds complexity. Worth it because large logs are the entire use case. Synchronous would hang on multi-GB files.
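One common way to avoid blocking is to process lines in batches and yield to the event loop between batches. The sketch below illustrates that idea only; the names `shouldProcessAsync` and `processInBatches`, the batch size, and reading "1MB" as 1024 * 1024 bytes are all assumptions, not logpare's implementation.

```typescript
// Assumed reading of the ">1MB" threshold.
const ASYNC_THRESHOLD_BYTES = 1024 * 1024;

function shouldProcessAsync(byteLength: number): boolean {
  return byteLength > ASYNC_THRESHOLD_BYTES;
}

// Handle lines in fixed-size batches, yielding between batches so that
// pending I/O and timers can run while a large file is being processed.
async function processInBatches(
  lines: string[],
  batchSize: number,
  handleBatch: (batch: string[]) => void,
): Promise<number> {
  let processed = 0;
  for (let start = 0; start < lines.length; start += batchSize) {
    const batch = lines.slice(start, start + batchSize);
    handleBatch(batch);
    processed += batch.length;
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return processed;
}

const demoLines = Array.from({ length: 2500 }, (_, i) => `line ${i}`);
let demoBatches = 0;
processInBatches(demoLines, 1000, () => {
  demoBatches += 1;
}).then((total) => console.log(`${total} lines in ${demoBatches} batches`));
```

The extra complexity is visible even here: callers now receive a promise, and any shared state touched by `handleBatch` can be observed mid-run.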

In-memory processing vs. streaming

Faster and simpler code, but memory-bound. Right call for typical use case (logs under 100MB). Can add streaming later if needed.

Implementation Highlights

Drain algorithm implementation

Parse tree structure for efficient template matching. Depth-first search with similarity threshold. Automatically adapts to log format without configuration.
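The sketch below is a deliberately simplified take on the Drain idea, not logpare's code: lines are routed through a small tree (token count, then first token), and within a leaf each line is merged into the most similar existing template if it clears a threshold. The class name `MiniDrain`, the default threshold of 0.5, and whitespace tokenization are assumptions; a full Drain implementation adds preprocessing and a configurable tree depth.

```typescript
const WILDCARD = "<*>";

interface LogGroup {
  template: string[]; // tokens, with variable positions replaced by <*>
  count: number;      // number of lines merged into this template
}

function getOrCreate<K, V>(map: Map<K, V>, key: K, make: () => V): V {
  let value = map.get(key);
  if (value === undefined) {
    value = make();
    map.set(key, value);
  }
  return value;
}

// Fraction of positions where the template token equals the line token.
function similarity(template: string[], tokens: string[]): number {
  let same = 0;
  for (let i = 0; i < template.length; i++) {
    if (template[i] === tokens[i]) same++;
  }
  return template.length === 0 ? 0 : same / template.length;
}

class MiniDrain {
  // Tree levels: token count -> first-token key -> candidate groups.
  private readonly root = new Map<number, Map<string, LogGroup[]>>();
  private readonly similarityThreshold: number;

  constructor(similarityThreshold = 0.5) {
    this.similarityThreshold = similarityThreshold;
  }

  add(line: string): LogGroup {
    const tokens = line.trim().split(/\s+/);
    const first = tokens[0] ?? "";
    // First tokens containing digits are likely variables, so share a branch.
    const key = /\d/.test(first) ? WILDCARD : first;
    const branch = getOrCreate(this.root, tokens.length, () => new Map<string, LogGroup[]>());
    const groups = getOrCreate(branch, key, () => [] as LogGroup[]);

    let best: LogGroup | undefined;
    let bestScore = -1;
    for (const group of groups) {
      const score = similarity(group.template, tokens);
      if (score > bestScore) {
        best = group;
        bestScore = score;
      }
    }

    if (best && bestScore >= this.similarityThreshold) {
      // Merge: positions that differ become wildcards.
      best.template = best.template.map((t, i) => (t === tokens[i] ? t : WILDCARD));
      best.count += 1;
      return best;
    }
    const created: LogGroup = { template: tokens, count: 1 };
    groups.push(created);
    return created;
  }

  templates(): LogGroup[] {
    const all: LogGroup[] = [];
    this.root.forEach((branch) => {
      branch.forEach((groups) => {
        all.push(...groups);
      });
    });
    return all;
  }
}

const drain = new MiniDrain();
for (const line of [
  "Connected to 10.0.0.1 in 12ms",
  "Connected to 10.0.0.2 in 48ms",
  "Disk usage at 91%",
]) {
  drain.add(line);
}
for (const g of drain.templates()) {
  console.log(`[x${g.count}] ${g.template.join(" ")}`);
}
```

Because routing only inspects the token count and first token, each line is compared against a small set of candidates rather than every known template, which is what keeps matching cheap.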

Severity-based grouping

Extracts severity from common formats (ERROR, WARN, INFO). Groups templates by severity for prioritized output. Shows error patterns first, info patterns last.
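A rough sketch of severity extraction and ordering follows. It works on raw lines for brevity, whereas the description above groups templates; the regex, the `UNKNOWN` bucket, and the function names are assumptions rather than logpare's actual behavior.

```typescript
type Severity = "ERROR" | "WARN" | "INFO" | "UNKNOWN";

// Output order: most urgent first.
const SEVERITY_ORDER: Severity[] = ["ERROR", "WARN", "INFO", "UNKNOWN"];

// Case-sensitive on purpose, to avoid matching ordinary words like "info".
function detectSeverity(line: string): Severity {
  const match = /\b(ERROR|WARN|WARNING|INFO)\b/.exec(line);
  const level = match?.[1];
  if (level === "ERROR") return "ERROR";
  if (level === "WARN" || level === "WARNING") return "WARN";
  if (level === "INFO") return "INFO";
  return "UNKNOWN";
}

function groupBySeverity(lines: string[]): Array<[Severity, string[]]> {
  const buckets = new Map<Severity, string[]>();
  for (const line of lines) {
    const severity = detectSeverity(line);
    const bucket = buckets.get(severity) ?? [];
    bucket.push(line);
    buckets.set(severity, bucket);
  }
  // Emit non-empty buckets in priority order.
  const ordered: Array<[Severity, string[]]> = [];
  for (const severity of SEVERITY_ORDER) {
    const bucket = buckets.get(severity);
    if (bucket) ordered.push([severity, bucket]);
  }
  return ordered;
}

const grouped = groupBySeverity([
  "2024-01-01 INFO service started",
  "2024-01-01 ERROR database unreachable",
  "2024-01-01 WARNING slow query",
  "2024-01-01 INFO request completed",
]);
for (const [severity, lines] of grouped) {
  console.log(`${severity}: ${lines.length}`);
}
```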

Smart format detection

Detects timestamps, log levels, and common patterns. Works with JSON logs, syslog, application logs. No format configuration required.
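As an illustration of zero-config detection, the sketch below classifies a line as JSON, syslog-style, or plain text, and masks ISO-like timestamps so they don't split otherwise identical lines into separate templates. The heuristics, regexes, and the `<TIMESTAMP>` placeholder are assumptions; the real detection rules may differ.

```typescript
type LogFormat = "json" | "syslog" | "plain";

function detectFormat(line: string): LogFormat {
  const trimmed = line.trim();
  if (trimmed.startsWith("{") && trimmed.endsWith("}")) {
    try {
      JSON.parse(trimmed);
      return "json";
    } catch {
      // Not valid JSON; fall through to the other checks.
    }
  }
  // BSD-syslog-style prefix, e.g. "Jan  5 10:15:02 host app[123]: message".
  if (/^[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\S+\s/.test(trimmed)) {
    return "syslog";
  }
  return "plain";
}

// ISO-8601-like timestamps with optional fraction and timezone.
const ISO_TIMESTAMP =
  /\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?/g;

function maskTimestamps(line: string): string {
  return line.replace(ISO_TIMESTAMP, "<TIMESTAMP>");
}

console.log(detectFormat('{"level":"error","msg":"boom"}'));
console.log(detectFormat("Jan  5 10:15:02 web-1 nginx[812]: upstream timed out"));
console.log(maskTimestamps("2024-03-01T10:15:02Z ERROR payment failed"));
```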

MCP server integration

Exposes compression as an MCP tool for Claude Desktop. Handles large files (>1MB) asynchronously. Provides diagnostic prompts for error analysis.
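To show what "exposing compression as an MCP tool" involves, the sketch below models only the data shapes: an MCP tool is advertised with a `name`, `description`, and JSON Schema `inputSchema`, and a call returns a `content` array of text items. It deliberately avoids any SDK; the tool name `compress_logs`, its `logs` argument, and the placeholder duplicate-counting logic are hypothetical stand-ins for logpare's real tool.

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

interface ToolResult {
  content: Array<{ type: "text"; text: string }>;
  isError?: boolean;
}

// Hypothetical tool descriptor, as a server would list it to the client.
const compressLogsTool: ToolDefinition = {
  name: "compress_logs",
  description: "Compress repetitive log text into patterns with counts.",
  inputSchema: {
    type: "object",
    properties: {
      logs: { type: "string", description: "Raw log text, one entry per line" },
    },
    required: ["logs"],
  },
};

// Hypothetical handler: validates arguments and returns a text result.
function handleCompressLogs(args: unknown): ToolResult {
  const logs =
    typeof args === "object" && args !== null
      ? (args as { logs?: unknown }).logs
      : undefined;
  if (typeof logs !== "string") {
    return {
      content: [{ type: "text", text: "Expected a string `logs` argument." }],
      isError: true,
    };
  }
  // Placeholder for real template mining: count exact duplicate lines.
  const counts = new Map<string, number>();
  for (const line of logs.split("\n")) {
    if (line.trim() === "") continue;
    counts.set(line, (counts.get(line) ?? 0) + 1);
  }
  const out: string[] = [];
  counts.forEach((n, line) => {
    out.push(`[x${n}] ${line}`);
  });
  return { content: [{ type: "text", text: out.join("\n") }] };
}

console.log(compressLogsTool.name);
console.log(handleCompressLogs({ logs: "ping ok\nping ok\ndisk full\n" }).content[0]?.text);
```

Returning validation failures as a result with `isError` set, rather than throwing, lets the model see the message and retry with corrected arguments.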

Outcomes

MCP integration: Yes. Available as a Claude Desktop tool.

What I'd Do Differently

Add more configuration options for power users. Zero-config is great for most, but advanced users want tuning.
Build better streaming support from the start. In-memory works for most cases but hits limits on multi-GB logs.
Implement template merging. The tool sometimes creates too many near-identical templates that could be merged.
Add interactive mode for template review. Would help users validate that compression preserves what they need.
