AI-Powered Bug Fix Automation
From error detection to pull request in seconds
An automation system that converts accessibility and performance issues into GitHub pull requests. A Chrome extension extracts issue data from error monitoring tools, sanitizes PII, enriches with AI analysis, and creates PRs automatically.
The Problem
Developers spent 10-30 minutes per bug report on repetitive triage: reading the error, finding the file, understanding context, creating a PR. With dozens of issues arriving daily from monitoring tools, triage alone consumed hours of engineering time each day.
- 10-30 minutes of manual work per bug report
- Inconsistent issue descriptions and fix approaches
- Sensitive customer data in error logs risked exposure to LLMs
- Context switching between monitoring tool, codebase, and GitHub
Constraints
- PII must never reach LLM APIs (customer emails, IPs, session IDs)
- Extension must work within Manifest V3 restrictions
- Low latency: under 15 seconds end-to-end, or developers won't use it
- Must integrate with existing GitHub workflow (PRs, not commits)
The Approach
Three-phase architecture: Chrome extension captures and signs requests with HMAC, Express backend sanitizes PII and enriches with Claude Haiku, then creates PRs via AI agent API. Deterministic sanitization enables 24-hour caching.
Why Chrome extension over bookmarklet or browser automation?
Manifest V3 extensions get event-driven service workers and persistent storage APIs (the service worker itself is not persistent, so config lives in extension storage). Bookmarklets can't persist config. Browser automation (Playwright) would require an always-running process and can't inject UI elements.
Why HMAC authentication between extension and backend?
The extension runs in the user's browser, so the backend can't trust the request origin alone. HMAC with a shared secret shows that the request was signed by a holder of that secret. Timing-safe comparison prevents timing attacks on signature validation.
Why Claude Haiku specifically?
Cost: $0.25/1M input, $1.25/1M output tokens, about $0.0008 per request. Speed: ~7-8 second latency. Quality: Haiku handles structured output (JSON schema) reliably. Temperature 0.2 for near-deterministic output (low temperature reduces variation but does not eliminate it).
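As a rough check on the per-request figure, the arithmetic can be sketched as below. The token counts (about 2,000 input and 250 output tokens) are illustrative assumptions, not measurements from this project; only the per-million-token prices come from the text above.

```typescript
// Prices quoted above, in USD per one million tokens.
const INPUT_PRICE_PER_MTOK = 0.25;
const OUTPUT_PRICE_PER_MTOK = 1.25;

// Estimated cost of a single request in USD.
function estimateRequestCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_PRICE_PER_MTOK +
    (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK
  );
}

// Assumed (hypothetical) token counts: ~2,000 in, ~250 out.
const exampleCost = estimateRequestCost(2_000, 250);
console.log(exampleCost.toFixed(4)); // "0.0008"
```

Under these assumptions the input side contributes about $0.0005 and the output side about $0.0003, which is consistent with the quoted $0.0008.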
Why deterministic PII sanitization?
Same input always produces same sanitized output (REDACTED_EMAIL, REDACTED_IP, etc.). Enables 24-hour caching by payload hash, so duplicate errors don't hit the LLM twice. Also provides an audit trail.
Key Tradeoffs
Aggressive PII sanitization may remove useful context
Security over convenience. Email addresses and IPs can't help LLM understand code bugs. Session IDs are replaced, not removed entirely, preserving structure.
Single LLM (Claude) instead of model-agnostic
Optimizing prompts for one model is more effective than generic prompts. Claude's structured JSON output is reliable when the prompt includes the schema. Can add fallback to GPT later if needed.
Backend required (not extension-only)
Any API key shipped inside an extension can be extracted by whoever installs it, so the LLM and GitHub credentials stay on the backend. Backend also enables PII sanitization, caching, and audit logging that the extension can't do securely.
Implementation Highlights
HMAC request signing with Web Crypto API
Extension signs each payload with HMAC-SHA256 via the browser's Web Crypto API. Backend recomputes the signature and validates it with a timing-safe comparison to prevent timing attacks. Shared secret stored in extension config.
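A minimal sketch of the backend half of this check, assuming a hex-encoded signature and Node's built-in `crypto` module. The function names, secret, and payload are hypothetical; `signPayload` stands in for the extension side, where the same digest would come from `crypto.subtle.sign("HMAC", ...)`.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a hex-encoded HMAC-SHA256 signature over the raw request body.
function signPayload(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body, "utf8").digest("hex");
}

// Verify a signature without leaking how many leading bytes matched.
function verifySignature(body: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(body, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject early instead.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Hypothetical secret and payload, for illustration only.
const demoSecret = "example-shared-secret";
const demoBody = JSON.stringify({ errorType: "TypeError", title: "x is undefined" });
const demoSignature = signPayload(demoBody, demoSecret);

console.log(verifySignature(demoBody, demoSignature, demoSecret));       // true
console.log(verifySignature(demoBody + " ", demoSignature, demoSecret)); // false
```

Signing the raw body string (rather than a re-serialized object) avoids mismatches caused by key ordering or whitespace differences between the two sides.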
Deterministic PII sanitization
Regex patterns for emails, credit cards, session IDs, phone numbers, IPv4/IPv6. Bounded quantifiers prevent ReDoS. Output is always identical for same input, enabling cache keying.
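The idea can be sketched as an ordered list of pattern-to-placeholder rules. The patterns below are illustrative assumptions rather than the project's actual expressions, and they cover only emails, IPv4 addresses, card numbers, and session IDs; `REDACTED_CARD` and `REDACTED_SESSION` are assumed placeholder names.

```typescript
// Each rule maps a pattern to a fixed placeholder. Quantifiers are bounded
// ({1,64} rather than +) so backtracking is capped on hostile input.
const PII_RULES: ReadonlyArray<{ pattern: RegExp; placeholder: string }> = [
  {
    pattern: /[A-Za-z0-9._%+-]{1,64}@[A-Za-z0-9.-]{1,253}\.[A-Za-z]{2,24}/g,
    placeholder: "REDACTED_EMAIL",
  },
  { pattern: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g, placeholder: "REDACTED_IP" },
  // 13 to 19 digits with optional space or dash separators.
  { pattern: /\b(?:\d[ -]?){12,18}\d\b/g, placeholder: "REDACTED_CARD" },
  // Keep the parameter name so the URL structure survives; replace only the value.
  {
    pattern: /\b(session_?id|sid)=[A-Za-z0-9_-]{8,128}/gi,
    placeholder: "$1=REDACTED_SESSION",
  },
];

// Apply every rule in order. No randomness or state is involved,
// so the same input always yields the same output.
function sanitize(text: string): string {
  return PII_RULES.reduce(
    (current, rule) => current.replace(rule.pattern, rule.placeholder),
    text,
  );
}

const sample = "User alice@example.com from 203.0.113.7 hit /checkout?sid=abc123def456";
console.log(sanitize(sample));
// "User REDACTED_EMAIL from REDACTED_IP hit /checkout?sid=REDACTED_SESSION"
```

Rule order matters in a scheme like this: the IPv4 rule runs before the card rule so that dotted addresses are never partially consumed as digit runs.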
GitHub file verification before LLM
Uses GitHub API to verify suspected files exist in repo before asking LLM. Prevents hallucinated file paths. Octokit integration searches by filename patterns from stack trace.
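A simplified sketch of the verification step, assuming the repository's file list has already been fetched (for example from GitHub's Git trees endpoint) and is passed in as an array. The function names, the stack trace format, and the set of file extensions are hypothetical; the network call itself is omitted.

```typescript
// Pull candidate source paths out of a stack trace line such as
// "at render (webpack:///src/components/Cart.tsx:42:13)".
function extractCandidatePaths(stackTrace: string): string[] {
  const pattern = /([\w@.\/-]{1,200}\.(?:tsx|ts|jsx|js)):\d{1,7}(?::\d{1,7})?/g;
  const found = new Set<string>();
  for (const match of stackTrace.matchAll(pattern)) {
    // Drop leading slashes left over from the bundler scheme prefix.
    found.add(match[1].replace(/^\/+/, ""));
  }
  return [...found];
}

// Keep only repository paths whose filename matches a candidate.
// Anything the stack trace mentions that is not in the repo is dropped,
// so the LLM is never asked about a file that does not exist.
function resolveInRepo(candidates: string[], repoPaths: string[]): string[] {
  const verified = new Set<string>();
  for (const candidate of candidates) {
    const fileName = candidate.split("/").pop() ?? candidate;
    for (const repoPath of repoPaths) {
      if (repoPath === fileName || repoPath.endsWith("/" + fileName)) {
        verified.add(repoPath);
      }
    }
  }
  return [...verified];
}

const trace = [
  "TypeError: Cannot read properties of undefined (reading 'total')",
  "    at render (webpack:///src/components/Cart.tsx:42:13)",
  "    at update (webpack:///src/legacy/Gone.js:7:1)",
].join("\n");
const repoFiles = ["app/src/components/Cart.tsx", "app/src/index.ts"];

console.log(resolveInRepo(extractCandidatePaths(trace), repoFiles));
// ["app/src/components/Cart.tsx"]
```

Matching on the filename rather than the full path tolerates bundler prefixes that differ from the repository layout, at the cost of possible ambiguity when two files share a name.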
Structured prompt with few-shot examples
System prompt includes JSON schema, target context (stack, framework), and two detailed examples. Output schema enforces task_title (50 chars), summary (400 chars), reproduction steps, fix plan.
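One way to enforce such a schema on the backend is to validate the model's reply before using it. This is a sketch under assumptions: the 50 and 400 character figures are treated as upper bounds, and `reproduction_steps` and `fix_plan` are assumed field names (only `task_title` and `summary` are named in the text above).

```typescript
// Shape of the model's reply. reproduction_steps and fix_plan are assumed names.
interface EnrichedIssue {
  task_title: string;
  summary: string;
  reproduction_steps: string[];
  fix_plan: string[];
}

const MAX_TITLE_CHARS = 50;
const MAX_SUMMARY_CHARS = 400;

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse raw model output; return the issue, or null if it violates the schema.
function parseEnrichedIssue(raw: string): EnrichedIssue | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // Not valid JSON at all.
  }
  if (typeof data !== "object" || data === null) return null;
  const { task_title, summary, reproduction_steps, fix_plan } =
    data as Record<string, unknown>;
  if (typeof task_title !== "string" || task_title.length > MAX_TITLE_CHARS) return null;
  if (typeof summary !== "string" || summary.length > MAX_SUMMARY_CHARS) return null;
  if (!isStringArray(reproduction_steps) || !isStringArray(fix_plan)) return null;
  return { task_title, summary, reproduction_steps, fix_plan };
}

const reply = JSON.stringify({
  task_title: "Guard against undefined cart total",
  summary: "Cart.tsx reads total before the cart has loaded.",
  reproduction_steps: ["Open /checkout with an empty session"],
  fix_plan: ["Add a null check before rendering the total"],
});
console.log(parseEnrichedIssue(reply)?.task_title); // "Guard against undefined cart total"
```

Rejecting a malformed reply outright (rather than truncating it) keeps bad output from turning into a PR; the caller can retry or fall back to manual triage.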
24-hour deduplication cache
Payload hash (URL + errorType + title) keys into cache. Identical errors within 24 hours return cached LLM response. Reduces costs and prevents duplicate PRs from flaky errors.
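A minimal in-memory sketch of this cache, assuming a SHA-256 key over the three fields and lazy eviction on read. The class and function names are hypothetical, and a production deployment might use an external store instead of a `Map`.

```typescript
import { createHash } from "node:crypto";

const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Build the cache key from the three fields named above. Serializing them as
// a JSON array keeps ("ab", "c", ...) distinct from ("a", "bc", ...).
function cacheKey(url: string, errorType: string, title: string): string {
  return createHash("sha256")
    .update(JSON.stringify([url, errorType, title]))
    .digest("hex");
}

class DedupCache<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting a day.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // Expired: evict lazily on read.
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + CACHE_TTL_MS });
  }
}

// Demonstration with a fake clock (all values hypothetical).
let fakeTime = 0;
const cache = new DedupCache<string>(() => fakeTime);
const key = cacheKey("https://app.example.com/checkout", "TypeError", "x is undefined");
cache.set(key, "cached LLM analysis");

fakeTime = CACHE_TTL_MS - 1;
console.log(cache.get(key)); // "cached LLM analysis"
fakeTime = CACHE_TTL_MS;
console.log(cache.get(key)); // undefined
```

Because the key is derived from sanitized, deterministic fields, two reports of the same error map to the same entry and only the first one reaches the LLM.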
Outcomes
What I'd Do Differently
- Build better feedback loop. Currently no way to mark generated PR as good/bad. Would improve prompts faster with explicit quality signals.
- Add multi-model fallback earlier. Claude outages would block all automation. Having GPT-4 fallback would improve reliability.
- Implement confidence thresholds. Low-confidence file locations should create issues, not PRs. Current approach creates some PRs that need significant manual editing.