INFO 7375  ·  FINAL PROJECT  ·  SPRING 2026
INFO 7375 v1.0 ARAVIND BALAJI · NORTHEASTERN

CodeSentinel

Multi-agent code review that refuses to hallucinate.

Three specialized LLM agents orchestrated through a LangGraph directed graph, grounded in OWASP Top 10 and CWE taxonomy, and gated by an adversarial Evaluator that rejects any finding it can't trace to a citation.

FP REDUCTION (REAL CLAUDE): 97% (baseline 30 → multi-agent 1)
TRUE POSITIVE RATE: 1.000 (both systems, real Claude Sonnet)
UNIT TESTS PASSING: 35/35 (mock mode, no API key)
COMPONENTS: 4/5 (rubric requires 2)

01 / THE PROBLEM

A single prompt produces a book report. We needed a peer review.

LLMs do competent code review on the easy cases. But one prompt to one model has three failure modes that matter in production — and prompt-tuning alone can't fix any of them.

Hallucinated vulnerabilities

Models invent plausible-sounding CWEs with no grounding. A finding that cites A03:2025 might be real, or the model might be pattern-matching on the word "database." There's no way to check.

Silent omissions

Real defects slip past. The single-prompt baseline in our benchmark missed yaml.load without SafeLoader and hashlib.md5 used for password hashing. The first is a textbook remote-code-execution vector; the second is a textbook weak-hash flaw. Neither was caught.

Untraceable findings

Even real findings arrive without provenance. If you can't trace a claim to an authoritative source, you can't hand it to a human reviewer, an auditor, or a customer.

No way to improve

Prompt-tuning is a dead end once it plateaus. The gains worth capturing live above the prompt — in routing, in retrieval, in who reviews whom. That's where the architecture has to go.

02 / ARCHITECTURE

Three agents. One adversarial reviewer. A bounded retry loop.

Specialization, grounding, and adversarial self-critique — wired together through LangGraph with explicit state, conditional routing, and a circuit breaker that prevents infinite retries.

[Architecture diagram. Source code (input_code) → AGENT 01 Security Sentinel (RAG-grounded, k=6; RAG index: OWASP + CWE + PAT, 56 passages) and AGENT 02 Quality Auditor (style, maintainability) → AGENT 03 Evaluator Guardian (adversarial; programmatic + LLM; schema, citation, confidence). APPROVED → Final Report (approved findings). REJECTED → route back with feedback. CIRCUIT BREAKER: retry ≤ 3, else terminate.]
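
The approve/reject routing and the circuit breaker can be sketched in plain Python. This is a dependency-free stand-in, not the project's LangGraph graph: `ReviewState`, `run_review`, `detect`, and `evaluate` are hypothetical names, and reading "retry ≤ 3" as one initial attempt plus up to three retries is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_RETRIES = 3  # from the diagram: "retry <= 3, else terminate"

@dataclass
class ReviewState:
    """Explicit state passed around the loop (field names are illustrative)."""
    input_code: str
    findings: List[dict] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)
    retries: int = 0
    status: str = "pending"  # becomes "approved" or "terminated"

Detector = Callable[[str, List[str]], List[dict]]
Evaluator = Callable[[List[dict]], Tuple[bool, List[str]]]

def run_review(code: str, detect: Detector, evaluate: Evaluator) -> ReviewState:
    state = ReviewState(input_code=code)
    while True:
        # The detector sees the evaluator's feedback from the previous round.
        state.findings = detect(state.input_code, state.feedback)
        approved, state.feedback = evaluate(state.findings)
        if approved:
            state.status = "approved"
            return state
        if state.retries >= MAX_RETRIES:
            # Circuit breaker: stop routing back instead of looping forever.
            state.status = "terminated"
            return state
        state.retries += 1
```

A caller would report `state.findings` only when `state.status == "approved"`. In LangGraph the same decision would typically live in a conditional edge that reads the retry counter from state.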
01
Security Sentinel
RAG-GROUNDED DETECTOR

Builds a retrieval query from language cues and suspicious tokens, fetches top-6 passages with two-pass semantic + lexical rerank, and emits findings that must cite a retrieved passage by ID.
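
A rough sketch of query building plus a two-pass rerank, under loud assumptions: the bag-of-words cosine below is only a stand-in for whatever embedding similarity the project uses, the `SUSPICIOUS` token list is illustrative, and `build_query`, `retrieve`, and the `pool` size are hypothetical. Only k=6 comes from the text.

```python
import math
import re
from collections import Counter
from typing import Dict, List

TOP_K = 6  # from the text: top-6 passages
SUSPICIOUS = ("yaml.load", "hashlib.md5", "pickle.loads")  # illustrative list

def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_.]+", text.lower())

def build_query(code: str, language: str) -> str:
    """Combine a language cue with any suspicious tokens found in the code."""
    hits = [tok for tok in SUSPICIOUS if tok in code]
    return " ".join([language, *hits])

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, passages: Dict[str, str], k: int = TOP_K, pool: int = 20) -> List[str]:
    q = Counter(tokens(query))
    bags = {pid: Counter(tokens(text)) for pid, text in passages.items()}
    # Pass 1 (stand-in for semantic similarity): keep the `pool` closest passages.
    first_pass = sorted(bags, key=lambda pid: cosine(q, bags[pid]), reverse=True)[:pool]
    # Pass 2 (lexical): rerank by how many distinct query tokens appear verbatim.
    # sorted() is stable, so ties keep their pass-1 order.
    second_pass = sorted(first_pass, key=lambda pid: len(q.keys() & bags[pid].keys()), reverse=True)
    return second_pass[:k]
```

The returned passage IDs are what a finding would later have to cite.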

02
Quality Auditor
STYLE & MAINTAINABILITY

Reviews style, error handling, and maintainability. Explicitly excluded from security territory — never emits CRITICAL severity. Capped at 10 findings per file to prevent noise.
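
One way the two guardrails could be enforced after the model responds, as a sketch only. The severity scale below CRITICAL, the choice to downgrade rather than drop, and the name `clamp_quality_findings` are all assumptions; the cap of 10 and the CRITICAL exclusion are from the text.

```python
from typing import Dict, List

MAX_FINDINGS = 10  # from the text: capped at 10 findings per file
SEVERITY_RANK: Dict[str, int] = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}  # assumed scale

def clamp_quality_findings(findings: List[dict]) -> List[dict]:
    clamped = []
    for finding in findings:
        finding = dict(finding)  # copy so the caller's data is not mutated
        if finding.get("severity") == "CRITICAL":
            # Assumption: downgrade instead of dropping the finding.
            finding["severity"] = "HIGH"
        clamped.append(finding)
    # Keep the most severe findings when the cap forces a cut.
    clamped.sort(key=lambda f: SEVERITY_RANK.get(f.get("severity"), 0), reverse=True)
    return clamped[:MAX_FINDINGS]
```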

03
Evaluator Guardian
ADVERSARIAL REVIEWER

Two-layer validation. Programmatic check enforces citations, evidence, remediation, and confidence. LLM layer adds semantic review on top. Rejections loop back with structured feedback.
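
The programmatic layer might look roughly like this. It covers only field presence and confidence; the finding schema, the `MIN_CONFIDENCE` threshold, and the function name are hypothetical, since the page does not state them.

```python
from typing import List

MIN_CONFIDENCE = 0.5  # hypothetical threshold; the source does not give one
REQUIRED_TEXT_FIELDS = ("citation", "evidence", "remediation")

def programmatic_check(finding: dict) -> List[str]:
    """Return structured rejection reasons; an empty list means the finding passes."""
    reasons = []
    for name in REQUIRED_TEXT_FIELDS:
        value = finding.get(name)
        if not isinstance(value, str) or not value.strip():
            reasons.append(f"missing {name}")
    confidence = finding.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        reasons.append("confidence is not a number")
    elif not 0.0 <= confidence <= 1.0:
        reasons.append("confidence outside [0, 1]")
    elif confidence < MIN_CONFIDENCE:
        reasons.append(f"confidence {confidence:.2f} below {MIN_CONFIDENCE:.2f}")
    return reasons
```

Returning reasons rather than a bare boolean is what lets a rejection loop back as feedback the detector can act on.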

03 / RUBRIC COVERAGE

Four of five core components. Rubric required two.

04 / MEASURED RESULTS

Same model. Same prompts. The architecture does the work.

The headline result is from the April 20, 2026 real-LLM run against Claude Sonnet. Both systems achieve perfect recall; the decisive difference is precision. The mock-mode suites below corroborate the direction, and the paired suite supplies statistical significance (McNemar's exact p = 0.0312).

↳ REAL CLAUDE SONNET · TOY SUITE · 10 SAMPLES · APRIL 20 2026
System                     TPR      FPR      False positives   CWE accuracy
Single-prompt baseline     1.000    0.789    30                1.000
Multi-agent CodeSentinel   1.000    0.111    1                 1.000
Reduction                  ±0.000   −0.678   −97%              ±0.000
eval/results/20260420_143220/  ·  Evaluator Guardian rejected 29 of 30 baseline hallucinations  ·  cost ~$2 total
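
The Reduction row follows directly from the two rows above it. A small worked check, using only numbers from the table; the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a count dropped, e.g. 0.5 for a 50% reduction."""
    return (before - after) / before

baseline_fp, multi_agent_fp = 30, 1
baseline_fpr, multi_agent_fpr = 0.789, 0.111

fp_reduction_pct = round(relative_reduction(baseline_fp, multi_agent_fp) * 100)
fpr_delta = round(multi_agent_fpr - baseline_fpr, 3)
rejected = baseline_fp - multi_agent_fp

print(fp_reduction_pct)  # 97
print(fpr_delta)         # -0.678
print(rejected)          # 29
```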
↳ MOCK-LLM · TOY SUITE · 10 SAMPLES · REPRODUCIBILITY ANCHOR
System                     TPR      FPR      CWE accuracy   Correct
Single-prompt baseline     0.750    0.000    1.000          6 / 8
Multi-agent CodeSentinel   1.000    0.000    1.000          8 / 8
Delta                      +0.250   ±0.000   ±0.000         +2
make benchmark  ·  McNemar's exact p = 0.5000  ·  small sample, direction favors multi-agent
↳ PAIRED SUITE · 20 SAMPLES · OWASP-BENCHMARK-STYLE · 10 TP + 10 FP TRAPS
System                     TPR      FPR      CWE accuracy   Youden
Single-prompt baseline     0.333    0.571    1.000          −0.238
Multi-agent CodeSentinel   1.000    0.182    1.000          +0.818
Delta                      +0.667   −0.389   ±0.000         +1.056
make benchmark-paired  ·  McNemar's exact p = 0.0312 (significant at α=0.05)  ·  6 discordant pairs, all favoring multi-agent
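
The Youden column and the p-value can be checked by hand. The sketch below uses the standard definitions (Youden's J = TPR − FPR; two-sided exact McNemar on the discordant pairs) and is not taken from the project's benchmark code; the function names are illustrative.

```python
from math import comb

def youden_j(tpr: float, fpr: float) -> float:
    return tpr - fpr

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability for the smaller count, doubled.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(youden_j(0.333, 0.571), 3))  # -0.238
print(round(youden_j(1.000, 0.182), 3))  # 0.818
print(mcnemar_exact(6, 0))               # 0.03125
```

Six discordant pairs all favoring one system give 2 × (1/2)^6 = 0.03125, which is the reported 0.0312. By the same formula, two one-sided discordant pairs would give 0.5, which matches the toy suite if its +2 difference is the only disagreement (an inference, not something the page states).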

The two false positives on the paired suite are hashlib.md5 used as a cache key (pattern detection cannot distinguish security vs. non-security use) and a dead-code vulnerable branch (pattern detection cannot reason about reachability). Both are documented limitations of pattern-based detection — the exact failure modes the OWASP Benchmark is designed to probe.

05 / THE SYSTEM CAUGHT ITS OWN BUG

We didn't design this test. The architecture produced it.

On the hashlib.md5 sample, the mock LLM was hard-coded to cite cwe_subset.csv::CWE-327 — the correct CWE, but not what the retriever surfaced in top-6. The Evaluator Guardian, doing exactly what it was built to do, rejected the finding for having a citation that didn't match the retrieval. The finding was suppressed. The system caught its own bug.

The fix was three characters: change the mock to cite patterns.md::PY-08 — the passage the retriever actually returns, and what a real LLM given that retrieval context would cite. No citation-enforcement policy was relaxed. Documented in the technical report, not papered over.
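
The rejection and the fix reduce to a membership test against the retrieved set. In this sketch only the two citation IDs come from the text; the other five retrieved IDs are placeholders and `citation_is_grounded` is a hypothetical name.

```python
from typing import Iterable

def citation_is_grounded(citation: str, retrieved_ids: Iterable[str]) -> bool:
    """A citation counts only if the retriever actually surfaced that passage."""
    return citation in set(retrieved_ids)

# Hypothetical top-6 for the hashlib.md5 sample; only PY-08 is named in the text.
retrieved_top6 = [
    "patterns.md::PY-08",
    "placeholder-2", "placeholder-3", "placeholder-4", "placeholder-5", "placeholder-6",
]

before_fix = "cwe_subset.csv::CWE-327"  # correct CWE, but not in the retrieved set
after_fix = "patterns.md::PY-08"        # the passage the retriever returns

print(citation_is_grounded(before_fix, retrieved_top6))  # False
print(citation_is_grounded(after_fix, retrieved_top6))   # True
```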

06 / REPRODUCE

Five commands. No API key required.

The pipeline runs end-to-end in mock mode with deterministic pattern-matched responses, so all 35 unit tests run with zero configuration. Set ANTHROPIC_API_KEY to switch to the real SDK.

# 1 · install pinned deps
pip install -r requirements.txt
# 2 · build the RAG index (56 passages, tri-backend fallback)
make ingest
# 3 · run 35 unit tests in mock mode
make test
# 4 · run the benchmark · baseline vs. multi-agent
make benchmark
# 5 · launch the streamlit UI
make ui

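
The mock/real switch could be as simple as the sketch below. Only the ANTHROPIC_API_KEY variable and the hashlib.md5 → patterns.md::PY-08 pairing come from this page; `select_backend`, `mock_findings`, and the rule table are hypothetical stand-ins for the project's own code.

```python
import os
from typing import List, Mapping, Optional

# Deterministic pattern -> citation rules for mock mode (single illustrative rule).
MOCK_RULES = {"hashlib.md5": "patterns.md::PY-08"}

def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Use the real SDK only when an API key is present; otherwise stay in mock mode."""
    env = os.environ if env is None else env
    return "anthropic" if env.get("ANTHROPIC_API_KEY", "").strip() else "mock"

def mock_findings(code: str) -> List[dict]:
    """Same input always yields the same findings, so tests need no network."""
    return [
        {"pattern": pattern, "citation": citation}
        for pattern, citation in MOCK_RULES.items()
        if pattern in code
    ]
```

Passing the environment in as a mapping keeps the switch itself testable without touching the real process environment.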
07 / LIVE DEMO

Try it. Watch it run.

The live Streamlit app runs against real Claude Sonnet. Paste any Python or JavaScript snippet and watch the three agents analyze it, the Evaluator Guardian validate citations, and the retry loop engage when findings get rejected. The seven-minute video walks through the full architecture and the ninety-seven percent false-positive reduction, beat by beat.

VIDEO · 7 MINUTES

Seven-minute walkthrough: architecture, live demo, the 30→1 result, and the bug the system caught before the author did. Same model, same prompts, different architecture.

LIVE STREAMLIT · REAL CLAUDE SONNET

The live demo runs on Streamlit Community Cloud against a personal Anthropic API account. Each analysis costs roughly $0.02–0.05 in credits. Mock mode is available in the repo for zero-cost reproduction.

Open the live demo

codesentinel-f2ggdvqeuwsj4pta5sk27s.streamlit.app

The Streamlit app exposes four tabs — Findings, Evaluator verdict, RAG citations, and Trace — so every claim the system makes is inspectable end to end.

08 / SOURCE & ARTIFACTS

Every file. Every test. Every measured number.