INFO 7375 v1.0 ARAVIND BALAJI · NORTHEASTERN

CodeSentinel

Multi-agent code review that refuses to hallucinate.

Three specialized LLM agents orchestrated through a LangGraph directed graph, grounded in OWASP Top 10 and CWE taxonomy, and gated by an adversarial Evaluator that rejects any finding it can't trace to a citation.


FP REDUCTION (REAL CLAUDE)

97%

baseline 30 → multi-agent 1

TRUE POSITIVE RATE

1.000

both systems, real Claude Sonnet

UNIT TESTS PASSING

35/35

mock mode, no API key

COMPONENTS

4/5

rubric requires 2

01 / THE PROBLEM

A single prompt produces a book report. We needed a peer review.

LLMs do competent code review on the easy cases. But one prompt to one model has four failure modes that matter in production, and prompt-tuning alone can't fix any of them.

Hallucinated vulnerabilities

Models invent plausible-sounding CWEs with no grounding. A finding that cites A03:2025 might be real, or the model might be pattern-matching on the word "database." There's no way to check.

Silent omissions

Real defects slip past. The single-prompt baseline in our benchmark missed yaml.load without SafeLoader and hashlib.md5 used for password hashing. The first is a textbook remote-code-execution vector, the second a textbook weak-hash flaw. Neither was caught.

Untraceable findings

Even real findings arrive without provenance. If you can't trace a claim to an authoritative source, you can't hand it to a human reviewer, an auditor, or a customer.

No way to improve

Prompt-tuning is a dead end once it plateaus. The gains worth capturing live above the prompt — in routing, in retrieval, in who reviews whom. That's where the architecture has to go.

02 / ARCHITECTURE

Three agents. One adversarial reviewer. A bounded retry loop.

Specialization, grounding, and adversarial self-critique — wired together through LangGraph with explicit state, conditional routing, and a circuit breaker that prevents infinite retries.
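The bounded retry loop can be sketched in plain Python, without LangGraph. This is a minimal sketch under stated assumptions: `ReviewState`, `run_with_retries`, the `detect` and `evaluate` callables, and the limit of two retries are all hypothetical names and values chosen for illustration, not the project's actual graph or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_RETRIES = 2  # assumed bound; the project's real limit is not stated on this page


@dataclass
class ReviewState:
    """Explicit state carried between steps (hypothetical shape)."""
    code: str
    findings: List[dict] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)
    attempts: int = 0
    tripped: bool = False  # True when the circuit breaker ended the loop


def run_with_retries(
    code: str,
    detect: Callable[[str, List[str]], List[dict]],
    evaluate: Callable[[List[dict]], Tuple[List[dict], List[str]]],
    max_retries: int = MAX_RETRIES,
) -> ReviewState:
    """Run detect -> evaluate, looping back with feedback until clean or out of retries."""
    state = ReviewState(code=code)
    while True:
        state.attempts += 1
        # The detector sees the evaluator's feedback from the previous attempt.
        candidates = detect(state.code, state.feedback)
        accepted, rejections = evaluate(candidates)
        state.findings = accepted
        state.feedback = rejections
        if not rejections:
            return state  # conditional route: nothing rejected, finish
        if state.attempts > max_retries:
            state.tripped = True  # circuit breaker: stop instead of looping forever
            return state
```

With `max_retries=2` the detector runs at most three times. Findings that are still rejected on the last attempt are dropped rather than passed through, which matches the page's framing that enforcement is never relaxed to end the loop.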

Security Sentinel

RAG-GROUNDED DETECTOR

Builds a retrieval query from language cues and suspicious tokens, fetches top-6 passages with two-pass semantic + lexical rerank, and emits findings that must cite a retrieved passage by ID.
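A two-pass rerank of this kind can be sketched with the standard library alone. The sketch below is an assumption-heavy stand-in: bag-of-words cosine similarity plays the role of the semantic pass (the real system would use embeddings), and `tokenize`, `retrieve`, and the shortlist size `pool` are hypothetical names, not the project's API.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Dotted identifiers such as yaml.load stay together as one token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")


def tokenize(text: str) -> List[str]:
    return [t.lower() for t in TOKEN_RE.findall(text)]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(
    query: str,
    passages: Dict[str, str],   # passage ID -> passage text
    suspicious: List[str],      # tokens pulled from the code under review
    k: int = 6,
    pool: int = 12,             # assumed shortlist size for pass 1
) -> List[str]:
    """Return up to k passage IDs: similarity shortlist, then lexical rerank."""
    q = Counter(tokenize(query))
    # Pass 1: similarity shortlist (stand-in for a semantic/embedding search).
    ranked = sorted(
        passages,
        key=lambda pid: cosine(q, Counter(tokenize(passages[pid]))),
        reverse=True,
    )
    shortlist = ranked[:pool]
    # Pass 2: promote passages that literally contain a suspicious token.
    # sorted() is stable, so ties keep their pass-1 order.
    wanted = {s.lower() for s in suspicious}

    def hits(pid: str) -> int:
        return len(wanted & set(tokenize(passages[pid])))

    return sorted(shortlist, key=hits, reverse=True)[:k]
```

The IDs returned here are the only ones a finding would be allowed to cite, which is what makes the citation requirement checkable downstream.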

Quality Auditor

STYLE & MAINTAINABILITY

Reviews style, error handling, and maintainability. Explicitly excluded from security territory — never emits CRITICAL severity. Capped at 10 findings per file to prevent noise.
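The two guardrails on the Quality Auditor can be enforced after the model responds, instead of trusting the prompt alone. A minimal sketch, assuming a five-level severity scale and a policy of downgrading CRITICAL to HIGH; `SEVERITY_ORDER` and `clamp_quality_findings` are hypothetical, and the project may drop such findings instead.

```python
from typing import List

MAX_FINDINGS = 10  # per-file cap stated on this page
SEVERITY_ORDER = ["INFO", "LOW", "MEDIUM", "HIGH", "CRITICAL"]  # assumed scale


def clamp_quality_findings(findings: List[dict]) -> List[dict]:
    """Downgrade CRITICAL to HIGH, then keep the ten most severe findings."""
    ceiling = SEVERITY_ORDER.index("HIGH")
    clamped = []
    for f in findings:
        level = min(SEVERITY_ORDER.index(f["severity"]), ceiling)
        # Copy the finding so the caller's list is left untouched.
        clamped.append({**f, "severity": SEVERITY_ORDER[level]})
    # Stable sort: most severe first, original order preserved within a level.
    clamped.sort(key=lambda f: SEVERITY_ORDER.index(f["severity"]), reverse=True)
    return clamped[:MAX_FINDINGS]
```

A severity label outside the assumed scale raises `ValueError`, which surfaces malformed model output early instead of silently passing it on.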

Evaluator Guardian

ADVERSARIAL REVIEWER

Two-layer validation. Programmatic check enforces citations, evidence, remediation, and confidence. LLM layer adds semantic review on top. Rejections loop back with structured feedback.
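The programmatic layer can be pictured as a pure function over a finding and the set of passage IDs the retriever returned. This is a sketch under assumptions: the field names (`citations`, `evidence`, `remediation`, `confidence`), the 0.5 confidence floor, and `validate_finding` itself are hypothetical and not taken from the project's code.

```python
from typing import Iterable, List

REQUIRED_TEXT_FIELDS = ("evidence", "remediation")  # assumed field names


def validate_finding(
    finding: dict,
    retrieved_ids: Iterable[str],
    min_confidence: float = 0.5,  # assumed threshold
) -> List[str]:
    """Return rejection reasons; an empty list means the finding passes."""
    reasons: List[str] = []
    allowed = set(retrieved_ids)

    citations = finding.get("citations") or []
    if not citations:
        reasons.append("no citation")
    for cited in citations:
        if cited not in allowed:
            reasons.append(f"citation {cited!r} was not among the retrieved passages")

    for name in REQUIRED_TEXT_FIELDS:
        value = finding.get(name)
        if not isinstance(value, str) or not value.strip():
            reasons.append(f"missing {name}")

    confidence = finding.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        reasons.append("confidence must be a number in [0, 1]")
    elif confidence < min_confidence:
        reasons.append(f"confidence {confidence} is below {min_confidence}")

    return reasons
```

The returned reasons double as the structured feedback for the retry: a finding that cites a passage ID outside the retrieved set is rejected with a message naming that ID, even if the cited CWE happens to be correct.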

03 / RUBRIC COVERAGE

Four of five core components. Rubric required two.

#	Component	Implementation	Status
01	Prompt Engineering	Three specialized system prompts with feedback injection on retry	IMPLEMENTED
02	Retrieval-Augmented Generation	56 passages, three-tier fallback (Chroma / sklearn / pure Python), two-pass rerank	IMPLEMENTED
03	Synthetic Data Generation	15 CWE templates, paired vuln/safe samples, independent regex verifier	IMPLEMENTED
04	Reinforcement Learning (bonus)	UCB-1 bandit over prompts, REINFORCE policy gradient over routing	IMPLEMENTED
05	Multimodal Integration	Not in scope: code review is a text-only task	N/A
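For readers unfamiliar with the bandit component, UCB-1 over a fixed set of prompt variants can be sketched in a few lines. This is the textbook algorithm, not the project's implementation: the class name `UCB1` and the reward signal (for example, 1.0 when the Evaluator accepts every finding, 0.0 otherwise) are assumptions made for illustration.

```python
import math


class UCB1:
    """Textbook UCB-1: pick the arm with the highest mean plus exploration bonus."""

    def __init__(self, n_arms: int) -> None:
        self.counts = [0] * n_arms    # pulls per arm (one arm per prompt variant)
        self.totals = [0.0] * n_arms  # summed reward per arm

    def select(self) -> int:
        # Play every arm once before trusting any estimate.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        t = sum(self.counts)

        def score(arm: int) -> float:
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=score)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.totals[arm] += reward
```

The exploration bonus shrinks as an arm is pulled more often, so a prompt variant that keeps earning reward is chosen most of the time while weaker variants are still retried occasionally.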

04 / MEASURED RESULTS

Same model. Same prompts. The architecture does the work.

The headline result is from the April 20, 2026 real-LLM run against Claude Sonnet. Both systems achieve perfect recall; the decisive difference is precision. The mock-mode suites below corroborate the direction, and of the two only the paired suite reaches statistical significance (p = 0.0312).

↳ REAL CLAUDE SONNET · TOY SUITE · 10 SAMPLES · APRIL 20 2026

System	TPR	FPR	False positives	CWE accuracy
Single-prompt baseline	1.000	0.789	30	1.000
Multi-agent CodeSentinel	1.000	0.111	1	1.000
Reduction	±0.000	−0.678	−97%	±0.000

↳ eval/results/20260420_143220/ · Evaluator Guardian rejected 29 of 30 baseline hallucinations · cost ~$2 total

↳ MOCK-LLM · TOY SUITE · 10 SAMPLES · REPRODUCIBILITY ANCHOR

System	TPR	FPR	CWE accuracy	Correct
Single-prompt baseline	0.750	0.000	1.000	6 / 8
Multi-agent CodeSentinel	1.000	0.000	1.000	8 / 8
Delta	+0.250	±0.000	±0.000	+2

↳ make benchmark · McNemar's exact p = 0.5000 · small sample, direction favors multi-agent

↳ PAIRED SUITE · 20 SAMPLES · OWASP-BENCHMARK-STYLE · 10 TP + 10 FP TRAPS

System	TPR	FPR	CWE accuracy	Youden
Single-prompt baseline	0.333	0.571	1.000	−0.238
Multi-agent CodeSentinel	1.000	0.182	1.000	+0.818
Delta	+0.667	−0.389	±0.000	+1.056

↳ make benchmark-paired · McNemar's exact p = 0.0312 (significant at α=0.05) · 6 discordant pairs, all favoring multi-agent
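The Youden and McNemar figures can be re-derived by hand. The sketch below uses the standard definitions (Youden's J = TPR − FPR, and the two-sided exact binomial form of McNemar's test on discordant pairs); the function names are hypothetical and this is not the project's benchmark code.

```python
from math import comb


def youden_j(tpr: float, fpr: float) -> float:
    """Youden's J = sensitivity + specificity - 1, which equals TPR - FPR."""
    return tpr - fpr


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the discordant-pair counts: samples where only one of the
    two systems was correct, split by which system that was.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of a split at least this lopsided.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(mcnemar_exact_p(6, 0))  # 0.03125
print(mcnemar_exact_p(2, 0))  # 0.5
```

Six discordant pairs all in one direction give 2 × (1/2)^6 = 0.03125, reported above as 0.0312. The toy suite's p = 0.5000 is what the same formula yields for two discordant pairs in one direction, which would be consistent with its +2 delta, although the pair counts for that suite are not listed on this page.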

The two false positives on the paired suite are hashlib.md5 used as a cache key (pattern detection cannot distinguish security vs. non-security use) and a dead-code vulnerable branch (pattern detection cannot reason about reachability). Both are documented limitations of pattern-based detection — the exact failure modes the OWASP Benchmark is designed to probe.

05 / THE SYSTEM CAUGHT ITS OWN BUG

We didn't design this test. The architecture produced it.

On the hashlib.md5 sample, the mock LLM was hard-coded to cite cwe_subset.csv::CWE-327 — the correct CWE, but not what the retriever surfaced in top-6. The Evaluator Guardian, doing exactly what it was built to do, rejected the finding for having a citation that didn't match the retrieval. The finding was suppressed. The system caught its own bug.

— §11.6 MOCK-REAL PARITY FOR CITATIONS · TECHNICAL REPORT

The fix was a small change to the mock: cite patterns.md::PY-08, the passage the retriever actually returns and what a real LLM given that retrieval context would cite. No citation-enforcement policy was relaxed. Documented in the technical report, not papered over.

06 / REPRODUCE

Five commands. No API key required.

The pipeline runs end-to-end in mock mode with deterministic pattern-matched responses, so all 35 unit tests run with zero configuration. Set ANTHROPIC_API_KEY to switch to the real Anthropic SDK.
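The mock/real switch can be pictured as a single environment check plus a table of regex rules. A minimal sketch, assuming hypothetical names (`select_backend`, `MOCK_RULES`, `mock_review`) and hypothetical labels; only the `ANTHROPIC_API_KEY` variable and the two detected patterns come from this page.

```python
import os
import re
from typing import List, Mapping, Optional


def select_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "real" when an API key is set, otherwise "mock"."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    return "real" if key else "mock"


# Deterministic stand-in for the LLM: a fixed pattern always yields the same label.
MOCK_RULES = [
    (re.compile(r"\byaml\.load\("), "unsafe YAML deserialization"),
    (re.compile(r"\bhashlib\.md5\("), "weak hash function"),
]


def mock_review(code: str) -> List[str]:
    """Return the label of every rule whose pattern appears in the code."""
    return [label for pattern, label in MOCK_RULES if pattern.search(code)]
```

Because the mock path involves no network call and no sampling, the same input always produces the same findings, which is what makes the test suite reproducible without credentials.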

# 1 · install pinned deps
pip install -r requirements.txt

# 2 · build the RAG index (56 passages, tri-backend fallback)
make ingest

# 3 · run 35 unit tests in mock mode
make test

# 4 · run the benchmark · baseline vs. multi-agent
make benchmark

# 5 · launch the streamlit UI
make ui
    

07 / LIVE DEMO

Try it. Watch it run.

The live Streamlit app runs against real Claude Sonnet. Paste any Python or JavaScript snippet and watch the three agents analyze it, the Evaluator Guardian validate citations, and the retry loop engage when findings get rejected. The seven-minute video walks through the full architecture and the ninety-seven percent false-positive reduction, beat by beat.

VIDEO · 7 MINUTES

Seven-minute walkthrough: architecture, live demo, the 30→1 result, and the bug the system caught before the author did. Same model, same prompts, different architecture.

LIVE STREAMLIT · REAL CLAUDE SONNET

The live demo runs on Streamlit Community Cloud against a personal Anthropic API account. Each analysis costs roughly $0.02–0.05 in credits. Mock mode is available in the repo for zero-cost reproduction.

Open the live demo ↗

codesentinel-f2ggdvqeuwsj4pta5sk27s.streamlit.app

The Streamlit app exposes four tabs — Findings, Evaluator verdict, RAG citations, and Trace — so every claim the system makes is inspectable end to end.

08 / SOURCE & ARTIFACTS

Every file. Every test. Every measured number.

GitHub repository ↗ Technical report (46 pages) ↗ Architecture doc ↗ Video walkthrough ↗