AEGIS
This page collects the full AEGIS benchmark results in one place: the primary checkpoint, the FP8 Qwen base, and the external frontier models we tested against the same security detection dataset. Every row is backed by a downloadable JSONL file for manual verification.
Benchmark thesis
Why AEGIS is the local model to beat
Best precision on the published board
AEGIS (Ours*)
Best recall across frontier comparisons
Claude Sonnet 4.6 Thinking
Best F1 overall
AEGIS (Ours*)
Best severity alignment
AEGIS (Ours*)
Best CWE alignment
AEGIS (Ours*)
Local benchmark focus
AEGIS leads the self-hosted comparison.
This is the apples-to-apples local comparison that matters on the public board: AEGIS versus the FP8 Qwen base, using the same structured benchmark contract throughout.
Key metrics comparison
Metric-by-metric comparison across the benchmark field
Each panel isolates one metric so you can scan which model actually leads on precision, recall, F1, accuracy, CWE alignment, and severity alignment without mentally untangling mixed bars.
[Charts: six ranked panels, one each for Precision, Recall, F1, Accuracy, CWE Accuracy, and Severity Accuracy. Each panel places all models on the same 0 to 100 scale.]
Head-to-head interpretation
AEGIS now clears the base on both verdict quality and structure.
Against the same FP8 Qwen base, AEGIS improves accuracy, precision, recall, F1, CWE alignment, and severity alignment. That makes it the strongest self-hosted benchmark result on this page, not just the cleanest structured formatter.
Metric overlay
Structured wins vs FP8 base
CWE alignment
73.0%
AEGIS beat the base on 35 samples and lost on 5.
Severity alignment
60.0%
Severity was the strongest structured gain: 51 checkpoint wins versus 17 baseline wins.
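The win and loss counts above read as a paired, per-sample comparison. The page does not say exactly how a "win" is scored, so the sketch below assumes one plausible reading: a win is a sample where the checkpoint's structured field (CWE or severity) is correct and the baseline's is not. The function name `tally_wins` and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def tally_wins(
    candidate_ok: Sequence[bool], baseline_ok: Sequence[bool]
) -> Tuple[int, int, int]:
    """Count paired outcomes as (wins, losses, ties) for the candidate.

    Assumption: each position is one benchmark sample, and each flag says
    whether that model's structured field matched the reference.
    """
    if len(candidate_ok) != len(baseline_ok):
        raise ValueError("both models must be scored on the same samples")
    wins = losses = ties = 0
    for cand, base in zip(candidate_ok, baseline_ok):
        if cand and not base:
            wins += 1      # candidate correct, baseline wrong
        elif base and not cand:
            losses += 1    # baseline correct, candidate wrong
        else:
            ties += 1      # both correct or both wrong
    return wins, losses, ties


# Toy data only; these are not the published per-sample results.
candidate = [True, True, False, True, False]
baseline = [False, True, False, False, True]
wins, losses, ties = tally_wins(candidate, baseline)
print(wins, losses, ties)  # 2 1 2
```

Ties are kept separate because samples where both models agree do not move the comparison either way; only the win and loss counts are reported on this page.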
Verdict gains vs FP8 base
Accuracy edge
83.0% vs 67.0%
Recall edge
80.0% vs 63.3%
F1 edge
0.825 vs 0.667
Precision edge
85.1% vs 70.5%
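The F1 figures can be cross-checked against the precision and recall figures, since F1 is the harmonic mean of the two. The short sketch below recomputes both F1 values from the percentages shown above; the function name `f1_score` is illustrative and not part of any published AEGIS tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as published on this page, converted to fractions.
aegis_f1 = f1_score(0.851, 0.800)
base_f1 = f1_score(0.705, 0.633)

print(f"AEGIS F1:    {aegis_f1:.3f}")  # 0.825
print(f"FP8 base F1: {base_f1:.3f}")   # 0.667
```

Both recomputed values agree with the published F1 edge of 0.825 vs 0.667 to three decimal places.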
Full leaderboard
Benchmark board across local and API models.
Sorted by F1. The green-tinted row is AEGIS, the primary local candidate. GLM-5 is the only published model with an incomplete run: 96 requests completed.
AEGIS (Ours*)
Public AEGIS checkpoint artifact aligned to the published benchmark surface against the FP8 Qwen base.
Claude Sonnet 4.6 Thinking
Highest recall and best overall F1 on the full 1000-sample board.
Claude Opus 4.6 Thinking
Frontier reasoning model with strong recall but a heavy false-positive profile.
Qwen3.6 27B FP8 Base
1000-sample fair-run baseline on the shared eval split.
Gemini 3 Flash Preview
Aggressive positive bias drove high recall but weak precision and structured fidelity.
Claude Opus 4.5
Opus 4.5 stayed structurally reliable and landed near the top of the frontier pack on F1.
Gemini 3.1 Pro
Gemini 3.1 Pro leaned recall-heavy and stayed competitive on overall F1.
Claude Opus 4.7
Opus 4.7 traded recall for one of the cleanest precision profiles on the published frontier board.
DeepSeek V4 Pro
Higher recall than the self-hosted candidate, but with a materially noisier decision boundary.
GPT-5.4
GPT-5.4 kept a precise boundary and landed above GPT-5.2 on recall and overall F1.
GLM-5
Only 96 of 100 requests completed; remaining samples exhausted retries.
Claude Sonnet 4.5
Stable JSON producer, but not a leader on full-run classification metrics.
GPT-5.2
GPT-5.2 kept a conservative boundary with strong precision but weaker recall on this detection set.
Download vault
Raw JSONL for manual verification.
Every listed file is a direct model-response artifact. Each model is published as its own JSONL so the benchmark set is simple to browse and compare.
Primary candidate
AEGIS (Ours*)
Public AEGIS checkpoint artifact aligned to the published benchmark surface against the FP8 Qwen base.
Self-hosted baseline
Qwen3.6 27B FP8 Base
1000-sample fair-run baseline on the shared eval split.
API frontier
MiniMax-M2.7
Completed the full 1000-sample evaluation.
API frontier
Claude Sonnet 4.6 Thinking
Highest recall and best overall F1 on the full 1000-sample board.
API frontier
Claude Opus 4.6 Thinking
Frontier reasoning model with strong recall but a heavy false-positive profile.
API frontier
Claude Opus 4.7
Opus 4.7 traded recall for one of the cleanest precision profiles on the published frontier board.
API frontier
Claude Opus 4.5
Opus 4.5 stayed structurally reliable and landed near the top of the frontier pack on F1.
API frontier
Claude Sonnet 4.5
Stable JSON producer, but not a leader on full-run classification metrics.
API frontier
Gemini 3.1 Pro
Gemini 3.1 Pro leaned recall-heavy and stayed competitive on overall F1.
API frontier
Gemini 3 Flash Preview
Aggressive positive bias drove high recall but weak precision and structured fidelity.
API frontier
GPT-5.2
GPT-5.2 kept a conservative boundary with strong precision but weaker recall on this detection set.
API frontier
GPT-5.4
GPT-5.4 kept a precise boundary and landed above GPT-5.2 on recall and overall F1.
API frontier
GLM-5
Only 96 of 100 requests completed; remaining samples exhausted retries.
API frontier
DeepSeek V4 Pro
Higher recall than the self-hosted candidate, but with a materially noisier decision boundary.
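For readers who want to verify a row by hand, the following is a minimal sketch of recomputing the verdict metrics from one model's JSONL. The record schema is not documented on this page, so the field names `label` and `predicted` are assumptions; check them against the downloaded files and adjust as needed. The sample records are toy data, not benchmark output.

```python
import json
from typing import Dict, Iterable


def confusion_counts(
    lines: Iterable[str], label_key: str = "label", pred_key: str = "predicted"
) -> Dict[str, int]:
    """Tally TP/FP/FN/TN from JSONL lines.

    Assumption: each record holds a boolean ground-truth field and a boolean
    model verdict under the given (hypothetical) keys.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        actual = bool(record[label_key])
        predicted = bool(record[pred_key])
        if predicted and actual:
            counts["tp"] += 1
        elif predicted:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def verdict_metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Derive precision, recall, F1, and accuracy from confusion counts."""
    tp, fp, fn, tn = c["tp"], c["fp"], c["fn"], c["tn"]
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy records standing in for a downloaded file.
sample_lines = [
    '{"label": true, "predicted": true}',
    '{"label": true, "predicted": false}',
    '{"label": false, "predicted": true}',
    '{"label": false, "predicted": false}',
    '{"label": true, "predicted": true}',
]
counts = confusion_counts(sample_lines)
metrics = verdict_metrics(counts)
print(counts)
print({name: round(value, 3) for name, value in metrics.items()})
```

To run this against a real download, pass an open file handle in place of `sample_lines`, since iterating a text file yields one line per record.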
OPENSEC benchmark narrative
AEGIS is the strongest self-hosted model on this benchmark board.
Against the shared FP8 Qwen base, AEGIS leads on precision, recall, F1, severity alignment, and CWE alignment. The frontier models still matter as external reference points, but this page now presents a single public checkpoint rather than splitting the results across multiple variants.
