OPENSEC
Public benchmark surface

AEGIS

This page packages the full AEGIS story in one surface: the primary checkpoint benchmark, the FP8 Qwen base, and the external frontier models we tested against the same security detection dataset. Every row is backed by downloadable JSONL for manual verification.

Benchmark thesis

Why AEGIS is the local model to beat

Metric leader
85.1%

Best precision on the published board

AEGIS (Ours*)

Metric leader
92.0%

Best recall across frontier comparisons

Claude Sonnet 4.6 Thinking

Metric leader
0.825

Best F1 overall

AEGIS (Ours*)

Metric leader
79.0%

Best severity alignment

AEGIS (Ours*)

Metric leader
73.0%

Best CWE alignment

AEGIS (Ours*)

Local benchmark focus

AEGIS leads the self-hosted story.

This is the apples-to-apples local comparison that matters on the public board: AEGIS versus the FP8 Qwen base, using the same structured benchmark contract throughout.

Key metrics comparison

Metric-by-metric comparison across the benchmark field

Each panel isolates one metric so you can scan which model actually leads on precision, recall, F1, accuracy, CWE alignment, and severity alignment without mentally untangling mixed bars.
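
For readers who want to recompute the verdict panels by hand, the four classification metrics follow the usual confusion-matrix definitions. The sketch below assumes binary verdicts (True meaning "vulnerable"); it is not the benchmark's own scoring code, and the function name and input shape are illustrative only.

```python
from typing import Sequence


def verdict_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> dict[str, float]:
    """Standard binary-classification metrics from paired labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0

    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

CWE alignment and severity alignment are per-field match rates rather than confusion-matrix metrics, so they are not covered by this sketch.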

Precision

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    AEGIS                   85.1%
#2    Claude Opus 4.7         82.8%
#3    GPT-5.2                 73.1%
#4    Qwen FP8                70.5%
#5    GPT-5.4                 69.4%
#6    MiniMax-M2.7            68.9%
#7    Claude Opus 4.5         63.0%
#8    Gemini 3.1 Pro          60.3%
#9    GLM-5                   59.1%
#10   Claude Sonnet 4.6       58.2%
#11   Claude Opus 4.6         56.6%
#12   Claude Sonnet 4.5       55.6%
#13   Gemini 3 Flash Preview  53.1%
#14   DeepSeek V4 Pro         51.4%

Recall

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    Claude Sonnet 4.6       92.0%
#2    Claude Opus 4.6         86.0%
#3    Gemini 3 Flash Preview  86.0%
#4    AEGIS                   80.0%
#5    DeepSeek V4 Pro         72.0%
#6    Gemini 3.1 Pro          70.0%
#7    Claude Opus 4.5         68.0%
#8    Qwen FP8                63.3%
#9    MiniMax-M2.7            62.0%
#10   Claude Sonnet 4.5       60.0%
#11   GLM-5                   56.5%
#12   GPT-5.4                 50.0%
#13   Claude Opus 4.7         48.0%
#14   GPT-5.2                 38.8%

F1

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    AEGIS                   82.5%
#2    Claude Sonnet 4.6       71.3%
#3    Claude Opus 4.6         68.3%
#4    Qwen FP8                66.7%
#5    Gemini 3 Flash Preview  65.6%
#6    Claude Opus 4.5         65.4%
#7    MiniMax-M2.7            65.3%
#8    Gemini 3.1 Pro          64.8%
#9    Claude Opus 4.7         60.8%
#10   DeepSeek V4 Pro         60.0%
#11   GPT-5.4                 58.1%
#12   GLM-5                   57.8%
#13   Claude Sonnet 4.5       57.7%
#14   GPT-5.2                 50.7%

Accuracy

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    AEGIS                   83.0%
#2    Claude Opus 4.7         69.0%
#3    Qwen FP8                67.0%
#4    MiniMax-M2.7            66.0%
#5    Claude Opus 4.5         64.0%
#6    GPT-5.4                 64.0%
#7    Gemini 3.1 Pro          62.0%
#8    Claude Sonnet 4.6       59.0%
#9    Claude Opus 4.6         59.0%
#10   GPT-5.2                 58.0%
#11   GLM-5                   57.3%
#12   Claude Sonnet 4.5       56.0%
#13   Gemini 3 Flash Preview  54.0%
#14   DeepSeek V4 Pro         52.0%

CWE Accuracy

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    AEGIS                   73.0%
#2    Claude Opus 4.7         53.0%
#3    Claude Opus 4.5         44.0%
#4    Qwen FP8                43.0%
#5    MiniMax-M2.7            40.0%
#6    Claude Opus 4.6         37.0%
#7    Claude Sonnet 4.5       36.0%
#8    GPT-5.2                 34.0%
#9    GPT-5.4                 31.0%
#10   Claude Sonnet 4.6       28.0%
#11   DeepSeek V4 Pro         24.0%
#12   GLM-5                   19.8%
#13   Gemini 3.1 Pro          18.0%
#14   Gemini 3 Flash Preview  14.0%

Severity Accuracy

One ranked scale, all models on the same 0 to 100 line.

Rank  Model                   Score
#1    AEGIS                   79.0%
#2    Claude Opus 4.7         56.0%
#3    GPT-5.4                 48.0%
#4    Gemini 3.1 Pro          40.0%
#5    Claude Opus 4.6         30.0%
#6    DeepSeek V4 Pro         29.0%
#7    Claude Sonnet 4.6       28.0%
#8    Claude Sonnet 4.5       27.0%
#9    Qwen FP8                26.0%
#10   Gemini 3 Flash Preview  23.0%
#11   GLM-5                   22.9%
#12   MiniMax-M2.7            21.0%
#13   Claude Opus 4.5         18.0%
#14   GPT-5.2                 17.0%

Head-to-head interpretation

AEGIS now clears the base on both verdict quality and structure.

Against the same FP8 Qwen base, AEGIS improves accuracy, precision, recall, F1, CWE alignment, and severity alignment. That makes it the strongest self-hosted benchmark result on this page, not just the cleanest structured formatter.

Metric overlay

Metric          AEGIS   Base
Accuracy        83.0%   67.0%
Precision       85.1%   70.5%
Recall          80.0%   63.3%
F1              82.5%   66.7%
CWE match       73.0%   43.0%
Severity match  79.0%   26.0%

Structured wins vs FP8 base

CWE alignment

73.0%

AEGIS beat the base on 35 samples and lost on 5.

Severity alignment

60.0%

Severity was the strongest structured gain: 51 checkpoint wins versus 17 baseline wins.
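
One way to read the win and loss counts above: a "win" is a sample where AEGIS matched the reference field and the base did not, and a "loss" is the reverse. That definition is an assumption rather than something the page states, and the per-sample boolean inputs below are hypothetical. A minimal sketch of such a paired tally:

```python
from typing import Sequence


def paired_tally(candidate_ok: Sequence[bool], baseline_ok: Sequence[bool]) -> dict[str, int]:
    """Count samples where exactly one of two models got a field right."""
    if len(candidate_ok) != len(baseline_ok):
        raise ValueError("both models must be scored on the same samples")

    wins = sum(1 for c, b in zip(candidate_ok, baseline_ok) if c and not b)
    losses = sum(1 for c, b in zip(candidate_ok, baseline_ok) if b and not c)
    ties = len(candidate_ok) - wins - losses  # both right or both wrong

    return {"wins": wins, "losses": losses, "ties": ties, "net": wins - losses}
```

Under this reading, the net value (wins minus losses) equals the difference in the two models' match counts on the shared samples, which gives a quick consistency check against the headline percentages.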

Verdict gains vs FP8 base

Accuracy edge

83.0% vs 67.0%

Recall edge

80.0% vs 63.3%

F1 edge

0.825 vs 0.667

Precision edge

85.1% vs 70.5%
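
The F1 edge can be cross-checked from the published precision and recall, since F1 is their harmonic mean. The short sketch below uses only figures shown on this page; the helper name is illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, published F1) as listed on this page.
published = {
    "AEGIS": (0.851, 0.800, 0.825),
    "Qwen FP8 base": (0.705, 0.633, 0.667),
}

for name, (p, r, f1) in published.items():
    print(f"{name}: recomputed F1 = {f1_from(p, r):.3f}, published = {f1:.3f}")
# AEGIS: recomputed F1 = 0.825, published = 0.825
# Qwen FP8 base: recomputed F1 = 0.667, published = 0.667
```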

Full leaderboard

Benchmark board across local and API models.

Sorted by F1. The green-tinted row is AEGIS, the primary local candidate. GLM-5 remains the only published exception with an incomplete run at 96 completed requests.

Primary candidate

AEGIS (Ours*)

Acc 83.0% | Prec 85.1% | Rec 80.0% | F1 0.825 | JSON 100.0% | CWE 73.0%

Public AEGIS checkpoint artifact aligned to the published benchmark surface against the FP8 Qwen base.

API frontier

Claude Sonnet 4.6 Thinking

Acc 59.0% | Prec 58.2% | Rec 92.0% | F1 0.713 | JSON 96.0% | CWE 28.0%

Highest recall on the full 1000-sample board and the best F1 among the API frontier models.

API frontier

Claude Opus 4.6 Thinking

Acc 59.0% | Prec 56.6% | Rec 86.0% | F1 0.683 | JSON 99.0% | CWE 37.0%

Frontier reasoning model with strong recall but a heavy false-positive profile.

Self-hosted baseline

Qwen3.6 27B FP8 Base

Acc 67.0% | Prec 70.5% | Rec 63.3% | F1 0.667 | JSON 98.0% | CWE 43.0%

1000-sample fair-run baseline on the shared eval split.

API frontier

Gemini 3 Flash Preview

Acc 54.0% | Prec 53.1% | Rec 86.0% | F1 0.656 | JSON 99.0% | CWE 14.0%

Aggressive positive bias drove high recall but weak precision and structured fidelity.

API frontier

Claude Opus 4.5

Acc 64.0% | Prec 63.0% | Rec 68.0% | F1 0.654 | JSON 100.0% | CWE 44.0%

Opus 4.5 stayed structurally reliable and landed near the top of the frontier pack on F1.

API frontier

MiniMax-M2.7

Acc 66.0% | Prec 68.9% | Rec 62.0% | F1 0.653 | JSON 99.0% | CWE 40.0%

Completed the full 1000-sample evaluation.

API frontier

Gemini 3.1 Pro

Acc 62.0% | Prec 60.3% | Rec 70.0% | F1 0.648 | JSON 100.0% | CWE 18.0%

Gemini 3.1 Pro leaned recall-heavy and stayed competitive on overall F1.

API frontier

Claude Opus 4.7

Acc 69.0% | Prec 82.8% | Rec 48.0% | F1 0.608 | JSON 100.0% | CWE 53.0%

Opus 4.7 traded recall for one of the cleanest precision profiles on the published frontier board.

API frontier

DeepSeek V4 Pro

Acc 52.0% | Prec 51.4% | Rec 72.0% | F1 0.600 | JSON 100.0% | CWE 24.0%

Higher recall than the self-hosted baseline (72.0% vs 63.3%), but with the lowest precision on the board (51.4%).

API frontier

GPT-5.4

Acc 64.0% | Prec 69.4% | Rec 50.0% | F1 0.581 | JSON 100.0% | CWE 31.0%

GPT-5.4 kept a precise boundary and landed above GPT-5.2 on recall and overall F1.

API frontier

GLM-5

Acc 57.3% | Prec 59.1% | Rec 56.5% | F1 0.578 | JSON 97.9% | CWE 19.8%

Only 96 of 100 requests completed; remaining samples exhausted retries.

API frontier

Claude Sonnet 4.5

Acc 56.0% | Prec 55.6% | Rec 60.0% | F1 0.577 | JSON 100.0% | CWE 36.0%

Stable JSON producer, but not a leader on full-run classification metrics.

API frontier

GPT-5.2

Acc 58.0% | Prec 73.1% | Rec 38.8% | F1 0.507 | JSON 100.0% | CWE 34.0%

GPT-5.2 kept a conservative boundary with strong precision but weaker recall on this detection set.

Download vault

Raw JSONL for manual verification.

Every listed file is a direct model-response artifact. Each model is published as its own JSONL so the benchmark set is simple to browse and compare.
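
As a starting point for manual verification, the validity figure on each download card can be approximated by parsing the file line by line. The page does not document the JSONL schema, so the sketch below rests on two assumptions: each line is a JSON object with a hypothetical `response` field holding the model's raw text, and "validity" means that text parses as a JSON object. Adjust the field name to whatever the downloaded files actually contain.

```python
import json
from typing import Iterable


def validity_rate(lines: Iterable[str], field: str = "response") -> float:
    """Share of rows whose `field` parses as a JSON object (hypothetical schema)."""
    total = 0
    valid = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        try:
            row = json.loads(line)
            payload = json.loads(row[field])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed row, missing field, or non-string payload
        if isinstance(payload, dict):
            valid += 1
    return valid / total if total else 0.0
```

A typical call would pass an open file handle, for example `validity_rate(open("aegis.jsonl", encoding="utf-8"))`, where the file name is a placeholder for the downloaded artifact.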

Primary candidate

AEGIS (Ours*)

Rows: 1000
JSONL size: 3.6 MB
Validity: 100.0%

Public AEGIS checkpoint artifact aligned to the published benchmark surface against the FP8 Qwen base.

Download JSONL

Self-hosted baseline

Qwen3.6 27B FP8 Base

Rows: 1000
JSONL size: 2.9 MB
Validity: 98.0%

1000-sample fair-run baseline on the shared eval split.

Download JSONL

API frontier

MiniMax-M2.7

Rows: 1000
JSONL size: 2.9 MB
Validity: 99.0%

Completed the full 1000-sample evaluation.

Download JSONL

API frontier

Claude Sonnet 4.6 Thinking

Rows: 1000
JSONL size: 2.9 MB
Validity: 96.0%

Highest recall on the full 1000-sample board and the best F1 among the API frontier models.

Download JSONL

API frontier

Claude Opus 4.6 Thinking

Rows: 1000
JSONL size: 2.9 MB
Validity: 99.0%

Frontier reasoning model with strong recall but a heavy false-positive profile.

Download JSONL

API frontier

Claude Opus 4.7

Rows: 1000
JSONL size: 3.1 MB
Validity: 100.0%

Opus 4.7 traded recall for one of the cleanest precision profiles on the published frontier board.

Download JSONL

API frontier

Claude Opus 4.5

Rows: 1000
JSONL size: 3.0 MB
Validity: 100.0%

Opus 4.5 stayed structurally reliable and landed near the top of the frontier pack on F1.

Download JSONL

API frontier

Claude Sonnet 4.5

Rows: 1000
JSONL size: 2.9 MB
Validity: 100.0%

Stable JSON producer, but not a leader on full-run classification metrics.

Download JSONL

API frontier

Gemini 3.1 Pro

Rows: 1000
JSONL size: 2.7 MB
Validity: 100.0%

Gemini 3.1 Pro leaned recall-heavy and stayed competitive on overall F1.

Download JSONL

API frontier

Gemini 3 Flash Preview

Rows: 1000
JSONL size: 2.8 MB
Validity: 99.0%

Aggressive positive bias drove high recall but weak precision and structured fidelity.

Download JSONL

API frontier

GPT-5.2

Rows: 1000
JSONL size: 3.3 MB
Validity: 100.0%

GPT-5.2 kept a conservative boundary with strong precision but weaker recall on this detection set.

Download JSONL

API frontier

GPT-5.4

Rows: 1000
JSONL size: 3.3 MB
Validity: 100.0%

GPT-5.4 kept a precise boundary and landed above GPT-5.2 on recall and overall F1.

Download JSONL

API frontier

GLM-5

Rows: 96
JSONL size: 2.6 MB
Validity: 97.9%

Only 96 of 100 requests completed; remaining samples exhausted retries.

Download JSONL

API frontier

DeepSeek V4 Pro

Rows: 1000
JSONL size: 5.2 MB
Validity: 100.0%

Higher recall than the self-hosted baseline (72.0% vs 63.3%), but with the lowest precision on the board (51.4%).

Download JSONL

Verification note. Each published download is separated by model so the raw outputs stay easy to inspect.

OPENSEC benchmark narrative

AEGIS is the strongest self-hosted model on this benchmark board.

Against the shared FP8 Qwen base, AEGIS leads on accuracy, precision, recall, F1, severity alignment, and CWE alignment. Across the full board it ranks first on every metric except recall, where Claude Sonnet 4.6 Thinking leads at 92.0% and AEGIS places fourth at 80.0%. The frontier models still matter as external reference points, but the benchmark page now makes the checkpoint story explicit without splitting it into multiple public variants.