Public benchmark surface

AEGIS

Name: OPENSEC AEGIS Security Benchmark
Creator: OPENSEC

This page packages the full AEGIS story in one surface: the primary checkpoint benchmark, the FP8 Qwen base, and the external frontier models we tested against the same security detection dataset. Every row is backed by downloadable JSONL for manual verification.
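For readers who want to check a row by hand, the following is a minimal sketch of how verdict metrics could be recomputed from a downloaded JSONL file. The field names `label` and `prediction` are assumptions for illustration, not the published schema; inspect the actual file and adjust them before use.

```python
import json


def score_rows(lines):
    """Tally a confusion matrix from JSONL lines and derive verdict metrics.

    Assumes each record carries boolean `label` and `prediction` fields
    (hypothetical names; the real JSONL schema may differ).
    """
    tp = fp = tn = fn = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        label, pred = bool(row["label"]), bool(row["prediction"])
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy records: one true positive, one false negative,
# one true negative, one false positive.
sample = [
    '{"label": true, "prediction": true}',
    '{"label": true, "prediction": false}',
    '{"label": false, "prediction": false}',
    '{"label": false, "prediction": true}',
]
metrics = score_rows(sample)
print(metrics)
```

To score a real download, pass an open file handle instead of the toy list, since iterating a text file yields one line at a time.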


Benchmark thesis

Why AEGIS is the local model to beat

Metric leaders

Best precision on the published board: 85.1%, AEGIS (Ours*)

Best recall across frontier comparisons: 92.0%, Claude Sonnet 4.6 Thinking

Best F1 overall: 0.825, AEGIS (Ours*)

Best severity alignment: 79.0%, AEGIS (Ours*)

Best CWE alignment: 73.0%, AEGIS (Ours*)

Local benchmark focus

AEGIS leads the self-hosted story.

This is the apples-to-apples local comparison that matters on the public board: AEGIS versus the FP8 Qwen base, using the same structured benchmark contract throughout.

Key metrics comparison

Metric-by-metric comparison across the benchmark field

Each panel isolates one metric so you can scan which model actually leads on precision, recall, F1, accuracy, CWE alignment, and severity alignment without mentally untangling mixed bars.

Precision

One ranked scale, all models on the same 0 to 100 line.

1. AEGIS: 85.1%
2. Claude Opus 4.7: 82.8%
3. GPT-5.2: 73.1%
4. Qwen FP8: 70.5%
5. GPT-5.4: 69.4%
6. MiniMax-M2.7: 68.9%
7. Claude Opus 4.5: 63.0%
8. Gemini 3.1 Pro: 60.3%
9. GLM-5: 59.1%
10. Claude Sonnet 4.6: 58.2%
11. Claude Opus 4.6: 56.6%
12. Claude Sonnet 4.5: 55.6%
13. Gemini 3 Flash Preview: 53.1%
14. DeepSeek V4 Pro: 51.4%

Recall

One ranked scale, all models on the same 0 to 100 line.

1. Claude Sonnet 4.6: 92.0%
2. Claude Opus 4.6: 86.0%
3. Gemini 3 Flash Preview: 86.0%
4. AEGIS: 80.0%
5. DeepSeek V4 Pro: 72.0%
6. Gemini 3.1 Pro: 70.0%
7. Claude Opus 4.5: 68.0%
8. Qwen FP8: 63.3%
9. MiniMax-M2.7: 62.0%
10. Claude Sonnet 4.5: 60.0%
11. GLM-5: 56.5%
12. GPT-5.4: 50.0%
13. Claude Opus 4.7: 48.0%
14. GPT-5.2: 38.8%

F1

One ranked scale, all models on the same 0 to 100 line.

1. AEGIS: 82.5%
2. Claude Sonnet 4.6: 71.3%
3. Claude Opus 4.6: 68.3%
4. Qwen FP8: 66.7%
5. Gemini 3 Flash Preview: 65.6%
6. Claude Opus 4.5: 65.4%
7. MiniMax-M2.7: 65.3%
8. Gemini 3.1 Pro: 64.8%
9. Claude Opus 4.7: 60.8%
10. DeepSeek V4 Pro: 60.0%
11. GPT-5.4: 58.1%
12. GLM-5: 57.8%
13. Claude Sonnet 4.5: 57.7%
14. GPT-5.2: 50.7%

Accuracy

One ranked scale, all models on the same 0 to 100 line.

1. AEGIS: 83.0%
2. Claude Opus 4.7: 69.0%
3. Qwen FP8: 67.0%
4. MiniMax-M2.7: 66.0%
5. Claude Opus 4.5: 64.0%
6. GPT-5.4: 64.0%
7. Gemini 3.1 Pro: 62.0%
8. Claude Sonnet 4.6: 59.0%
9. Claude Opus 4.6: 59.0%
10. GPT-5.2: 58.0%
11. GLM-5: 57.3%
12. Claude Sonnet 4.5: 56.0%
13. Gemini 3 Flash Preview: 54.0%
14. DeepSeek V4 Pro: 52.0%

CWE Accuracy

One ranked scale, all models on the same 0 to 100 line.

1. AEGIS: 73.0%
2. Claude Opus 4.7: 53.0%
3. Claude Opus 4.5: 44.0%
4. Qwen FP8: 43.0%
5. MiniMax-M2.7: 40.0%
6. Claude Opus 4.6: 37.0%
7. Claude Sonnet 4.5: 36.0%
8. GPT-5.2: 34.0%
9. GPT-5.4: 31.0%
10. Claude Sonnet 4.6: 28.0%
11. DeepSeek V4 Pro: 24.0%
12. GLM-5: 19.8%
13. Gemini 3.1 Pro: 18.0%
14. Gemini 3 Flash Preview: 14.0%

Severity Accuracy

One ranked scale, all models on the same 0 to 100 line.

1. AEGIS: 79.0%
2. Claude Opus 4.7: 56.0%
3. GPT-5.4: 48.0%
4. Gemini 3.1 Pro: 40.0%
5. Claude Opus 4.6: 30.0%
6. DeepSeek V4 Pro: 29.0%
7. Claude Sonnet 4.6: 28.0%
8. Claude Sonnet 4.5: 27.0%
9. Qwen FP8: 26.0%
10. Gemini 3 Flash Preview: 23.0%
11. GLM-5: 22.9%
12. MiniMax-M2.7: 21.0%
13. Claude Opus 4.5: 18.0%
14. GPT-5.2: 17.0%

Head-to-head interpretation

AEGIS now clears the base on both verdict quality and structure.

Against the same FP8 Qwen base, AEGIS improves accuracy, precision, recall, F1, CWE alignment, and severity alignment. That makes it the strongest self-hosted benchmark result on this page, not just the cleanest structured formatter.

Metric overlay

Accuracy: AEGIS 83.0%, Base 67.0%

Precision: AEGIS 85.1%, Base 70.5%

Recall: AEGIS 80.0%, Base 63.3%

F1: AEGIS 82.5%, Base 66.7%

CWE match: AEGIS 73.0%, Base 43.0%

Severity match: AEGIS 79.0%, Base 26.0%

Structured wins vs FP8 base

CWE alignment

73.0%

AEGIS beat the base on 35 samples and lost on 5.

Severity alignment

60.0%

Severity was the strongest structured gain: 51 checkpoint wins versus 17 baseline wins.
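The win and loss counts above can be read as a paired comparison over the same samples. The sketch below assumes one plausible definition, which the page does not spell out: a win is a sample where the checkpoint's structured field matches the reference and the baseline's does not, and a loss is the reverse. The input lists are toy data, not benchmark results.

```python
def paired_wins(candidate_hits, baseline_hits):
    """Count per-sample wins, losses, and ties between two models.

    Each argument is a list of booleans, one per sample, saying whether that
    model's structured field (for example CWE or severity) matched the
    reference. The win/loss definition here is an assumption.
    """
    if len(candidate_hits) != len(baseline_hits):
        raise ValueError("both models must be scored on the same samples")
    wins = sum(1 for c, b in zip(candidate_hits, baseline_hits) if c and not b)
    losses = sum(1 for c, b in zip(candidate_hits, baseline_hits) if b and not c)
    ties = len(candidate_hits) - wins - losses
    return wins, losses, ties


# Toy example with five shared samples.
candidate = [True, True, False, True, False]
baseline = [False, True, False, False, True]
result = paired_wins(candidate, baseline)
print(result)  # (2, 1, 2)
```

Under this definition, wins minus losses equals the difference between the two models' match counts on the shared samples.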

Verdict gains vs FP8 base

Accuracy edge

83.0% vs 67.0%

Recall edge

80.0% vs 63.3%

F1 edge

0.825 vs 0.667

Precision edge

85.1% vs 70.5%
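The F1 figures can be cross-checked against the published precision and recall, since F1 is the harmonic mean of the two. A minimal sketch using only numbers from this page (the helper name `f1_from` is illustrative):

```python
def f1_from(precision_pct, recall_pct):
    """Return F1 as the harmonic mean of precision and recall given in percent."""
    p, r = precision_pct / 100, recall_pct / 100
    return 2 * p * r / (p + r) if (p + r) else 0.0


aegis_f1 = round(f1_from(85.1, 80.0), 3)  # AEGIS precision and recall
base_f1 = round(f1_from(70.5, 63.3), 3)   # FP8 Qwen base precision and recall
print(aegis_f1, base_f1)  # 0.825 0.667
```

Both values agree with the F1 edge reported above.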

Full leaderboard

Benchmark board across local and API models.

Sorted by F1. AEGIS, the primary local candidate, is the first row (green-tinted on the live page). GLM-5 remains the only published exception, with an incomplete run at 96 completed requests.

| Model | Tier | Accuracy | Precision | Recall | F1 | JSON | CWE | Download | Notes |
|---|---|---|---|---|---|---|---|---|---|
| AEGIS (Ours*) | Primary candidate | 83.0% | 85.1% | 80.0% | 0.825 | 100.0% | 73.0% | JSONL | Public AEGIS checkpoint artifact aligned to the published benchmark surface against the FP8 Qwen base. |
| Claude Sonnet 4.6 Thinking | API frontier | 59.0% | 58.2% | 92.0% | 0.713 | 96.0% | 28.0% | JSONL | Highest recall on the board and best F1 among the API frontier models on the full 1000-sample board. |
| Claude Opus 4.6 Thinking | API frontier | 59.0% | 56.6% | 86.0% | 0.683 | 99.0% | 37.0% | JSONL | Frontier reasoning model with strong recall but a heavy false-positive profile. |
| Qwen3.6 27B FP8 Base | Self-hosted baseline | 67.0% | 70.5% | 63.3% | 0.667 | 98.0% | 43.0% | JSONL | 1000-sample fair-run baseline on the shared eval split. |
| Gemini 3 Flash Preview | API frontier | 54.0% | 53.1% | 86.0% | 0.656 | 99.0% | 14.0% | JSONL | Aggressive positive bias drove high recall but weak precision and structured fidelity. |
| Claude Opus 4.5 | API frontier | 64.0% | 63.0% | 68.0% | 0.654 | 100.0% | 44.0% | JSONL | Opus 4.5 stayed structurally reliable and landed near the top of the frontier pack on F1. |
| MiniMax-M2.7 | API frontier | 66.0% | 68.9% | 62.0% | 0.653 | 99.0% | 40.0% | JSONL | Completed the full 1000-sample evaluation. |
| Gemini 3.1 Pro | API frontier | 62.0% | 60.3% | 70.0% | 0.648 | 100.0% | 18.0% | JSONL | Gemini 3.1 Pro leaned recall-heavy and stayed competitive on overall F1. |
| Claude Opus 4.7 | API frontier | 69.0% | 82.8% | 48.0% | 0.608 | 100.0% | 53.0% | JSONL | Opus 4.7 traded recall for one of the cleanest precision profiles on the published frontier board. |
| DeepSeek V4 Pro | API frontier | 52.0% | 51.4% | 72.0% | 0.600 | 100.0% | 24.0% | JSONL | Higher recall than the self-hosted candidate, but with a materially noisier decision boundary. |
| GPT-5.4 | API frontier | 64.0% | 69.4% | 50.0% | 0.581 | 100.0% | 31.0% | JSONL | GPT-5.4 kept a precise boundary and landed above GPT-5.2 on recall and overall F1. |
| GLM-5 | API frontier | 57.3% | 59.1% | 56.5% | 0.578 | 97.9% | 19.8% | JSONL | Only 96 of 100 requests completed; remaining samples exhausted retries. |
| Claude Sonnet 4.5 | API frontier | 56.0% | 55.6% | 60.0% | 0.577 | 100.0% | 36.0% | JSONL | Stable JSON producer, but not a leader on full-run classification metrics. |
| GPT-5.2 | API frontier | 58.0% | 73.1% | 38.8% | 0.507 | 100.0% | 34.0% | JSONL | GPT-5.2 kept a conservative boundary with strong precision but weaker recall on this detection set. |

Rows: 1000

JSONL size: 5.2 MB

Validity: 100.0%

OPENSEC benchmark narrative

AEGIS is the strongest self-hosted model on this benchmark board.

Against the shared FP8 Qwen base, AEGIS leads on precision, recall, F1, severity alignment, and CWE alignment. Across the full board it also leads on every published metric except recall, where Claude Sonnet 4.6 Thinking is ahead. The frontier models still matter as external reference points, but the benchmark page now makes the checkpoint story explicit without splitting it into multiple public variants.
