VCA-AI-301: AI & Agentic Security III: Adversarial Capstone

Belt 5/5 · deep specialised

Virtus Academy · Capstone Track

The capstone of the AI track and the bridge between the academy's two security strands. AI-301 takes the substrate-level intuition the student earned in CSA-101 + CSA-201 (instruction-vs-data confusion at the silicon level; W^X / ASLR / canaries / CFI as toggleable mitigations) and pairs it with the language-level metaphors AI-101 + AI-201 named (prompt injection as modern stack-smash; context-window confusion as memory-corruption-at-the-semantic-layer). Students attack Virtus OS on the Tang Nano and the Virtus DVLA agentic chatbot in the same engagement, observe how the attack patterns generalise across the substrate, and produce a final adversarial capstone report. The course where you bring substrate-level instinct to language-level threat modelling.

Total time: ~150 hours

Lecture: ~28 hr

Practical / lab: ~52 hr

Independent practice: ~70 hr

Position: After AI-201 + CSA-201 (both); RE-101 strongly recommended

Prereq: AI-201 + CSA-201; RE-101 recommended

Equipment: AI-201 setup (Pyodide workbench + cloud-GPU pathway carries forward) + Tang Nano 20K (or Tang Primer 25K) + Virtus OS v1 (from CSA-101 capstone) + DVLA testbed (virtus-llm-owasp repo); open-weight model + Claude 3 Sonnet API access for SAE work (Module 4.5); community SAE viewers (Neuronpedia) + TransformerLens for interpretability labs; ~$10-30 student API budget for the SAE/activation-steering modules (see hardware platform · we update this as the kit firms up)

Credential: VCA-AI-301 Certificate of Completion

Register interest. We're not taking enrollments yet. Email interested@virtuscyberacademy.org.

Course Overview

AI-301 is the academy's capstone adversarial-AI course. Students arrive having (a) personally built a complete computing stack from NAND to OS in CSA-101, (b) hardened it against classical memory-corruption attacks in CSA-201, (c) studied OWASP LLM Top 10 + ASI Top 10 in AI-101, and (d) reproduced production-CVE bug classes (CVE-2025-65106 / -68664 / -9556) in AI-201. AI-301 is where these threads converge: students execute exploit chains that span the substrate-language gap.

The chapter's pedagogical thesis: agentic-system security is memory-corruption at the semantic layer. An LLM agent that cannot distinguish system prompt from user input is the language-level cousin of a CPU that cannot distinguish instruction from data. CSA-101 §4.10 + §12.11.1 named the substrate version. AI-101 + AI-201 named the language version. AI-301 is where the two recognitions become one mental model, and the 2024-2026 Anthropic Sparse Autoencoder corpus is what makes the metaphor literal.

The thesis literalized (Module 4.5 NEW). Anthropic's interpretability team has extracted tens of millions of monosemantic features from production-scale models (Claude 3 Sonnet) using sparse autoencoders. Per Towards Monosemanticity (October 2023) and Scaling Monosemanticity (May 2024), these features are multilingual, multimodal, and generalize between concrete and abstract references; safety-relevant features are explicitly identified (deception, sycophancy, bias, dangerous content); features can be used to actively steer model behavior by clamping their activations. This is what turns the substrate↔language metaphor from evocative into experimentally-demonstrable. Students at Module 4.5 literally PERFORM the substrate-side-equivalent operation on a transformer: load Claude 3 Sonnet through the API + a community SAE on Llama-2-7B; identify the "comply with harmful request" feature; clamp its activation to zero (defense) or to +∞ (attack); observe the behavior change. Cousins of memory corruption become real operations on real systems.

The supply-chain cousin (Module 7.5 NEW). The substrate-side "supply chain compromise" module (CSA-101 + CSA-201) has a literal language-side cousin in the fine-tuning-attack research line. Qi et al.'s 2023 paper Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (arxiv 2310.03693) demonstrated GPT-3.5 Turbo could be jailbroken with 10 examples at $0.20 through OpenAI's fine-tuning API. The llm-tuning-safety.github.io corpus + the 2024 survey "Harmful Fine-tuning Attacks and Defenses for LLMs" (arxiv 2409.18169) + persistent backdoor research (P-Trojan, optimized for backdoor persistence across repeated updates) make fine-tuning the language-side equivalent of CVE-2025-68664 LangGrinch pickle deserialization from AI-201. Module 7.5 reproduces a fine-tune-jailbreak end-to-end and then defends against it with Booster (ICLR 2025 oral) + safety-aware fine-tuning recipes.

The Virtus DVLA testbed is the chapter's primary lab environment. Findings to date are all model-intrinsic (no external coordination needed); students explore the 9-model L3-regression baseline and add their own findings to the academy's growing corpus. Some labs pair DVLA exploitation with Virtus-OS-on-Tang-Nano exploitation in the same engagement, to make the cross-substrate generalisation visible.

Position relative to peer offerings. No other curriculum at this level pairs a hardware capstone (Tang Nano + student-authored OS) with an agentic capstone (DVLA + student-authored chatbot). AI-301 is the academy's most differentiated course because the prerequisite stack is uniquely Virtus Academy: only graduates of CSA-101 + CSA-201 + AI-201 have the substrate instincts the chapter assumes.

Pedagogy. AI-301 pushes the academy's three teaching habits to capstone depth. The comparing different systems carries forward as a substrate-vs-language comparison: every module pairs a substrate-level vulnerability (W^X violation; canary bypass; ROP; type confusion) against its language-level analogue (prompt injection; system-prompt leak; tool-action escalation; context-window confusion). The Tool Journal entries add ~8 new tools (DVLA harness, multi-substrate exploit-chain runner, attack-narrative documentation template).

Curriculum Outline

Fourteen modules, ~12 weeks (12 originals + 2 NEW: Module 4.5 SAE / activation-steering + Module 7.5 fine-tuning attacks). Capstone is a 3-track slate matching Belt-5 student archetype diversity.

Module	Topic	Project
1	Re-grounding: substrate vs language vulnerabilities	Side-by-side mapping table; CSA-101 §4.10 + §12.11.1 paired with OWASP LLM01
2	Stack-smash on Virtus OS v1 (substrate primer)	Reproduce a stack-smash on the student's own Tang Nano + Virtus OS
3	Prompt injection on DVLA (language primer)	Reproduce an L3-regression prompt-injection finding on DVLA
4	The metaphor named precisely	Written 1500-word essay: instruction-vs-data thesis at substrate vs context-window confusion at language
4.5 (NEW)	Mechanistic interpretability + the analogy literalized	Hands-on with Anthropic Sparse Autoencoders + activation steering / representation engineering (RepE / ActAdd). Required reading: Anthropic Towards Monosemanticity (Oct 2023); Scaling Monosemanticity (May 2024); the RepE paper (arxiv 2310.01405); the "Activation Steering Compromises LLM Safety" pitfalls paper (2024). Lab: load Claude 3 Sonnet via API + a community SAE on Llama-2-7B; identify a safety-relevant feature; clamp it to 0 (defense) or +∞ (attack); observe behavior change. Tools: Neuronpedia + TransformerLens + PyTorch hook-based activation-steering scaffolds.
5	ROP at the substrate; tool-chain hijack at the language	Build an ROP chain on Virtus OS; build a tool-chain hijack on DVLA; pair
6	Type confusion at substrate; type confusion at language	C-style void* type-confusion exploit; LLM-output-as-untyped-string exploit
7	Side channels: timing at substrate; latency-fingerprint at language	Cache-timing demo on substrate; latency-channel demo on agentic system
7.5 (NEW)	Fine-tuning attacks as supply-chain compromise	Reproduce Qi et al. 2023 (arxiv 2310.03693), 10 examples at $0.20 jailbreaks GPT-3.5 Turbo via the OpenAI fine-tuning API. Required reading: Qi et al. 2023; "Harmful Fine-tuning Attacks and Defenses for LLMs: A Survey" (arxiv 2409.18169); P-Trojan persistent-backdoor research (arxiv 2505.17601); Booster (ICLR 2025 oral). Lab: jailbreak an open-weight model with 10 fine-tune examples; defend with Booster + safety-aware fine-tuning recipes; measure persistence across continued fine-tuning. This is the language-side cousin of CVE-2025-68664 LangGrinch from AI-201. Supply-chain compromise at the weights layer.
8	Defence-in-depth: layered mitigations against each class	Add canaries to Virtus OS; add output-validation to DVLA; measure each
9	Cross-substrate exploit chain (engagement scenario)	Pentest a hypothetical agentic system that calls into a vulnerable substrate
10	Adversarial-machine-learning fundamentals	Adversarial-example crafting; evasion against a deployed image classifier
11	Threat-actor modelling + frontier-safety-framework landscape	Map AI-301 attacks to threat-actor capability tiers; read Anthropic Responsible Scaling Policy v3.0 (effective Feb 24, 2026; capability evaluations in CBRN + cybersecurity + Model Autonomy domains; models re-tested every 3 months for finetuning improvements); DeepMind Frontier Safety Framework (Instrumental Reasoning Levels assessing for ability to bypass oversight or pursue goals covertly; deceptive alignment as risk class); the OpenAI superalignment-disbanded story (Superalignment dissolved May 2024; Mission Alignment dissolved February 2026 after 16 months; Jan Leike resignation framing) as industry context.
12	Capstone, 3-track slate (pick one)	See the Capstone section below for the three-option slate.

How the Course Teaches: Foundational Readings

Christian's The Alignment Problem is now the AI-301 primary narrative anchor, read in full. Its three-section structure (Prophecy / Agency / Normativity) maps exactly to the AI-301 thesis arc: Prophecy → Modules 1-4 substrate-vs-language re-grounding; Agency → Modules 5-7 ROP / tool-chain-hijack / type-confusion; Normativity → Modules 8-12 defense-in-depth + adversarial-ML + threat-actor + capstone. Karpathy's full "Zero to Hero" series + selected Stanford CS336 lectures supply the substrate-companion path; Mitchell continues as supplementary background. At Belt 5 students are also expected to read the full primary-paper corpus named in Modules 4.5 + 7.5 + 11 (Anthropic SAE work; activation steering / RepE pitfalls; Qi et al. 2023 fine-tuning; Anthropic RSP v3.0; DeepMind Frontier Safety Framework).

Sample weave (Christian, The Alignment Problem, Agency section. Intro). Christian opens the Agency section by observing that the alignment problem in its pre-LLM formulations was about reinforcement-learning agents that optimize specified rewards in ways their designers did not intend. The pedagogical move in AI-301 is that the problem generalizes structurally to LLM agents. A tool-using agent that pursues its prompted goal by chaining tool calls in an unexpected sequence is the language-level cousin of an RL agent that finds an unexpected reward-hack. AI-101's LLM06 Excessive Agency and AI-201's tool-chain-hijack module are both sub-cases of Christian's Agency thesis. The student who reads Agency before Module 5 sees the whole tool-calling exploit family as one phenomenon, not as a list of techniques. Module 5 has you build an ROP chain on Virtus OS and a tool-chain hijack on DVLA in the same week, then write the comparison.

Sample weave (Anthropic, Scaling Monosemanticity, May 2024, per transformer-circuits.pub). The Scaling Monosemanticity report extends the team's prior monosemanticity work from toy models to Claude 3 Sonnet. The headline finding is that sparse autoencoders trained on the residual stream of a production-scale transformer recover tens of millions of features - combinations of neurons that map to specific semantic concepts, and that those features are multilingual, multimodal, and abstract enough to fire on both concrete tokens (the literal word "deception") and abstract instances (deceptive behavior in any language). Safety-relevant features explicitly reported include deception, sycophancy, bias, and dangerous content; clamping a feature's activation actively steers the model's behavior. The pedagogical consequence for AI-301 is enormous: the substrate↔language metaphor stops being analogical and becomes literal. Memory corruption at the semantic layer IS a real operation now - activation injection / SAE feature manipulation / steering vector addition are the primitives, and students who read this report perform exactly those primitives in Module 4.5's lab. The lab uses community SAEs on Llama-2-7B (Anthropic's Claude 3 Sonnet SAE is not public) + Anthropic's API for behavioral observation; Neuronpedia hosts the community SAE feature dashboards.

Learning Outcomes

Remember. State the four classical memory-safety properties (W^X / ASLR / canaries / CFI) and their language-level analogues at the agentic-system layer.
Understand. Explain why agentic-system security is memory-corruption at the semantic layer.
Apply. Reproduce a stack-smash against your own Virtus OS v1.
Apply. Reproduce a prompt-injection L3 finding on DVLA across 9 models.
Apply. Construct a multi-substrate exploit chain (substrate ↔ language).
Analyze. Map a published CVE to its substrate-vs-language category and to its threat-actor tier.
Synthesize. Ship the multi-substrate-exploit capstone report + recorded demo.

Hands-On Labs

Lab 2.1: stack-smash on student-authored Virtus OS v1.
Lab 3.1: L3-regression prompt-injection on DVLA.
Lab 5.1: ROP chain on Virtus OS; paired tool-chain hijack on DVLA.
Lab 6.1: type-confusion exploits on both substrates.
Lab 7.1: cache-timing side channel; latency-channel agentic side channel.
Lab 8.1: layered mitigations applied to both substrates; measure cost.
Lab 10.1: adversarial-example crafting against an image classifier.
Lab 12 (capstone): multi-substrate exploit chain.

Capstone: Three-Track Slate

The Belt-5 student population is heterogeneous: some want to ship exploits, some want to ship defenses, some want to ship eval-engineering, some want to publish original research. AI-301 ships a three-track capstone slate so students self-select against archetype. All three tracks have equal certificate weight; pick one.

Track A: Multi-substrate exploit chain (the original default)

End-to-end exploit chain spanning DVLA + Virtus OS on the student's own silicon. 12-page report + 10-minute recorded demo showing the chain. Best for students aiming at red-team engineer / pentest-lead roles at Microsoft AI Red Team / NVIDIA AI Red Team / Anthropic Frontier Red Team / vendor red teams.

Track B (NEW), Interpretability-driven defense

Build an SAE-based attack-signature detector. Use community SAEs on an open-weight model + an activation-monitoring runtime; identify the features that fire on jailbreak attempts; deploy as a runtime defense; evaluate against the Module 4.5 attack lab + a held-out HarmBench subset. 12-page report + working demo. Best for students aiming at Anthropic Alignment Science / DeepMind Frontier Safety / FAIR Safety / academic interpretability research.

Track C (NEW), Frontier-Red-Team-style RSP capability evaluation

Mirror Anthropic's Responsible Scaling Policy v3.0 capability-evaluation protocol against an open-weight model: pick one of the three RSP domains (CBRN / cybersecurity / Model Autonomy); design and execute the evaluation; produce a capability-eval report at the standard Anthropic ships externally. 15-page report + reproducibility package. Best for students aiming at eval-engineering roles at Anthropic / DeepMind / Microsoft AI Red Team / contracting work for AISI / METR.

Two-tier grading (all three tracks)

First, your project must work. Track A: capstone exploit chain works end-to-end on student silicon + DVLA; report + recorded demo submitted. Track B: SAE detector working against the Module 4.5 attack lab; held-out HarmBench evaluation submitted. Track C: capability eval reproducible from the package; report at Anthropic-ship quality submitted. Reports below this threshold do not pass.

Then we score the report on three dimensions (40/30/30). exploit/defense/eval coherence at the level the chosen track (40%) · substrate-language cross-mapping clarity (30%) · report and demo / package quality (30%). B− minimum on Tier 2 for the certificate.

Career Outcomes: Frontier-AI-Safety Employer Pathway Map

AI-301 graduates enter a small, named, increasingly differentiated set of frontier-AI-safety employers. The career-pathway map below is the map a Belt-5 student needs to decide whether to invest the 150 hours of capstone work to pass through one of these gates.

Anthropic Frontier Red Team + Alignment Science + Finetuning + Alignment Stress Testing. Active hiring (2026); the flagship destination for AI-301 graduates pursuing safety-research-track roles. Primary fit: Track A (exploit chain) graduates → Frontier Red Team RSP Evaluations; Track B (interpretability defense) graduates → Alignment Science / Interpretability; Track C (capability eval) graduates → Frontier Red Team RSP Evaluations.
DeepMind Frontier Safety Framework team. Active hiring; broader Google footprint. Safety researchers integrated into core dev teams; deceptive-alignment research; Instrumental Reasoning Levels evaluation. Primary fit: Track B + Track C graduates.
Microsoft AI Red Team. Established 2018; interdisciplinary (security + adversarial ML + responsible AI; 100+ GenAI products red-teamed by October 2024); "diverse-team-first" doctrine. Primary fit: Track A graduates with cybersec depth.
NVIDIA AI Red Team. garak open-source flagship. Senior AI Red Team roles. Primary fit: Track A graduates with strong open-source-tooling track record.
FAIR (Meta AI Research) Safety org. Various interpretability + safety teams; Research Engineer / Research Scientist roles. Primary fit: Track B graduates.
OpenAI safety teams (reduced presence). The Superalignment team disbanded in May 2024; the Mission Alignment team disbanded in February 2026 after 16 months. Jan Leike resigned with the "Safety culture and processes have taken a backseat to shiny products" framing and joined Anthropic. AI-301 graduates remain hireable by remaining OpenAI safety teams but with a diminished safety-team destination relative to two years ago.
Independent / consulting: Lakera + Robust Intelligence + HiddenLayer + Adversa AI + others. Per-vendor hiring; varying team sizes; Track A and Track C fits.
Academic PhD pipeline: UC Berkeley CHAI + MIRI + Stanford CS329H + CMU AI safety + the MATS Research program (the canonical alignment-research career-launch program; MATS alumni placed at Anthropic / DeepMind / OpenAI / academic CHAI groups). Primary fit: Track B + Track C graduates aiming at PhD admissions.

Cross-course bridges (academy-internal).

→ XD strand (future). The chapter graduates onto the academy's adversarial-defence track.
Cross-cut to VCA-RE-101. AI-301's ML-classifier interpretability + adversarial-example crafting transfer to ML-in-malware reverse engineering work; reading binary-RE on a deployed ML classifier is a sub-discipline of binary analysis.
Cross-cut to VCA-CSA-201. The substrate-side mitigation track (W^X / ASLR / canaries / CFI from CSA-201) is reused in AI-301 Module 8 as the foundation the language-side defense recipes parallel.

Tool Journal: AI-301 Originating Entries

~12 tool-journal entries originate in AI-301; the AI-101 + AI-201 corpus continues at capstone depth.

DVLA harness, the academy's test-rig for the daily vulnerable LLM application
Multi-substrate exploit-chain runner. Runs an exploit across DVLA + Virtus OS in one harness
Attack-narrative template. Standard format for adversarial reports
Anthropic SAE viewer / Neuronpedia. Community SAE feature dashboards; canonical interpretability inspection tool. First met Module 4.5.
TransformerLens. Canonical mechanistic-interpretability library; activation hooks, attention-pattern inspection, intervention scaffolds. First met Module 4.5.
PyTorch hook-based activation-steering scaffolds, the academy's reference implementation of ActAdd / RepE / steering vector addition. Enables the Module 4.5 lab. Cross-referenced against the "Activation Steering Compromises LLM Safety" pitfalls paper (2024).
Adversarial-example crafter. Crafts adversarial perturbations; PGD / FGSM
9-model regression runner. Reused from AI-201 with deeper customisation
Booster (ICLR 2025 oral). Safety-aware fine-tuning recipe; Module 7.5 defense lab
Threat-actor capability matrix. Structured threat-modelling tool; updated to map against Anthropic RSP v3.0 + DeepMind Frontier Safety Framework Instrumental Reasoning Levels
Coverage-analyzer for prompt-injection variants
Substrate-language mapping tool, the academy's 30+ row substrate↔language vulnerability map; literalized post-Module-4.5 with SAE-feature-level entries
RSP-style capability-eval scaffold, Track C capstone substrate; mirrors Anthropic's pre-deployment evaluation protocol

Before You Start

Have you completed AI-201 + CSA-201? (If no → both are central prereqs; without CSA-201 you don't have the substrate-mitigation context the chapter assumes.)
Have you reproduced at least one CVE end-to-end (CVE-2025-65106 / -68664 / -9556)? (If no → AI-101 / AI-201 review.)
Do you have your CSA-101 Tang Nano + Virtus OS v1 setup? (If no → rebuild from the CSA-101 capstone; the chapter assumes you keep your own apparatus.)
Are you comfortable with adversarial-disclosure ethics? (If no → SEC-101 ethics module + AI-201 Module 8.)
Can you write at adversarial-report level (12-page, technical+strategic)? (If no → the academy's AI-201 capstone report writing guide.)

Format Prescriptions

Hour budget: ~28 lec hr + ~52 lab hr + ~70 indep hr (= ~150 hr total).

Live

2 sessions/wk × 90 min over 12 weeks.

Night class

1-2 sessions/wk evenings; ~26 weeks.

Bootcamp

40 hr/wk × ~4 weeks intensive. The capstone module needs concentrated time.

Async self-paced

Recorded video + DVLA + Tang Nano kit; AI-API budget guidance; 1:1 tutoring premium for capstone.

High school / homeschool co-op

Generally not recommended; AI-301 is a graduate-level course.

Interested in VCA-AI-301?

Email interested@virtuscyberacademy.org.

Email interested@virtuscyberacademy.org