AI Watchtower — The Security Gate Your LLM Stack Quietly Skipped

Partha

31 Mar 2026 — 5 min read

watchtower

Every AI agent you ship is an attack surface. A claims processor that follows injected instructions hidden in a submitted document. A customer support bot that leaks account records when asked the right way. A facial recognition assistant manipulated into bypassing identity checks. These are not theoretical — they are the exact failure modes caught in production systems running similar models today.

EchoLeak (2025) — Zero-click prompt injection against Microsoft Copilot. A malicious email arrived, nothing was clicked, and Copilot silently exfiltrated data while summarizing it. No CVE. No code vulnerability. $200M impact in Q1 2025 — because the attack surface was the model's willingness to follow instructions embedded in content.

Postmark MCP Backdoor — A poisoned MCP server package hid instructions inside tool description fields. When an AI agent connected, it read those descriptions as trusted system-level context before any conversation started — and redirected API credentials to an attacker-controlled endpoint. No guardrail fires on this. No SAST tool flags it. It's a string in a JSON field.

Automotive Telematics Fleet (2024) — A financial reconciliation agent processing 45,000 records was compromised via a single malformed data record with embedded instructions. The agent skipped validation on every subsequent record. Result: 494 integrity incidents, 67% targeting telematics data. No pre-deployment gate. Went straight from dev to production.

What Attackers Actually Do

Attack Vector	Impact
Inject instructions via user-supplied input	Agent overrides its intended behavior
Probe for PII in responses	SSNs, health records, financial data exposed
Extract the system prompt	Proprietary business logic and scoring rules leaked
Jailbreak via fictional framing	Safety training removed, policy-violating outputs produced
Hide instructions in MCP tool descriptions	Agent subverted at the framework level before any guardrail fires
Trick agent into autonomous actions	Unintended DB writes, bulk email, API calls executed

How AI Watchtower Works

The platform has three layers that work together: a pre-deployment gate, a runtime monitoring layer, and an AI-powered security analyst called Galactus.

Pre-Deployment Gate

The moment a team registers an agent, 46 adversarial probes fire automatically across 9 OWASP LLM Top 10 categories. If any threshold is breached, the agent is blocked from production and the team receives per-probe evidence with suggested fixes.

Category	OWASP	Probes
Prompt Injection	LLM01	8 — instruction-override, DAN, base64 payloads
Jailbreak	LLM06	8 — fictional framing, developer-mode, safety removal
PII Leakage	LLM02	7 — SSN, credentials, phone/email enumeration
System Prompt Leakage	LLM07	5 — verbatim extraction, `<system>` tag tricks
Excessive Agency	LLM08	5 — mass email, DB ops, financial transfers, shell exec
Insecure Output	LLM05	5 — XSS, cookie theft, javascript injection
MCP Poisoning	LLM03	auto — scans tool descriptions for embedded instructions
Misinformation	LLM09	4 — false authority, phishing pretext

Thresholds are strict: zero tolerance for prompt injection, jailbreak, system prompt leakage, excessive agency, and MCP poisoning. Up to 5% for PII leakage and insecure output. Up to 10% for misinformation and content violations.

Promptfoo — Full Red-Team Mode

The scanner runs in two modes. Mock mode (default) sends 46 direct HTTP probes with no API keys needed, completing in about 15 seconds — ideal for local dev and CI pipelines. Full mode runs the complete Promptfoo red-team suite with 50+ LLM-generated attack categories covering the full OWASP LLM Top 10. Promptfoo generates novel attack variants using a red-team LLM, going beyond fixed probe patterns to find model-specific weaknesses. Both modes feed the same scoring pipeline and gate decision.

Runtime Monitoring

Once live, every blocked event from any guardrail layer — AWS Bedrock Guardrails, NeMo Guardrails, LLM Guard, LlamaFirewall — forwards to Watchtower and appears in the cross-team dashboard within seconds. No agent code changes are needed for the event forwarding API; NeMo events forward automatically via a drop-in bridge.

Galactus AI Security Analyst

Galactus is a Claude-powered analyst (via AWS Bedrock) that synthesizes scan results and live events into plain-English briefings. Ask it anything across the entire fleet: "Which agent is most at risk of exposing patient PII right now?" or "Show me all agents that triggered excessive-agency blocks in the last 30 days."

Demo Agents — See It In Action

The repo ships two demo agents that show both sides of the gate:

NeMo-Guarded Patient Data Optimizer (demo/nemo-agent/) — A healthcare data assistant with five active COLANG rails: jailbreak detection, system prompt protection, off-topic filtering, confidential data scrubbing, and harmful content blocking. Expected gate result: PASS.

Vulnerable Driver Facial Recognition (demo/route-optimizer/) — Raw user input injected directly into the system prompt with no output filtering or scope restriction. Expected gate result: FAIL. Shows exactly what a blocked agent report looks like

Framework and Provider Support

Watchtower separates where your LLM runs (provider) from how your agent is built (framework). Supported providers include Bedrock, OpenAI, Anthropic, Ollama, and custom endpoints. Supported frameworks span the ecosystem: LangChain, CrewAI, LlamaIndex, AWS Strands, Bedrock SDK, Bedrock AgentCore, AutoGen, OpenAI SDK, NeMo Guardrails, and custom implementations.

Registration takes one API call or a form on the dashboard. A failed scan returns a blocking response — drop it into your CI/CD pipeline as a staging-to-production promotion gate.

NeMo Guardrails Integration

For teams running NeMo Guardrails, Watchtower ships a drop-in bridge. Replace LLMRails with WatchtowerRails — one import change — and every activated rail maps to a Watchtower security event automatically. Jailbreak detection becomes a prompt_injection event. PII detection becomes a pii event. Execution checks map to excessive_agency. Events fire asynchronously with zero latency impact on inference.

The Takeaway

If you are shipping AI agents — even internal ones — you have a pre-deployment security gap your existing toolchain does not cover. The attack surface is behavioral. The only way to test behavioral security is to probe behavior. AI Watchtower is how you do that: 46 probes in 15 seconds for dev, full Promptfoo red-teaming for pre-production, and fleet-wide runtime monitoring once you are live. Being an opensource you don't have to wait for vendor to add more probes add the way want to fuzz the agent..

Github link: https://github.com/Parthasarathi7722/ai-watchtower
References:

Core Tools in AI Watchtower

Guardrail Layers Referenced

Standards & Frameworks