TITAN AI — Production Reliability · 24/7 Engineering Brief

01 · CIRCUIT BREAKERS

Never retry-storm a throttled cloud

When Azure, AWS, or GCP throttles a single call, every TITAN agent running in parallel sees it within milliseconds. Instead of 26 agents hammering the failing API in unison, the breaker opens after 5 failures in a 60-second window and blocks calls for 30 seconds, then sends 3 probe calls to decide whether the cloud has recovered. The pattern is borrowed directly from Netflix Hystrix and the .NET resilience library Polly.

What customers feel

A cloud slowdown turns a 12-minute scan into a 13-minute scan — not a 45-minute retry storm that finally times out.

Error codes

E-0103 Azure throttled · E-0113 AWS throttled · E-0123 GCP throttled

Config

5-failure threshold · 60s rolling window · 30s recovery · 3 probes in half-open
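As a rough illustration of the Config line above, here is a minimal sketch of a closed / open / half-open breaker. The class name, method names, and the injectable clock are hypothetical and not taken from the TITAN codebase. It also assumes that all 3 probes must succeed before the breaker closes and that a single failed probe reopens it; the brief does not spell out either rule.

```python
import time
from collections import deque


class CircuitBreaker:
    """Illustrative breaker: 5 failures / 60s window / 30s recovery / 3 probes."""

    def __init__(self, threshold=5, window=60.0, recovery=30.0, probes=3,
                 clock=time.monotonic):
        self.threshold, self.window = threshold, window
        self.recovery, self.probes = recovery, probes
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self._reset()

    def _reset(self):
        self.opened_at = None    # None means the breaker is closed
        self.failures.clear()
        self.probes_sent = 0
        self.probe_successes = 0

    def _open(self, now):
        self._reset()
        self.opened_at = now

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at < self.recovery:
            return "open"
        return "half-open"

    def allow(self):
        """Return True if a call may be sent right now."""
        state = self.state
        if state == "closed":
            return True
        if state == "open":
            return False
        if self.probes_sent < self.probes:  # half-open: limited probes only
            self.probes_sent += 1
            return True
        return False

    def record_success(self):
        if self.state == "half-open":
            self.probe_successes += 1
            if self.probe_successes >= self.probes:  # assumption: all probes must pass
                self._reset()

    def record_failure(self):
        now = self.clock()
        state = self.state
        if state == "open":
            return               # late failure from a call sent before opening
        if state == "half-open":
            self._open(now)      # assumption: one failed probe reopens the breaker
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.threshold:
            self._open(now)
```

A caller would check `allow()` before each cloud call and report the outcome with `record_success()` or `record_failure()`. Passing a fake `clock` makes the timing behaviour easy to test without sleeping.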

02 · GRACEFUL DEGRADATION

One bad agent can’t crash the whole scan

If any single agent raises an exception mid-scan, the platform preserves every finding, action, and compliance check it already collected and keeps running the other 25 agents. The customer receives a report marked “SCOUT: partial results (crashed at resource 4,218 of 50,000)” instead of a blank page. This is the lesson every ops team absorbed from the July 2024 CrowdStrike incident: never let one bug cascade into a full-system outage.

What customers feel

95% of the report on a bad day, 100% on a normal day — never zero.

Error codes

E-0200 agent crash · E-0103 breaker open · E-0205 kill switch

Audit log

Every degraded scan writes an agent.scan.degraded entry to the tamper-evident chain with the exact reason.
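The isolation idea in this section can be sketched in a few lines. This is an assumption-laden illustration, not the platform's code: `run_agents`, the report shape, and the two toy agents are hypothetical, and agents run sequentially here for brevity rather than in parallel.

```python
def run_agents(agents):
    """Run each agent and keep whatever a crashing agent collected before it failed.

    `agents` maps an agent name to a generator function that yields findings.
    """
    report = {}
    for name, agent in agents.items():
        findings = []
        entry = {"status": "complete", "findings": findings}
        try:
            for finding in agent():
                findings.append(finding)
        except Exception as exc:  # contain the crash to this one agent
            entry["status"] = "partial"
            entry["error"] = (
                f"E-0200 agent crash: {exc!r} after {len(findings)} findings"
            )
        report[name] = entry
    return report


def scout():
    """Toy agent that crashes after two findings."""
    yield "open storage bucket"
    yield "unencrypted disk"
    raise RuntimeError("SDK returned a malformed page")


def sage():
    """Toy agent that completes normally."""
    yield "policy drift"
```

Calling `run_agents({"SCOUT": scout, "SAGE": sage})` marks SCOUT as partial with its two findings intact, while SAGE still completes.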

03 · SLSA LEVEL 3 · SIGSTORE

Cryptographic proof the bundle came from us

Every TITAN AI bundle (titanai-v*.tar.gz) is built in an isolated GitHub Actions runner and signed via Sigstore Fulcio + Rekor transparency log. Customers — especially FedRAMP, CMMC, DoD, and regulated financial institutions — can verify cryptographically that what they downloaded is byte-for-byte what we built, from the exact commit, by the exact workflow, in an environment they can audit.

Verify command

slsa-verifier verify-artifact titanai-v1.0.0.tar.gz --provenance-path titanai-v1.0.0.tar.gz.intoto.jsonl --source-uri github.com/Riz7886/TITAN-AI

What customers feel

A standard Federal / CMMC vendor intake checkbox gets ticked in seconds — not weeks of negotiation.

Error code

E-0401 on bundle SHA-256 mismatch (the installer rejects the bundle).
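For readers who want to see what a digest check of this kind involves, here is a minimal sketch. The function names are hypothetical and this is not the installer's actual code. It only compares SHA-256 digests and does not replace the provenance check performed by slsa-verifier.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bundles do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bundle(path, expected_hex):
    """Return the digest if it matches; raise with the E-0401 code otherwise."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"E-0401 bundle SHA-256 mismatch: expected {expected_hex}, got {actual}"
        )
    return actual
```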

04 · EMERGENCY KILL SWITCH

Stop every agent in under a second

Every regulated customer has asked us the same question during procurement: “What’s my panic button?” TITAN AI gives three equivalent triggers — a file touch at ~/.titanai/STOP, an environment variable TITAN_KILL_SWITCH=1, or simple Ctrl-C. The platform checks at every phase boundary and inside every hot loop (~500ms latency max). Partial findings are always preserved.

Audit context

Answers the emergency-stop control question that comes up under HIPAA §164.308(a)(7) and in PCI DSS assessor interviews.

Error code

E-0205 scan stopped by kill switch (informational, not a failure).
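The three triggers described above could be polled along these lines. This is a sketch under stated assumptions: the function names, the exception class, and the return shape are hypothetical. Only the STOP file path, the TITAN_KILL_SWITCH variable, and the E-0205 code come from the brief.

```python
import os
from pathlib import Path

STOP_FILE = Path.home() / ".titanai" / "STOP"


class KillSwitchTriggered(Exception):
    """Raised when a stop was requested (E-0205, informational)."""


def kill_requested(stop_file=STOP_FILE, env=os.environ):
    """True if the STOP file exists or TITAN_KILL_SWITCH=1 is set."""
    return stop_file.exists() or env.get("TITAN_KILL_SWITCH") == "1"


def scan(resources, stop_file=STOP_FILE, env=os.environ):
    """Check the kill switch before every resource and keep partial findings."""
    findings = []
    try:
        for resource in resources:
            if kill_requested(stop_file, env):
                raise KillSwitchTriggered("E-0205 scan stopped by kill switch")
            findings.append(f"checked {resource}")
    except (KillSwitchTriggered, KeyboardInterrupt):
        # Ctrl-C arrives as KeyboardInterrupt; all three triggers end the same way.
        return findings, "stopped"
    return findings, "complete"
```

Checking once per loop iteration means the stop latency is bounded by the time spent on a single resource, which is the reason for polling inside hot loops and not only at phase boundaries.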

05 · CRASH-RESUMABLE SCANS

Resume from resource 48,001, not from zero

Large customers (50,000+ cloud resources, 3-4 hour scans) can’t afford to restart from scratch on every hiccup. Every agent writes an append-only JSONL checkpoint at ~/.titanai/checkpoints/<scan>-<agent>.jsonl as resources complete. On crash + re-run with the same scan ID, the agent skips everything already processed and picks up at the next resource.

What customers feel

A 4-hour scan SLA stays achievable even through transient cloud or agent failures.

Error codes

E-0206 checkpoint corrupt (safe: re-scans the one bad resource) · E-0207 resume in progress
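A minimal sketch of the append-only checkpoint idea follows. The record shape (`resource_id`) and both function names are assumptions made for illustration; the brief only specifies that the file is JSONL and append-only. A corrupt line is skipped, so the affected resource is simply scanned again.

```python
import json
from pathlib import Path


def load_done(path):
    """Return (set of finished resource IDs, whether the last line was torn)."""
    done = set()
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        try:
            done.add(json.loads(line)["resource_id"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # corrupt line: that one resource is re-scanned
    torn = bool(text) and not text.endswith("\n")
    return done, torn


def scan_with_checkpoint(resource_ids, path, process):
    """Process resources, skipping any already recorded in the checkpoint."""
    done, torn = load_done(path)
    processed = []
    with path.open("a") as fh:
        if torn:
            fh.write("\n")  # close off a half-written line before appending
        for rid in resource_ids:
            if rid in done:
                continue
            process(rid)
            fh.write(json.dumps({"resource_id": rid}) + "\n")
            fh.flush()      # make the record visible before moving on
            processed.append(rid)
    return processed
```

The record is written only after `process(rid)` returns, so a crash mid-resource leaves that resource unrecorded and it is picked up on the next run.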

06 · TAMPER-EVIDENT AUDIT LOG

Hash-chained history a regulator can verify

Every scan start, scan completion, fix approval, and degradation event is written to an append-only log where each entry carries the SHA-256 hash of the previous entry. Any modification (inserting a fake fix, deleting a real one, reordering events) breaks every downstream hash and is caught by audit_chain.verify(). This is the same hash-linking idea that underpins the Amazon QLDB journal and Certificate Transparency logs.
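To make the hash-chaining concrete, here is a small in-memory sketch. It is not the contents of `core/audit_chain.py`: the entry layout, the genesis value, and the `append` / `verify` signatures are assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash, event):
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain, event):
    """Append an event, linking it to the hash of the entry before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    )


def verify(chain):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(chain):
        recomputed = _digest(entry["prev"], entry["event"])
        if entry["prev"] != prev_hash or entry["hash"] != recomputed:
            return i
        prev_hash = entry["hash"]
    return None
```

Editing, deleting, or reordering any entry makes `verify` return the position where the chain first stops matching. A chain like this cannot by itself detect truncation of the most recent entries; that needs the latest hash to be recorded somewhere outside the log.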

COMPLIANCE MAPPING

Framework	Requirement	TITAN AI control
HIPAA	§164.312(b) Audit controls	Hash-chained audit log · `core/audit_chain.py`
PCI DSS 4.0	Req 10.3 Protect audit logs from destruction and unauthorized modification	Append-only SHA-256 chain · fsync on every write
SOC 2	CC7.2 Detect security events	Every agent execute() bracketed by scan.started · scan.completed · scan.degraded events
FedRAMP	AU-10 Non-repudiation	Cryptographic chain break is provably detectable by the verifier
GDPR	Art 30 Records of processing	Each scan event captures agent, scan_id, subscription scope — zero customer data in the log body

Error code

E-0208 audit chain broken (tampering detected or write failure).

07 · SUPPORTING RELIABILITY STACK

Twelve more safety nets layered under the Core 6

Every item below is already in the product and runs on every scan. No opt-in required for the customer, no upcharge for any tier.

RFM · 7-day offline grace

License server unreachable? Cached entitlements keep agents running in audit-only mode for 7 days. Same pattern as CrowdStrike Falcon Reduced Functionality Mode. (E-0003)

Watchdog / supervisor

Auto-restarts CONDUCTOR up to 5 times in a 10-minute window with exponential backoff. Same pattern as a supervisord-style process supervisor. Writes crash reports to ~/.titanai/crashes/.
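The restart policy can be sketched as follows. The `supervise` function, the 1-second base delay, and the doubling schedule are assumptions for illustration; the brief states only the limit of 5 restarts in a 10-minute window and that backoff is exponential.

```python
import time
from collections import deque


def supervise(run, max_restarts=5, window=600.0, base_delay=1.0,
              clock=time.monotonic, sleep=time.sleep):
    """Call run() until it returns, restarting on a crash with exponential backoff.

    Re-raises the last exception once max_restarts restarts fall inside one window.
    """
    restarts = deque()  # timestamps of restarts still inside the window
    while True:
        try:
            return run()
        except Exception:
            now = clock()
            while restarts and now - restarts[0] > window:
                restarts.popleft()
            if len(restarts) >= max_restarts:
                raise  # restart budget exhausted: give up and surface the crash
            sleep(base_delay * 2 ** len(restarts))
            restarts.append(now)
```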

Opt-in daily heartbeat

~200-byte POST to /api/heartbeat with agent version, last-scan-ok flag, crash count. Zero customer cloud data. Disable with TITAN_TELEMETRY=false.

Per-agent wall-clock budget

Every agent has a tuned timeout budget (SCOUT 30m, FORGE 35m, SAGE 5m, etc.). Exceed it and the agent returns partial results instead of running forever. (E-0201)

Dead letter queue

Fix operations that fail their retry budget land in ~/.titanai/dlq/YYYY-MM-DD.jsonl for human review — not silently dropped. Same pattern as AWS SQS DLQ. Critical for FORGE. (E-0209)
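A dead letter queue of this shape might look like the sketch below. The `apply_fix` name, the retry count of 3, and the record fields are hypothetical; only the dated JSONL file name and the E-0209 code come from the brief. The retry loop omits backoff to keep the example short.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def apply_fix(fix, operation, dlq_dir, retries=3):
    """Try operation(fix) up to `retries` times; park it in the DLQ on exhaustion.

    Returns True on success, False if the fix was written to the dead letter queue.
    """
    last_error = None
    for _ in range(retries):
        try:
            operation(fix)
            return True
        except Exception as exc:
            last_error = repr(exc)
    dlq_dir.mkdir(parents=True, exist_ok=True)
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    record = {"code": "E-0209", "fix": fix, "error": last_error}
    with (dlq_dir / f"{day}.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return False
```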

Per-agent memory cap

Background watchdog samples RSS every 5s; trips at 2GB default (configurable). Prevents a buggy agent from OOM-killing the customer’s box. (E-0211)

Structured JSONL logging

Every log line becomes machine-parseable JSON tagged with scan_id + agent. Customers ship straight to Splunk / Datadog / Elastic without parsing.
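One way to produce log lines of this kind with the Python standard library is sketched here. The field names other than scan_id and agent, and the `make_logger` helper, are assumptions rather than the product's actual schema.

```python
import json
import logging
import sys


class JsonlFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            # scan_id and agent are supplied per call through `extra=...`
            "scan_id": getattr(record, "scan_id", None),
            "agent": getattr(record, "agent", None),
            "msg": record.getMessage(),
        })


def make_logger(name, stream=sys.stderr):
    """Build a logger that writes JSONL to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonlFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.info("checked %s", "vm-01", extra={"scan_id": "scan-42", "agent": "SCOUT"})` then emits one JSON object that a log shipper can forward without a parsing rule.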

Proactive token-bucket limiter

Pairs with circuit breakers: breakers REACT to failures, token buckets PROACTIVELY space calls so we leave headroom for the customer’s own cloud traffic.
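For contrast with the reactive breaker, a token bucket can be sketched like this. The class and parameter names are hypothetical, and the brief does not state TITAN's real rates or burst sizes.

```python
import time


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Setting `rate` below the provider's published limit is what leaves headroom for the customer's own traffic; the caller waits or defers work whenever `try_acquire()` returns False.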

Versioned bundles + channels

stable / canary / edge channels. Customer pins exact version for rollback: titanai-run.sh --version v1.0.0. Canary promotion via KV flip, no code deploy.

Structured error codes

Every failure emits a stable E-NNNN code with docs at titanaisec.com/errors.html. Customer pastes the code, support knows instantly.

Nightly integration test

Real Azure sandbox spin-up + 26-agent scan + teardown every night at 07:00 UTC. Catches SDK API breaking changes before a customer does.

12-point engine integrity check

Runs in 10 seconds before any scan — verifies the platform’s own state (hashes, bundle sig, config sanity) before touching the customer’s cloud.

CONDUIT ticket-dispatcher retries

TITAN CONDUIT inherits every pattern above. Ticket create fails? Exponential backoff + DLQ + token-bucket rate limit. Datadog/Jira API throttle? Circuit breaker + graceful degradation preserve the finding and retry on the next scan. Zero tickets ever lost, zero retry storms.

⚡ EVERY FEATURE ABOVE IS INCLUDED IN EVERY TIER

No upcharge, no enterprise-only gating, no “premium support” tier. The reliability guarantees that a $30,000/yr Launch customer relies on are the same ones a $499,000/yr Banking or Government customer receives — because the platform itself can’t tell them apart, and we believe reliability is table stakes, not an upsell.
