venturebeat
Karpathy’s March of Nines shows why 90% AI reliability isn’t even close to enough

“When you get a demo and something works 90% of the time, that’s just the first nine.” — Andrej Karpathy

The “March of Nines” frames a common production reality: a strong demo gets you to roughly 90% reliability, and each additional nine often requires a comparable amount of engineering effort. For enterprise teams, the distance between “usually works” and “operates like dependable software” determines adoption.

The compounding math behind the March of Nines

“Every single nine is the same amount of work.” — Andrej Karpathy

Agentic workflows compound failure. A typical enterprise flow might include intent parsing, context retrieval, planning, one or more tool calls, validation, formatting, and audit logging. If a workflow has n steps and each step succeeds independently with probability p, end-to-end success is approximately p^n.

In a 10-step workflow, small per-step error rates compound into large end-to-end failure rates. Correlated outages (auth, rate limits, connectors) will dominate unless you harden shared dependencies.

| Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | What this means in practice |
|---|---|---|---|---|
| 90.00% | 34.87% | 65.13% | ~6.5 interruptions/day | Prototype territory; most workflows get interrupted. |
| 99.00% | 90.44% | 9.56% | ~1 every 1.0 days | Fine for a demo, but interruptions are still frequent in real use. |
| 99.90% | 99.00% | 1.00% | ~1 every 10.0 days | Still feels unreliable because misses remain common. |
| 99.99% | 99.90% | 0.10% | ~1 every 3.3 months | This is where it starts to feel like dependable enterprise-grade software. |

Define reliability as measurable SLOs

“It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” — Andrej Karpathy

Teams achieve higher nines by turning reliability into measurable objectives, then investing in controls that reduce variance.
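The compounding table above follows directly from the p^n formula; as a sanity check, a few lines of Python (standard library only, all names illustrative) reproduce its first columns:

```python
# End-to-end success of an n-step workflow where each step
# succeeds independently with probability p.
def workflow_success(p: float, n: int = 10) -> float:
    return p ** n

for p in (0.90, 0.99, 0.999, 0.9999):
    success = workflow_success(p)
    failure = 1.0 - success
    # at 10 workflows/day, expected interruptions per day
    per_day = 10 * failure
    print(f"p={p:.2%}  p^10={success:.2%}  interruptions/day≈{per_day:.2f}")
```

Running it confirms the table's headline numbers, e.g. 90% per-step success yields only about 34.87% end-to-end success over ten steps.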
Start with a small set of SLIs that describe both model behavior and the surrounding system:

- Workflow completion rate (success or explicit escalation).
- Tool-call success rate within timeouts, with strict schema validation on inputs and outputs.
- Schema-valid output rate for every structured response (JSON/arguments).
- Policy compliance rate (PII, secrets, and security constraints).
- p95 end-to-end latency and cost per workflow.
- Fallback rate (safer model, cached data, or human review).

Set SLO targets per workflow tier (low/medium/high impact) and manage an error budget so experiments stay controlled.

Nine levers that reliably add nines

1) Constrain autonomy with an explicit workflow graph

Reliability rises when the system has bounded states and deterministic handling for retries, timeouts, and terminal outcomes.

- Model calls sit inside a state machine or a DAG, where each node defines allowed tools, max attempts, and a success predicate.
- Persist state with idempotent keys so retries are safe and debuggable.

2) Enforce contracts at every boundary

Most production failures start as interface drift: malformed JSON, missing fields, wrong units, or invented identifiers.

- Use JSON Schema/protobuf for every structured output and validate server-side before any tool executes.
- Use enums and canonical IDs, and normalize time (ISO-8601 + timezone) and units (SI).

3) Layer validators: syntax, semantics, business rules

Schema validation catches formatting errors. Semantic and business-rule checks prevent plausible answers that break systems.

- Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins by ID when available.
- Business rules: approvals for write actions, data residency constraints, and customer-tier constraints.

4) Route by risk using uncertainty signals

High-impact actions deserve higher assurance.
Risk-based routing turns uncertainty into a product feature.

- Use confidence signals (classifiers, consistency checks, or a second-model verifier) to decide routing.
- Gate risky steps behind stronger models, additional verification, or human approval.

5) Engineer tool calls like distributed systems

Connectors and dependencies often dominate failure rates in agentic systems.

- Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.
- Version tool schemas and validate tool responses to prevent silent breakage when APIs change.

6) Make retrieval predictable and observable

Retrieval quality determines how grounded your application will be. Treat it like a versioned data product with coverage metrics.

- Track empty-retrieval rate, document freshness, and hit rate on labeled queries.
- Ship index changes with canaries, so you know if something will fail before it fails.
- Apply least-privilege access and redaction at the retrieval layer to reduce leakage risk.

7) Build a production evaluation pipeline

The later nines depend on finding rare failures quickly and preventing regressions.

- Maintain an incident-driven golden set from production traffic and run it on every change.
- Run shadow mode and A/B canaries with automatic rollback on SLI regressions.

8) Invest in observability and operational response

Once failures become rare, the speed of diagnosis and remediation becomes the limiting factor.

- Emit traces/spans per step, store redacted prompts and tool I/O with strong access controls, and classify every failure into a taxonomy.
- Use runbooks and “safe mode” toggles (disable risky tools, switch models, require human approval) for fast mitigation.

9) Ship an autonomy slider with deterministic fallbacks

Fallible systems need supervision, and production software needs a safe way to dial autonomy up over time.
Treat autonomy as a knob, not a switch, and make the safe path the default.

- Default to read-only or reversible actions; require explicit confirmation (or approval workflows) for writes and irreversible operations.
- Build deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or escalation to human review when confidence is low.
- Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperature, and tighten timeouts during incidents.
- Design resumable handoffs: persist state, show the plan/diff, and let a reviewer approve and resume from the exact step with an idempotency key.

Implementation sketch: a bounded step wrapper

A small wrapper around each model/tool step converts unpredictability into policy-driven control: strict validation, bounded retries, timeouts, telemetry, and explicit fallbacks.

```python
def run_step(name, attempt_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all retries under one span
    span = start_span(name)
    mode = None
    for attempt in range(1, max_attempts + 1):
        try:
            # bound latency so one step can’t stall the workflow
            with deadline(timeout_s):
                out = attempt_fn(mode=mode)
            # gate: schema + semantic + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient: retry with jitter to avoid retry storms
            span.log({"attempt": attempt, "err": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry in “safer” mode (lower temp / stricter prompt),
            # and re-validate that attempt on the next loop iteration
            span.log({"attempt": attempt, "err": str(e)})
            mode = "safer"
    # fallback: keep system safe when retries are exhausted
    metric("step_fallback", name)
    return EscalateToHuman(reason=f"{name} failed")
```

Why enterprises insist on the later nines

Reliability gaps translate into business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI experienced at least one negative consequence, and nearly one-third reported consequences tied to AI inaccuracy. These outcomes drive demand for stronger measurement, guardrails, and operational controls.

Closing checklist

- Pick a top workflow, define its completion SLO, and instrument terminal status codes.
- Add contracts and validators around every model output and tool input/output.
- Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).
- Route high-impact actions through higher-assurance paths (verification or approval).
- Turn every incident into a regression test in your golden set.

The nines arrive through disciplined engineering: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops.

Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.
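As a companion to the run_step sketch above, here is a minimal, standard-library-only illustration of the “contracts + validators” checklist item: a layered gate (syntax, then semantics, then business rules) for a hypothetical refund tool call. The field names, the `ord_` ID convention, and the auto-approval limit are all invented for illustration; a function like this would slot in as the validate_fn argument of run_step.

```python
import json

class ValidationError(Exception):
    pass

def validate_refund_call(raw: str, *, max_refund_cents: int = 50_000) -> dict:
    """Layered gate for a hypothetical refund tool call."""
    # 1) Syntax: payload must be valid JSON with the expected fields and types.
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValidationError(f"malformed JSON: {e}") from e
    for field, typ in (("order_id", str), ("amount_cents", int), ("currency", str)):
        if not isinstance(out.get(field), typ):
            raise ValidationError(f"bad or missing field: {field}")
    # 2) Semantics: canonical IDs and numeric bounds.
    if not out["order_id"].startswith("ord_"):
        raise ValidationError("order_id is not a canonical ID")
    if out["amount_cents"] <= 0:
        raise ValidationError("refund amount must be positive")
    # 3) Business rules: write actions above a threshold need human approval.
    if out["amount_cents"] > max_refund_cents:
        raise ValidationError("refund exceeds auto-approval limit")
    return out
```

Because every rejection raises ValidationError, the wrapper above automatically retries such a step in safer mode or escalates, which is exactly the syntax → semantics → business-rules layering the third lever describes.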
