Methodology

How the diagnostic actually works.

This page details how the diagnostic maps the reliance chain, which failure categories it names beyond hallucination, and where it diverges from standard AI testing. It is written for buyers and technical reviewers who want depth before engaging.

Decision-Reliance Mapping

The diagnostic maps where the answer enters business reliance, not just whether the answer is right.

An AI response is the middle of the chain, not the end of it. The diagnostic traces the path from the input the model receives, through the reasoning and output it produces, to where the response enters the decisions and actions that depend on it.

Input → Retrieval → Interpretation → Output → Decision Signal → Human Reliance → Downstream Action → Business Outcome

Most AI evaluation stops at the output. The diagnostic continues through the reliance chain to where AI behavior meets the business.
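As a minimal sketch, the chain above can be written as an ordered enumeration, which makes the gap between where an evaluation stops and where the business outcome sits explicit. The stage names come from this page; the `Stage` class and the `unexamined_stages` helper are hypothetical names used for illustration only, not part of the diagnostic itself.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of the reliance chain, in order (names taken from this page)."""
    INPUT = 1
    RETRIEVAL = 2
    INTERPRETATION = 3
    OUTPUT = 4
    DECISION_SIGNAL = 5
    HUMAN_RELIANCE = 6
    DOWNSTREAM_ACTION = 7
    BUSINESS_OUTCOME = 8


def unexamined_stages(last_evaluated: Stage) -> list[Stage]:
    """Hypothetical helper: stages that fall after the point where an evaluation stops."""
    return [stage for stage in Stage if stage > last_evaluated]


# An evaluation that stops at the output leaves the second half of the chain unexamined.
for stage in unexamined_stages(Stage.OUTPUT):
    print(stage.name)
```

Under this framing, an evaluation that ends at `Stage.OUTPUT` says nothing about the four stages that follow it.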

Three dimensions

Three dimensions of decision-signal integrity.

Signal: Is the evidence real?

What the evidence available to the model actually supports: where it preserves uncertainty, where it overstates, and where it underweights account-specific facts in favor of general policy language.

Boundary: Should the AI decide this?

What the AI may legitimately recommend, escalate, or stop, and where the workflow currently allows it to act beyond what its reasoning can defensibly support.

Reliance: Will someone act on this?

What a customer, agent, or downstream system could do based on the output, and whether the AI behavior is sound enough to support that action.

Beyond the three dimensions

Two failure categories that shape how decision-signal integrity behaves in production.

The three-dimension framework (Signal, Boundary, and Reliance), introduced on the home page, is the structural model. The diagnostic also examines two failure categories that shape how those dimensions behave in practice.

Systematic bias from training, system prompt configuration, and guardrail design. The predictable response tendencies that look like neutral processing but shape what the workflow recommends across thousands of interactions.

Behavior under sustained pressure. Whether the AI maintains calibrated responses when users push back, claim authority, or apply social pressure, and whether the escalation logic holds up when the AI's confidence is questioned.

In agentic workflows where the model can execute actions rather than only generate text, both failure categories carry materially higher consequence.
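One way to picture the second category, behavior under sustained pressure, is a check over a reviewed transcript that flags turns where the model changed its position after pushback or an authority claim, with no new evidence supplied. This is an illustrative sketch only: the `Turn` record, the move labels, and `unsupported_reversals` are hypothetical, and it assumes a human reviewer has already labeled each turn.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_move: str       # assumed labels: "question", "pushback", "authority_claim", "new_evidence"
    model_position: str  # position taken in the model's reply, as labeled by a human reviewer


# User moves that apply pressure without adding information (assumed label set).
PRESSURE_MOVES = {"pushback", "authority_claim"}


def unsupported_reversals(turns: list[Turn]) -> list[int]:
    """Return indices of turns where the model's position changed in response to pressure alone."""
    flagged = []
    for i in range(1, len(turns)):
        changed = turns[i].model_position != turns[i - 1].model_position
        if changed and turns[i].user_move in PRESSURE_MOVES:
            flagged.append(i)
    return flagged


session = [
    Turn("question", "deny_refund"),
    Turn("pushback", "deny_refund"),            # held position under pushback
    Turn("authority_claim", "approve_refund"),  # reversed after an unverified authority claim
    Turn("new_evidence", "deny_refund"),        # changed again, but new evidence was supplied
]
print(unsupported_reversals(session))
```

A position change that follows new evidence is not flagged, since updating on evidence is the behavior a calibrated system should show.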

Context as a variable

Context window size is not a quality signal. Longer is not safer.

The expansion of context windows across the major model families, from windows of 128K and 200K tokens into the 1M and 2M range, is frequently presented as a capability gain. In operational use, the picture is more complicated. Longer context produces more accumulated material, stronger behavioral mode entrenchment, more opportunities for topic transitions within a session, and more compaction events that compress original nuance into summaries that themselves get deprioritized as new context accumulates. The same model, on the same question, performs differently at turn three of a fresh session than at turn thirty of a long one. Capacity is not calibration.

The same logic applies to the system prompt. System prompts are treated as static rules. In practice, the execution of those rules varies based on conversational context. A prompt instruction that produces the intended behavior in a clean three-turn session executes differently in a forty-turn multi-topic session where the model has accumulated context, established a behavioral role, and compressed earlier material into summaries. The rule still exists. Its effect on the output has drifted. This drift is not captured by benchmarks that evaluate system prompts against short, isolated scenarios.

Several dimensions are controllable: how the system prompt is calibrated to the workflow's specific failure modes, where session boundaries are placed relative to topic shifts, how context window sizing is matched to the characterized degradation curve for this workflow on this model, and what behavioral monitoring runs against production output to flag drift before it enters a decision. None of these have universal defaults. The right configuration depends on the workflow, the model, the topics, and the user population.

Part of what the diagnostic delivers is a characterization of the bifurcation thresholds for the workflow: the specific combinations of conditions under which output quality shifts discontinuously rather than gradually.
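To make the distinction between gradual and discontinuous shifts concrete, here is a minimal sketch that scans quality scores measured at increasing session depths and reports where the score falls sharply between consecutive measurements. Everything here is an assumption for illustration: the scores are invented, the page does not specify how quality is scored, and the `min_drop` cutoff of 0.2 is arbitrary.

```python
def find_discontinuities(scores, min_drop=0.2):
    """Return (turn, drop) pairs where the score falls by at least min_drop
    between consecutive measured turns.

    scores: list of (turn_number, quality_score) pairs, ordered by turn.
    min_drop: hypothetical cutoff separating gradual decline from a sharp shift.
    """
    jumps = []
    for (_, prev_score), (next_turn, next_score) in zip(scores, scores[1:]):
        drop = prev_score - next_score
        if drop >= min_drop:
            jumps.append((next_turn, round(drop, 3)))
    return jumps


# Invented example: quality declines slowly, then shifts sharply between turns 20 and 30.
scores = [(3, 0.92), (10, 0.90), (20, 0.88), (30, 0.55), (40, 0.52)]
print(find_discontinuities(scores))  # → [(30, 0.33)]
```

In practice the conditions would be combinations (session length, topic shifts, compaction events) rather than turn count alone; a single axis is used here to keep the sketch short.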

Necessary, not sufficient

Standard testing answers a different question than the diagnostic.

Hallucination checks, safety reviews, jailbreak tests, prompt evaluations, governance documentation, and benchmark testing all have a place. They do not answer whether the workflow can support the decisions that depend on it.

Standard testing asks | Morum AI asks
Did the system produce an expected answer? | Can the answer support the decision that follows?
Did the model hallucinate? | Did it preserve source weight, uncertainty, and decision boundaries?
Did the response sound safe? | Did careful language weaken the decision signal?
Did the prompt pass evaluation? | Did the workflow hold up under real reliance pressure?

Traditional AI evaluation tests whether the model gets the right answer on a static prompt. This diagnostic tests whether the model's reasoning holds up across a multi-turn workflow under real operational pressure. Those are different problems.

Benchmarks test single prompts in controlled conditions. They miss failure patterns that only emerge across turns: decision-signal drift, confidence persistence, escalation displacement. These are behavioral failures, not accuracy failures.

Using an LLM to evaluate another LLM creates a closed loop. A model exhibiting manufactured uncertainty will be graded as appropriately cautious by another model with the same training bias. This diagnostic is human-led, evidence-weighted, and built to catch what automated testing confirms by default.

How the methodology stays current

A living taxonomy, an adapting protocol, a continuous intelligence feed.

The AI landscape moves fast. Model releases, alignment research, deployment architectures, and regulatory signals shift continuously. A diagnostic methodology that does not account for that pace ages quickly. One that chases every release produces noise instead of rigor. The methodology is built to be both repeatable and adaptive by separating what is durable from what is volatile.

The failure taxonomy is structural. The named patterns are rooted in how language models are architected, trained, and deployed, not in any specific model version. A model trained with RLHF to be helpful will have a sycophancy surface regardless of vendor. A model trained to be cautious will have a manufactured uncertainty surface regardless of generation. New patterns are added when a structurally distinct failure mode emerges, not when an existing pattern manifests in a new way.

The diagnostic protocol adapts on a monthly cadence. The techniques used to elicit and evidence each pattern update when a new model capability ships, when a deployment architecture creates a new surface, when a lab releases a mitigation that closes an existing elicitation path, or when a client engagement reveals a gap.

An intelligence feed runs continuously in the background. Model releases, alignment research, third-party evaluations, regulatory updates, and publicly reported behavioral failures are reviewed against the protocol on a weekly cadence. Most weeks the protocol does not change. The discipline is to know when it should.

The methodology is not static. It is also not reactive to every release cycle. The taxonomy is structural. The protocol adapts. The founder maintains both. That is the value proposition: diagnostic rigor that stays current without chasing the news.

How the taxonomy was built

Observed first. Codified second.

The failure pattern taxonomy was not designed theoretically and then validated. It was built from direct observation.

During the research and development phase, behavioral failure patterns were identified in live AI workflows across infrastructure planning, financial analysis, regulatory correspondence, and technical development. In three separate cases, previously undocumented failure patterns presented during active diagnostic sessions and were not covered by the existing taxonomy. The framework was extended to accommodate what was observed, not the other way around.

The methodology continues to evolve as new failure modes surface in production. The taxonomy is a living diagnostic instrument, not a fixed checklist.

Next step

The diagnostic produces a Decision-Risk Findings Brief that translates this methodology into specific findings for one defined AI workflow.

The execution protocol, including testing sequences, scenario construction, and evaluation criteria, is detailed during the scoping call and tailored to the workflow under review. The full brief structure is described on the home page. A simulated example of the deliverable is available below.