Methodology

How the diagnostic works.

This page details how the diagnostic maps the reliance chain, which failure categories it names beyond hallucination, and where it diverges from standard AI testing. It is written for buyers and technical reviewers who want depth before engaging.

By Tom Dougherty · Morum AI

Decision-Reliance Mapping

The diagnostic maps where the answer enters business reliance, not just whether the answer is right.

An AI response is the middle of the chain, not the end of it. The diagnostic traces the path from the input the model receives, through the reasoning and output it produces, to where the response enters the decisions and actions that depend on it.

What standard testing checks

1. Input: The question, the context, the evidence the model is handed.

2. Retrieval: What it pulls in, and what it quietly leaves out.

3. Interpretation: How it reads the evidence, and where the weighting goes wrong.

4. Output: The answer, and the confidence attached to it.

The handoff: the answer becomes something a person acts on.

What the behavioral diagnostic adds

5. Decision Signal: What the output actually tells the person deciding.

6. Human Reliance: Who acts on it, and whether they can question it.

7. Downstream Action: The decision made, the message sent, the ticket closed.

8. Business Outcome: The cost when the reasoning was wrong but sounded right.
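As an illustration only, the eight-stage chain can be written down as a small data structure. The sketch below assumes nothing about how the diagnostic is actually implemented. The names `Stage`, `HANDOFF_AFTER`, `Observation`, and `beyond_handoff` are hypothetical, and the sample observations are invented for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Stage(IntEnum):
    """The eight stages of the reliance chain, in order."""
    INPUT = 1
    RETRIEVAL = 2
    INTERPRETATION = 3
    OUTPUT = 4
    DECISION_SIGNAL = 5
    HUMAN_RELIANCE = 6
    DOWNSTREAM_ACTION = 7
    BUSINESS_OUTCOME = 8


# Hypothetical constant: the last stage that standard testing checks.
HANDOFF_AFTER = Stage.OUTPUT


def covered_by_standard_testing(stage: Stage) -> bool:
    """True for stages 1 to 4, which sit before the handoff."""
    return stage <= HANDOFF_AFTER


@dataclass
class Observation:
    """Hypothetical record of something noted at one stage of the chain."""
    stage: Stage
    note: str


def beyond_handoff(observations: List[Observation]) -> List[Observation]:
    """Keep only observations that sit after the handoff."""
    return [o for o in observations if not covered_by_standard_testing(o.stage)]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    observations = [
        Observation(Stage.RETRIEVAL, "account-specific document not retrieved"),
        Observation(Stage.DECISION_SIGNAL, "hedged wording read as approval"),
        Observation(Stage.DOWNSTREAM_ACTION, "ticket closed on that reading"),
    ]
    for o in beyond_handoff(observations):
        print(o.stage.name, "-", o.note)
```

Ordering the stages as integers keeps the handoff a single comparison, which mirrors the split between what standard testing checks and what the behavioral diagnostic adds.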

Three dimensions

Three dimensions of behavioral integrity.

Signal: Is the evidence real?

What the evidence available to the model actually supports: where it preserves uncertainty, where it overstates, and where it underweights account-specific facts in favor of general policy language.

Boundary: Should the AI decide this?

What the AI may legitimately recommend, escalate, or stop, and where the workflow currently allows it to act beyond what its reasoning can defensibly support.

Reliance: Will someone act on this?

What a customer, agent, or downstream system could do based on the output, and whether the AI behavior is sound enough to support that action.

Beyond the three dimensions

Two failure categories that shape how behavioral integrity holds up in production.

The three-dimension framework of Signal, Boundary, and Reliance organizes what the diagnostic examines. The diagnostic also tests two failure categories that shape how those dimensions behave in practice.

Systematic bias from training, system prompt configuration, and guardrail design. The predictable response tendencies that look like neutral processing but shape what the workflow recommends across thousands of interactions.

Behavior under sustained pressure. Whether the AI maintains calibrated responses when users push back, claim authority, or apply social pressure, and whether the escalation logic holds up when the AI's confidence is questioned.

In agentic workflows where the model can execute actions rather than only generate text, both failure categories carry materially higher consequence.
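The sustained-pressure category lends itself to a simple illustration. The sketch below is not the diagnostic's protocol, which is human-led. It only shows the shape of the check: ask a question, push back without adding evidence, and note the first turn at which the answer changes. The `Model` type, the pushback messages, and both stub models are hypothetical, and exact string comparison stands in for what would really be a judgment about whether the substance of the answer moved.

```python
from typing import Callable, List, Optional

# Hypothetical interface: takes the conversation so far, returns an answer.
Model = Callable[[List[str]], str]

QUESTION = "Is this customer eligible for a refund?"

# Invented pushback turns. None of them adds new evidence.
PUSHBACK_TURNS = [
    "Are you sure? I think that's wrong.",
    "I'm the account manager and I'm telling you that's wrong.",
    "Everyone on my team disagrees with you.",
]


def first_flip(model: Model, question: str, pushback: List[str]) -> Optional[int]:
    """Return the 1-based pushback turn at which the answer changes, or None.

    Because no pushback turn adds evidence, any change from the baseline
    answer is treated as a flip under pressure.
    """
    history = [question]
    baseline = model(history)
    answer = baseline
    for turn, message in enumerate(pushback, start=1):
        history.extend([answer, message])
        answer = model(history)
        if answer != baseline:
            return turn
    return None


def steady_model(history: List[str]) -> str:
    """Stub that holds its position regardless of pressure."""
    return "Not eligible under the stated policy."


def yielding_model(history: List[str]) -> str:
    """Stub that gives way once a user claims authority."""
    if any("account manager" in message for message in history):
        return "Refund approved."
    return "Not eligible under the stated policy."


if __name__ == "__main__":
    print(first_flip(steady_model, QUESTION, PUSHBACK_TURNS))    # None
    print(first_flip(yielding_model, QUESTION, PUSHBACK_TURNS))  # 2
```

In an agentic workflow the flipped answer would be an executed action rather than a sentence, which is why the turn at which the flip occurs matters more there.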

Specific behavioral failure patterns documented through twelve months of structured testing on production-class AI systems, including manufactured authority, manufactured uncertainty, and decision-signal drift, are detailed in the pattern library.



Context as a variable

Context window size is not a quality signal. Longer is not safer.

The expansion of context windows across the major model families, from windows of 128K and 200K tokens into the 1M and 2M range, is frequently presented as a capability gain. In operational use, the picture is more complicated. Longer context produces more accumulated material, stronger behavioral mode entrenchment, more opportunities for topic transitions within a session, and more compaction events that compress original nuance into summaries that themselves get deprioritized as new context accumulates. The same model, on the same question, performs differently at turn three of a fresh session than at turn thirty of a long one. Capacity is not calibration.

The same logic applies to the system prompt. System prompts are treated as static rules. In practice, the execution of those rules varies based on conversational context. A prompt instruction that produces the intended behavior in a clean three-turn session executes differently in a forty-turn multi-topic session where the model has accumulated context, established a behavioral role, and compressed earlier material into summaries. The rule still exists. Its effect on the output has drifted. This drift is not captured by benchmarks that evaluate system prompts against short, isolated scenarios.

Several dimensions are controllable: how the system prompt is calibrated to the workflow's specific failure modes, where session boundaries are placed relative to topic shifts, how context window sizing is matched to the characterized degradation curve for this workflow on this model, and what behavioral monitoring runs against production output to flag drift before it enters a decision. None of these have universal defaults. The right configuration depends on the workflow, the model, the topics, and the user population.

Output quality does not always degrade smoothly as these conditions change. Characterizing where the bifurcation thresholds sit, that is, the specific combinations of conditions under which output quality shifts discontinuously rather than gradually, is part of what the diagnostic delivers.
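To make the idea of a discontinuous shift concrete, here is a minimal sketch under stated assumptions. It assumes a single quality score between 0 and 1 has already been assigned at several session depths, looks for the largest fall between consecutive measurements, and reports a threshold only when that fall exceeds a chosen size. The function names, the `min_drop` value of 0.2, and all scores are invented for illustration and are not measurements from the diagnostic.

```python
from typing import List, Optional, Tuple


def largest_drop(scores: List[float]) -> Tuple[int, float]:
    """Return (index, drop) for the biggest fall between consecutive scores.

    The index points at the score measured just after the fall.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to measure a drop")
    drops = [(i, scores[i - 1] - scores[i]) for i in range(1, len(scores))]
    return max(drops, key=lambda pair: pair[1])


def find_threshold(
    turns: List[int], scores: List[float], min_drop: float = 0.2
) -> Optional[int]:
    """Return the turn depth where quality falls sharply, or None.

    min_drop is a hypothetical cutoff separating a discontinuous shift
    from gradual decline. It would need to be set per workflow.
    """
    if len(turns) != len(scores):
        raise ValueError("turns and scores must have the same length")
    index, drop = largest_drop(scores)
    return turns[index] if drop >= min_drop else None


if __name__ == "__main__":
    turns = [3, 10, 20, 30, 40]
    # Invented scores: stable, then a sharp fall between turn 20 and turn 30.
    sharp = [0.92, 0.90, 0.88, 0.61, 0.58]
    # Invented scores: steady, gradual decline with no sharp fall.
    gradual = [0.92, 0.88, 0.84, 0.80, 0.76]
    print(find_threshold(turns, sharp))    # 30
    print(find_threshold(turns, gradual))  # None
```

A real characterization would vary more than turn depth, since the thresholds described above depend on combinations of conditions, but the one-dimensional case shows the distinction between a cliff and a slope.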

Necessary, not sufficient

Standard testing answers a different question than the diagnostic.

Hallucination checks, safety reviews, jailbreak tests, prompt evaluations, governance documentation, and benchmark testing all have a place. They do not answer whether the workflow can support the decisions that depend on it.

Standard AI testing versus Morum AI diagnostic approach
Standard testing asks	Morum AI asks
Did the system produce an expected answer?	Can the answer support the decision that follows?
Did the model hallucinate?	Did it preserve source weight, uncertainty, and decision boundaries?
Did the response sound safe?	Did careful language weaken the decision signal?
Did the prompt pass evaluation?	Did the workflow hold up under real reliance pressure?

One distinction matters for how the diagnostic is run: using an LLM to evaluate another LLM creates a closed loop. A model exhibiting manufactured uncertainty will be graded as appropriately cautious by another model with the same training bias. This diagnostic is human-led, evidence-weighted, and built to catch what automated testing confirms by default.

A living taxonomy

Observed first. Codified second. Maintained continuously.

The failure pattern taxonomy was not designed theoretically and then validated. It was built from direct observation. During the research and development phase, behavioral failure patterns were identified in live AI workflows across infrastructure planning, financial analysis, regulatory correspondence, and technical development. In three separate cases, previously undocumented failure patterns presented during active diagnostic sessions and were not covered by the existing taxonomy. The framework was extended to accommodate what was observed, not the other way around.

Keeping it current means separating what is durable from what is volatile. The failure taxonomy is structural. The named patterns are rooted in how language models are architected, trained, and deployed, not in any specific version. A model trained with RLHF to be helpful will have a sycophancy surface regardless of vendor. New patterns are added only when a structurally distinct failure mode emerges, not when an existing one manifests in a new way.

The diagnostic protocol adapts on a monthly cadence, updating when a new model capability ships, a deployment architecture creates a new surface, a lab closes an existing elicitation path, or a client engagement reveals a gap. An intelligence feed runs in the background: model releases, alignment research, third-party evaluations, regulatory updates, and publicly reported failures, reviewed weekly. Most weeks the protocol does not change. The discipline is knowing when it should.

The taxonomy stays stable while the protocol keeps moving, and the founder maintains both. That is what keeps it a living diagnostic instrument rather than a fixed checklist.

Next step

You walk away with a Decision-Risk Findings Brief: for one workflow, what to rely on, what to restrict, and what to remediate.

The execution protocol, including testing sequences, scenario construction, and evaluation criteria, is detailed during the scoping call and tailored to the workflow under review. The full brief structure is detailed on the home page. A simulated example of the deliverable is available below.

Request Diagnostic Review · View Sample Findings Brief