If an LLM Extracts the Inputs, Is Your Deterministic Score Really Deterministic? Stopping Provenance Laundering

No. A scoring function that consumes whatever values an LLM hands it is deterministic in name only: the LLM's judgment launders straight through the "deterministic" gate. You close the hole with three rules (host-verified FACT sourcing, FACT-only scoring, and an asymmetric penalty in which bad signals are penalized regardless of provenance while good signals score only when verified) plus multi-round adversarial testing.

← hexisteme · notes · June 23, 2026

The load-bearing trick is an asymmetric-penalty mechanism: an unverified input can only ever lower a score, never raise it. Bad signals are penalized regardless of where they came from (so you can't dodge a penalty by routing the bad news through a weak source), while good signals are credited only when they carry FACT provenance. We hardened this through three rounds of adversarial review, with the adversarial reviewer's satisfaction score climbing 58 → 71 → 96 as each round peeled back a deeper laundering channel. The decisive case was a candidate that scored 94/ADOPT on 9 web-sourced (unverified) signals plus a single real FACT, which the fixed gate correctly flipped to AVOID.

The trap: "the LLM only extracts, Python decides" is not enough

A common and sensible architecture for trustworthy automation is to split labor: a language model reads messy sources (web pages, docs, API responses) and extracts structured inputs, then a plain deterministic function scores those inputs. The appeal is obvious — the model never gets to invent the verdict, so the verdict is reproducible and auditable. Teams describe this as "the LLM extracts, the code decides," and treat the output as deterministic.

The problem is that determinism of the function says nothing about the integrity of its inputs. If the scoring code consumes any value that is present, then the model's judgment re-enters through the input channel and the "deterministic gate" becomes a laundering channel for exactly the judgment you tried to remove. The function f(x) is perfectly deterministic; it is x that the model controls. Calling the result deterministic is a category error unless you also gate what x is allowed to be.

Concretely, imagine a generic scoring pipeline that rates a candidate (a library to adopt, a vendor to onboard, a data record to act on) on a 0–100 scale and emits ADOPT / HOLD / AVOID. An LLM fills the input fields by reading sources. The day you ship it, the score is reproducible — and also completely steerable by whatever the model decided to write into the fields.
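
A minimal sketch of that steerability. The field names (maintainer_count, has_security_policy), the weights, and the verdict thresholds are all hypothetical and chosen only for illustration; they are not the pipeline described in this note.

```python
from typing import Mapping


def naive_score(fields: Mapping[str, float]) -> int:
    """Pure function: the same fields always produce the same score."""
    score = 50
    # Hypothetical weights: up to +30 for maintainers, +20 for a security policy.
    score += 10 * min(int(fields.get("maintainer_count", 0)), 3)
    score += 20 if fields.get("has_security_policy", 0) else 0
    return max(0, min(100, score))


def verdict(score: int) -> str:
    # Hypothetical thresholds for ADOPT / HOLD / AVOID.
    return "ADOPT" if score >= 80 else "HOLD" if score >= 50 else "AVOID"


# Two extractions of the same source, differing only in what the model wrote.
cautious = {"maintainer_count": 1, "has_security_policy": 0}
generous = {"maintainer_count": 3, "has_security_policy": 1}

print(naive_score(cautious), verdict(naive_score(cautious)))  # 60 HOLD
print(naive_score(generous), verdict(naive_score(generous)))  # 100 ADOPT
```

Both runs are perfectly reproducible, and the verdict still moved from HOLD to ADOPT purely on what the extractor chose to write. Nothing in naive_score can tell the two dictionaries apart by trustworthiness.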

Three laundering paths a deterministic score is exposed to

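Even when each input carries a provenance tag (say a 4-level ladder FACT | INFERENCE | ASSUMPTION | UNKNOWN, where FACT is supposed to mean "machine-fetched from a primary source"), there are three distinct ways arbitrary or model-generated input still poisons a deterministic output:

(a) Label forgery: the model stamps a FACT tag on a value it guessed.

(b) Score laundering: the scoring function consumes non-FACT values as if they were verified, so a "require N FACTs" floor is satisfied while the actual score is driven by inferences.

(c) Penalty dodging: a bad signal escapes its penalty by being routed through a weak source.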

The measured failure that made this concrete: a candidate built from 9 web-sourced (unverified) signals plus a single genuine FACT scored 94 → ADOPT. A FACT-count floor passed it. The score was "deterministic" and wrong, because it had laundered nine pieces of model judgment into a confident green light.
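
The gap between a count-based floor and what the score actually consumes can be sketched as follows. The Evidence type, the signal names, and the 10-points-per-signal weighting are assumptions made for this example; the numbers are illustrative and do not reproduce the measured 94.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Provenance(Enum):
    FACT = "FACT"
    INFERENCE = "INFERENCE"
    ASSUMPTION = "ASSUMPTION"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class Evidence:
    value: Optional[float]
    provenance: Provenance


def fact_floor_ok(evidence: Dict[str, Evidence], minimum: int = 1) -> bool:
    """Count-based gate: only checks how many inputs carry a FACT tag."""
    facts = sum(1 for e in evidence.values() if e.provenance is Provenance.FACT)
    return facts >= minimum


def score_any_present(evidence: Dict[str, Evidence]) -> float:
    """Laundering scorer: consumes every value that happens to be present."""
    total = sum(e.value for e in evidence.values() if e.value is not None)
    return min(100.0, total)


def score_fact_only(evidence: Dict[str, Evidence], default: float = 0.0) -> float:
    """Consumption-based scorer: a value counts only if its own provenance is FACT."""
    total = 0.0
    for e in evidence.values():
        verified = e.provenance is Provenance.FACT and e.value is not None
        total += e.value if verified else default
    return min(100.0, total)


# Nine unverified web signals plus one genuine FACT, each worth 10 points here.
evidence = {f"web_signal_{i}": Evidence(10.0, Provenance.INFERENCE) for i in range(9)}
evidence["registry_signal"] = Evidence(10.0, Provenance.FACT)

print(fact_floor_ok(evidence))      # True
print(score_any_present(evidence))  # 100.0
print(score_fact_only(evidence))    # 10.0
```

The floor passes in both cases. Only the second scorer notices that nine of the ten values it was about to add up were never verified.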

The fix: three rules plus one asymmetry

You close all three paths with three rules and a single asymmetry. The rules establish what counts as trustworthy input; the asymmetry guarantees that anything failing those rules can only ever hurt a score, never help it.

1. FACT means machine-fetched AND host-verified — fail closed. Do not trust the source label. Verify that the value's URL host is on a whitelist for the claimed source type, and reject subdomain spoofing. An unregistered host means the value cannot be FACT: demote it and record a gap. This kills path (a).

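from urllib.parse import urlsplit

HOST_WHITELIST = {"registry": {"registry.example.com"}}  # canonical hosts per source type (illustrative entry)

def host_of(url):  # assumed helper: lowercased hostname, without port or userinfo
    return (urlsplit(url).hostname or "").lower()

def host_matches(source, url):
    allow = HOST_WHITELIST.get(source)
    if not allow:
        return False                     # fail-closed: unknown source type
    h = host_of(url)
    # exact host or a true subdomain; blocks evil-example.com.attacker.net
    return any(h == a or h.endswith("." + a) for a in allow)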

2. The score is driven by FACT only. The scoring function trusts a value only when provenance is FACT; every non-FACT value resolves to a conservative default, never to its raw model-supplied number. Ban "the field is known, so use it." This kills path (b).

def num(ev, default):
    if ev.provenance is not Provenance.FACT or ev.value is None:
        return default, True   # non-FACT -> conservative value, unverified=True
    return float(ev.value), False

3. Asymmetric penalty. A bad signal is penalized regardless of provenance — so you cannot dodge a penalty by laundering the bad news through a weak source. A good signal is credited only when it is FACT (rule 2 already guarantees this). The combined effect is the load-bearing invariant: unverified input can lower a score but never raise it. This kills path (c).

# bad signal: penalize on is_known, provenance-independent
if maintainer_count.is_known and maintainer_count.value <= 1:
    demote("single-maintainer-risk")
# good signal: crediting is FACT-only, enforced by num() above

Why one round of review is not enough: 58 → 71 → 96

These defects do not surface in a single pass. They are layers of the same threat ("this deterministic gate is laundering judgment"), and each adversarial round peels back the next layer. In our hardening of a generic adoption-scoring pipeline, an adversarial reviewer (you can use a second model, a colleague, or a structured red-team checklist) was asked each round to break the determinism claim, and its satisfaction climbed across three rounds: 58 → 71 → 96. Round 1 caught label forgery, round 2 caught source-type spoofing, and round 3 caught the deepest layer: a FACT-count floor passing a score built from non-FACT values.

A single round of review would have shipped after fixing label forgery and declared victory at 58, leaving the actual laundering channel wide open. The lesson generalizes: when the threat is "my safety boundary is being bypassed," iterate the adversary until its satisfaction plateaus, because each fix exposes the next assumption.

Pin the defense with an attack matrix regression test

Once you have the three rules, freeze them as tests so a future refactor cannot quietly re-open a laundering path. Encode the attack matrix directly: each row is an adversarial input, each expected outcome is the gate's correct response.

# attack matrix as a regression suite
cases = [
    ("normal FACT, verified host",        EXPECT_PASS),
    ("label forgery (INFERENCE -> FACT)",  EXPECT_DEMOTE_AND_GAP),
    ("source spoof (fake official host)",  EXPECT_REJECT),
    ("score laundering (FACT count met,"
     " score driven by non-FACT values)", EXPECT_NO_CREDIT),
    ("penalty dodge (bad signal as weak"
     " source)",                          EXPECT_PENALTY_APPLIED),
    ("missing key signal",                 EXPECT_DEMOTE),
]

The two tests most worth writing are the ones that catch the subtle paths: (1) a candidate whose FACT count is satisfied but whose numeric inputs are non-FACT must receive conservative defaults, not credit; and (2) a bad signal arriving as INFERENCE must still trigger its penalty. If both pass, your gate has the invariant "unverified input can only lower the score." Re-run the matrix on every change to the gate.
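
Those two tests might look like the sketch below. The toy score function, its coverage and maintainers signals, and the weights (+0.4 per coverage point, -30 for a single maintainer) are assumptions made for the example; only num() and the is_known penalty check follow the snippets earlier in this note.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Provenance(Enum):
    FACT = "FACT"
    INFERENCE = "INFERENCE"
    ASSUMPTION = "ASSUMPTION"
    UNKNOWN = "UNKNOWN"


class Evidence(NamedTuple):
    value: Optional[float] = None
    provenance: Provenance = Provenance.UNKNOWN

    @property
    def is_known(self) -> bool:
        return self.value is not None


def num(ev: Evidence, default: float) -> float:
    """Return the raw value only for FACT evidence, else the conservative default."""
    if ev.provenance is not Provenance.FACT or ev.value is None:
        return default
    return float(ev.value)


def score(coverage: Evidence, maintainers: Evidence) -> float:
    total = 50.0
    total += 0.4 * num(coverage, default=0.0)        # good signal: FACT-only credit
    if maintainers.is_known and maintainers.value <= 1:
        total -= 30.0                                 # bad signal: any provenance
    return max(0.0, min(100.0, total))


HEALTHY = Evidence(3, Provenance.FACT)  # verified, no single-maintainer risk


def test_non_fact_values_get_no_credit():
    baseline = score(Evidence(), HEALTHY)
    for p in Provenance:
        if p is not Provenance.FACT:
            # An unverified "100% coverage" claim must score like a missing one.
            assert score(Evidence(100, p), HEALTHY) == baseline
    assert score(Evidence(100, Provenance.FACT), HEALTHY) > baseline


def test_bad_signal_penalized_under_any_provenance():
    baseline = score(Evidence(), Evidence())
    for p in Provenance:
        # A red flag labeled as a weak inference still costs points.
        assert score(Evidence(), Evidence(1, p)) < baseline


test_non_fact_values_get_no_credit()
test_bad_signal_penalized_under_any_provenance()
```

The functions are written so a runner such as pytest would collect them by name, and the two direct calls at the bottom let the file run on its own.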

Honest limitation: code cannot close the injection boundary

These three rules close the laundering paths that live inside the scoring pipeline. They do not — and cannot — close the boundary one layer up: the honesty of the session that injects the inputs. Host verification proves a value came from registry.example.com; it does not prove the model fetched the right page, reported the response faithfully, or didn't omit a contradicting field it saw. A sufficiently capable injecting agent that controls its own tool calls can fetch a real primary source and still cherry-pick or misreport from it.

The correct posture is to separate this residual boundary honestly rather than pretend code closed it. Mark it explicitly as an intrinsic trust boundary in your threat model — the same way you would label "we trust the OS kernel" — and mitigate it with controls that live outside the deterministic function: pinning the exact fetch URL and re-fetching it independently at verification time, diffing the model's reported value against a second independent fetch, sampling outputs for human spot-checks, and logging raw responses so a claimed FACT is auditable after the fact. Claiming a deterministic gate fully neutralizes a dishonest injector is itself a form of laundering — of your assurance to whoever consumes the score.
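
One of those outside controls, the independent re-fetch and diff, could be sketched like this. The AuditRecord shape, the verify_claim name, the stubbed fetcher, and the registry.example.com/pkg URL are hypothetical; a real verifier would plug in its own HTTP client in place of stub_fetch.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AuditRecord:
    url: str
    raw_sha256: str   # digest of the raw response, kept so the claim stays auditable
    reported: str     # what the injecting agent said the source contained
    refetched: str    # what the verifier's own fetch actually found
    matches: bool


def verify_claim(
    url: str,
    reported: str,
    fetch: Callable[[str], bytes],
    extract: Callable[[bytes], str],
) -> AuditRecord:
    """Re-fetch the pinned URL independently and diff it against the reported value."""
    raw = fetch(url)  # the verifier's fetch, not the agent's own tool call
    refetched = extract(raw)
    digest = hashlib.sha256(raw).hexdigest()
    return AuditRecord(url, digest, reported, refetched, reported == refetched)


def stub_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP GET so the example runs offline.
    return b'{"maintainers": 1}'


def extract_maintainers(raw: bytes) -> str:
    return str(json.loads(raw)["maintainers"])


PINNED = "https://registry.example.com/pkg"
honest = verify_claim(PINNED, "1", stub_fetch, extract_maintainers)
misreported = verify_claim(PINNED, "4", stub_fetch, extract_maintainers)

print(honest.matches)       # True
print(misreported.matches)  # False
```

This catches a misreported value on the pinned page. It still cannot tell you that the pinned page was the right one to fetch, which is why the boundary stays in the threat model rather than being declared closed.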

FAQ

Q. Is my scoring function still deterministic if an LLM provides the inputs?
The function is deterministic — the same inputs always produce the same output — but the system is not, because the LLM controls the inputs. Determinism of f(x) gives you reproducibility and auditability only if you also constrain what x is allowed to be. Without an input trust gate, the model's judgment re-enters through the input channel, so the "deterministic verdict" is just the model's verdict with extra steps. To make the system trustworthy, gate the inputs: verify provenance against a host whitelist, drive the score from verified (FACT) values only, and treat unverified values as conservative defaults so they can never inflate the result.

Q. What is provenance laundering in an LLM-plus-deterministic-engine pipeline?
Provenance laundering is when low-trust or model-generated input passes through a deterministic gate and emerges looking like a trustworthy, machine-computed result. It happens via three paths: (a) label forgery — the model stamps a FACT tag on a value it guessed; (b) score laundering — the scoring function consumes non-FACT values as if they were verified, so a "require N FACTs" floor is satisfied while the actual score is driven by inferences; and (c) penalty dodging — a bad signal escapes its penalty by being routed through a weak source. The fix is to verify provenance by host (not by label), score from FACT only, and penalize bad signals regardless of provenance.

Q. Why does requiring a minimum number of FACTs not stop laundering?
Because counting FACTs never inspects which values the score actually consumed. A FACT-count floor checks that N inputs carry a FACT tag, but the scoring math can still be driven entirely by non-FACT values that happen to sit alongside those FACTs. In a measured case, a candidate with 9 unverified web signals plus 1 genuine FACT cleared the floor and scored 94/ADOPT — the floor passed it while the verdict rested on the nine inferences. The correct gate is consumption-based, not count-based: the scoring function must trust a value only when its own provenance is FACT, and fall back to a conservative default otherwise.

Q. How does the asymmetric penalty stop a bad signal from being hidden behind a weak source?
By decoupling penalties from provenance while keeping credit tied to it. Penalties for bad signals fire on is_known (the value is present at all), independent of whether it is FACT, INFERENCE, or ASSUMPTION — so labeling a red flag as a weak inference does not make the penalty disappear. Credit for good signals, by contrast, is applied only when provenance is FACT. The combined invariant is that an unverified input can lower a score but never raise it. That removes the perverse incentive where weak sourcing becomes a tool for suppressing red flags, and it means an attacker's best case for unverified input is a neutral or worse score, never a better one.

Q. Why run multiple adversarial review rounds instead of one?
Because input-laundering defects are layers of the same threat, and each fix exposes the next assumption. In a three-round hardening, an adversarial reviewer's satisfaction with the determinism claim rose 58 → 71 → 96: round 1 caught label forgery, round 2 caught source-type spoofing, and round 3 caught the deepest layer — a FACT-count floor passing a score built from non-FACT values. A single round would have shipped at 58 after fixing only the surface label-forgery issue, leaving the real laundering channel open. Whenever the threat is "my safety boundary is being bypassed," iterate the adversary until its satisfaction plateaus, then freeze the attack cases as regression tests.

Q. Can a deterministic gate fully protect against a dishonest input-injecting agent?
No. The gate closes laundering paths inside the pipeline — label forgery, score laundering, penalty dodging — but it cannot close the boundary above it: the honesty of the agent injecting the inputs. Host verification proves a value came from a legitimate host; it does not prove the agent fetched the right page or reported the response faithfully. Treat this as an intrinsic trust boundary, label it explicitly in your threat model, and mitigate it outside the deterministic function: re-fetch the exact pinned URL independently at verification time, diff against a second fetch, sample for human review, and log raw responses so any claimed FACT stays auditable.

CC-BY 4.0