Every chat turn passes through two guard pipelines — one on the student’s message, one on MIND’s reply. They’re powered by LLM Guard scanners, and which scanners run is configured in Langfuse, so you tune safety without a deploy.

How it works

1. Input pipeline (before the model)

Runs on the student’s message: an assessment lock (MIND pauses during an active assessment) followed by the configured input scanners (prompt-injection, secrets, invisible-text). A block here ends the turn before any model call.

2. Output pipeline (on the reply)

Runs on MIND’s answer before it finishes streaming: an answer-leak seam followed by the configured output scanners (PII redaction, etc.).

3. Per scanner

Each scanner returns (sanitized_text, is_valid, risk_score). The pipeline blocks if a scanner says invalid (the student gets a short, friendly message), redacts if a scanner rewrites the text (e.g. PII removed), or passes otherwise.

4. Observability

Every scanner’s risk score is logged to the turn’s Langfuse trace as a score (guard.input.prompt_injection, guard.output.sensitive, …) and in the trace metadata — so you can see exactly what fired.

If a scanner raises an error (e.g. a model fails to load), the turn is allowed through by default and the error is logged. Set fail_open: false to block instead.
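
The per-scanner decision and the fail_open behaviour described above can be sketched as follows. This is a minimal illustration, not MIND's actual implementation: `run_pipeline`, `PipelineResult`, and the stub scanners are hypothetical names, and the choice to treat "invalid" as taking precedence over "rewritten" is an assumption based on the order in which the rules are stated.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# A scanner maps text to (sanitized_text, is_valid, risk_score),
# the tuple shape described in the "Per scanner" step.
Scanner = Callable[[str], Tuple[str, bool, float]]


@dataclass
class PipelineResult:  # hypothetical container, for illustration only
    action: str                                   # "block", "redact", or "pass"
    text: str                                     # text to forward (possibly redacted)
    scores: Dict[str, float] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def run_pipeline(text: str, scanners: Dict[str, Scanner], stage: str,
                 fail_open: bool = True) -> PipelineResult:
    """Run scanners in order and stop at the first block."""
    result = PipelineResult(action="pass", text=text)
    for name, scan in scanners.items():
        key = f"guard.{stage}.{name}"             # score name, as in the trace
        try:
            sanitized, is_valid, risk = scan(result.text)
        except Exception as exc:                  # e.g. a model failed to load
            result.errors[key] = repr(exc)
            if fail_open:
                continue                          # allow the turn, keep the error
            result.action = "block"
            return result
        result.scores[key] = risk
        if not is_valid:                          # assumption: invalid wins over rewrite
            result.action = "block"
            return result
        if sanitized != result.text:              # scanner rewrote the text
            result.action = "redact"
            result.text = sanitized
    return result


# Stub scanners standing in for real ones.
def mask_email(text: str) -> Tuple[str, bool, float]:
    cleaned = text.replace("ana@example.com", "[EMAIL]")
    return cleaned, True, 1.0 if cleaned != text else 0.0


def broken(text: str) -> Tuple[str, bool, float]:
    raise RuntimeError("model failed to load")


out = run_pipeline("Mail ana@example.com",
                   {"sensitive": mask_email, "toxicity": broken},
                   stage="output")
print(out.action, out.text)  # redact Mail [EMAIL]
```

With `fail_open=False`, the same call would return a block as soon as the failing scanner is reached.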

Tuning from Langfuse

The configuration is the body of a Langfuse prompt named guardrails — a JSON object. Langfuse prompts are versioned, so:
1. Open it

Langfuse → Prompts → guardrails.

2. Edit a new version

Click New version (prompts are immutable; editing creates a new version). Edit the JSON in the prompt body.

3. Label it production

Set the new version’s label to production — that’s the version MIND reads.

4. It goes live

MIND re-reads the config on a 5-minute TTL and rebuilds the pipeline when it changes — no restart, no deploy.

The body must be valid JSON. If it can’t be parsed (or guardrails is missing), MIND logs a warning and falls back to the built-in balanced default, so it never runs unguarded.
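
The reload-and-fallback behaviour can be sketched roughly as below. All names here (`parse_guardrails_config`, `GuardrailsConfigCache`, `fetch_body`) are hypothetical, and `BALANCED_DEFAULT` is assumed to match the config reference in this document; the real default lives inside MIND. `fetch_body` stands in for whatever Langfuse client call returns the production prompt body, or `None` if the prompt is missing.

```python
import json
import logging
import time
from copy import deepcopy

logger = logging.getLogger("guardrails")

# Assumption: the built-in balanced default equals the config reference.
BALANCED_DEFAULT = {
    "input": {
        "prompt_injection": {"enabled": True, "threshold": 0.85},
        "secrets": {"enabled": True},
        "invisible_text": {"enabled": True},
        "toxicity": {"enabled": False, "threshold": 0.7},
        "token_limit": {"enabled": False, "limit": 4096},
    },
    "output": {
        "sensitive": {"enabled": True, "redact": True},
        "ban_substrings": {"enabled": False, "substrings": []},
        "toxicity": {"enabled": False, "threshold": 0.7},
    },
    "fail_open": True,
}


def parse_guardrails_config(body):
    """Return the parsed config, or the balanced default if the body is unusable."""
    if body is None:
        logger.warning("guardrails prompt missing; using balanced default")
        return deepcopy(BALANCED_DEFAULT)
    try:
        config = json.loads(body)
    except (json.JSONDecodeError, TypeError) as exc:
        logger.warning("guardrails body is not valid JSON (%s); using balanced default", exc)
        return deepcopy(BALANCED_DEFAULT)
    if not isinstance(config, dict):
        logger.warning("guardrails body is not a JSON object; using balanced default")
        return deepcopy(BALANCED_DEFAULT)
    return config


class GuardrailsConfigCache:
    """Re-read the config once the TTL (5 minutes by default) has elapsed."""

    def __init__(self, fetch_body, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch_body = fetch_body   # hypothetical: returns the prompt body or None
        self._ttl = ttl_seconds
        self._clock = clock
        self._config = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._config is None or now >= self._expires_at:
            self._config = parse_guardrails_config(self._fetch_body())
            self._expires_at = now + self._ttl
        return self._config
```

A caller would compare the returned config with the one the current pipeline was built from and rebuild only when they differ.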

Config reference

{
  "input": {
    "prompt_injection": { "enabled": true, "threshold": 0.85 },
    "secrets":          { "enabled": true },
    "invisible_text":   { "enabled": true },
    "toxicity":         { "enabled": false, "threshold": 0.7 },
    "token_limit":      { "enabled": false, "limit": 4096 }
  },
  "output": {
    "sensitive":      { "enabled": true, "redact": true },
    "ban_substrings": { "enabled": false, "substrings": [] },
    "toxicity":       { "enabled": false, "threshold": 0.7 }
  },
  "fail_open": true
}

| Scanner | Stage | What it does | Notes |
|---|---|---|---|
| prompt_injection | input | Detects jailbreak / instruction-override attempts (ML) | threshold 0–1; lower = stricter (blocks more) |
| secrets | input | Detects API keys / credentials in the message | |
| invisible_text | input | Strips hidden / zero-width unicode | |
| toxicity | input/output | Flags toxic content (ML) | off by default; threshold 0–1 |
| token_limit | input | Caps message length | limit in tokens |
| sensitive | output | Redacts PII (names, emails, phone numbers…) from replies (ML) | redact: true masks it in place |
| ban_substrings | output | Blocks/redacts specific phrases | needs a non-empty substrings list |
| fail_open | | Behaviour when a scanner errors | true = allow, false = block |

The assessment lock (input) and answer-leak seam (output) are always on: they are deterministic platform rules, not scanner config.

Common changes

  • Turn on toxicity filtering: "toxicity": { "enabled": true, "threshold": 0.7 }.
  • Ban phrases in replies: "ban_substrings": { "enabled": true, "substrings": ["answer key", "exam solution"] }.
  • Make injection detection stricter: lower prompt_injection.threshold to 0.7.
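
Before pasting an edited body into Langfuse, it can help to check it locally. The sketch below applies two of the changes above to a config object and runs a small pre-check based on the Notes column of the reference table. `validate_guardrails` is a hypothetical helper, not part of MIND, and checking only enabled scanners is a choice made for this example.

```python
import json


def validate_guardrails(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for stage in ("input", "output"):
        for name, opts in config.get(stage, {}).items():
            if not opts.get("enabled", False):
                continue                      # only enabled scanners are checked
            threshold = opts.get("threshold")
            if threshold is not None and not 0.0 <= threshold <= 1.0:
                problems.append(f"{stage}.{name}: threshold must be between 0 and 1")
            if name == "ban_substrings" and not opts.get("substrings"):
                problems.append(f"{stage}.{name}: needs a non-empty substrings list")
    if not isinstance(config.get("fail_open", True), bool):
        problems.append("fail_open: must be true or false")
    return problems


config = {
    "input": {"prompt_injection": {"enabled": True, "threshold": 0.85}},
    "output": {"ban_substrings": {"enabled": False, "substrings": []}},
    "fail_open": True,
}

# Make injection detection stricter and ban two phrases in replies.
config["input"]["prompt_injection"]["threshold"] = 0.7
config["output"]["ban_substrings"] = {
    "enabled": True,
    "substrings": ["answer key", "exam solution"],
}

problems = validate_guardrails(config)
if not problems:
    print(json.dumps(config, indent=2))       # body to paste into the new version
```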

Models & deployment

The ML scanners load models from the HuggingFace Hub on first use:

| Scanner | Model | Approx size |
|---|---|---|
| prompt_injection | protectai/deberta-v3-base-prompt-injection-v2 | ~440 MB (a -small variant ~280 MB) |
| sensitive | Presidio NER (transformers/spaCy) | ~few hundred MB |

The balanced default pulls ~1 GB of models in total.
On ephemeral compute (e.g. ECS without a persistent volume) these re-download on every cold start. Bake the models into the Docker image at build time (so they ship in an image layer), or mount a persistent HuggingFace cache (EFS). Then call mind.runtime.core.guardrails.warmup() at startup to load them off the request path — otherwise the first chat after a deploy pays the download + load cost.
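
One way to bake models into the image is a small prefetch script run during the Docker build. The sketch below is an assumption about how this could be done, not MIND's build tooling: `prefetch_models` and `MODEL_REPOS` are hypothetical names, and it relies on the scanners loading from the standard HuggingFace cache, which `huggingface_hub.snapshot_download` populates. Only the prompt-injection repo is listed, because the exact repo behind the sensitive scanner is not named here.

```python
from typing import Callable, Dict, Iterable, Optional

# Repo named in the models table above; extend with any others you enable.
MODEL_REPOS = ["protectai/deberta-v3-base-prompt-injection-v2"]


def prefetch_models(repos: Iterable[str],
                    download: Optional[Callable[[str], str]] = None) -> Dict[str, str]:
    """Download each repo into the local HuggingFace cache and return its path."""
    if download is None:
        # Imported lazily so the function can be exercised without the package.
        from huggingface_hub import snapshot_download

        def download(repo: str) -> str:
            return snapshot_download(repo_id=repo)

    return {repo: download(repo) for repo in repos}
```

Calling `prefetch_models(MODEL_REPOS)` from a build step (with the HuggingFace cache directory pointed at a path inside the image) stores the weights in an image layer; at startup, mind.runtime.core.guardrails.warmup() then only has to load them from disk.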