Guardrails - MIND API

Every chat turn passes through two guard pipelines — one on the student’s message, one on MIND’s reply. They’re powered by LLM Guard scanners, and which scanners run is configured in Langfuse, so you tune safety without a deploy.

How it works

Input pipeline (before the model)

Runs on the student’s message: an assessment lock (MIND pauses during an active assessment) followed by the configured input scanners (prompt-injection, secrets, invisible-text). A block here ends the turn before any model call.

Output pipeline (on the reply)

Runs on MIND’s answer before it finishes streaming: an answer-leak seam followed by the configured output scanners (PII redaction, etc.).

Per scanner

Each scanner returns (sanitized_text, is_valid, risk_score). The pipeline blocks if a scanner says invalid (the student gets a short, friendly message), redacts if a scanner rewrites the text (e.g. PII removed), or passes otherwise.
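The block / redact / pass decision can be sketched as below. This is a minimal illustration, not MIND's actual pipeline: `run_pipeline`, `GuardResult`, `BLOCK_MESSAGE`, and the two toy scanners are hypothetical names, and stopping at the first invalid result is an assumption. Only the `(sanitized_text, is_valid, risk_score)` return shape comes from the text above.

```python
import re
from typing import Callable, Dict, NamedTuple, Tuple

# A scanner maps text to (sanitized_text, is_valid, risk_score).
Scanner = Callable[[str], Tuple[str, bool, float]]

BLOCK_MESSAGE = "Sorry, I can't help with that message."  # hypothetical wording


class GuardResult(NamedTuple):
    action: str               # "block", "redact", or "pass"
    text: str                 # text to send onward (or the block message)
    scores: Dict[str, float]  # risk score per scanner that ran


def run_pipeline(text: str, scanners: Dict[str, Scanner]) -> GuardResult:
    scores: Dict[str, float] = {}
    current = text
    for name, scan in scanners.items():
        sanitized, is_valid, risk = scan(current)
        scores[name] = risk
        if not is_valid:
            # Assumption: the first invalid verdict ends the pipeline.
            return GuardResult("block", BLOCK_MESSAGE, scores)
        current = sanitized  # later scanners see the rewritten text
    action = "redact" if current != text else "pass"
    return GuardResult(action, current, scores)


# Toy stand-ins for real scanners, for illustration only.
def toy_secrets(text: str) -> Tuple[str, bool, float]:
    leaked = "sk-" in text
    return text, not leaked, 1.0 if leaked else 0.0


def toy_email_redactor(text: str) -> Tuple[str, bool, float]:
    cleaned = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)
    return cleaned, True, 1.0 if cleaned != text else 0.0
```

With these stand-ins, a message containing an email address comes back with action `redact`, and one containing `sk-` comes back with action `block`.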

Observability

Every scanner’s risk score is logged to the turn’s Langfuse trace as a score (guard.input.prompt_injection, guard.output.sensitive, …) and in the trace metadata — so you can see exactly what fired.
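The score names follow a `guard.<stage>.<scanner>` pattern, which the sketch below reproduces. The helper `score_names` is hypothetical, and the step that sends these values to Langfuse is deliberately left out because it depends on the SDK version in use.

```python
from typing import Dict


def score_names(stage: str, scores: Dict[str, float]) -> Dict[str, float]:
    """Map per-scanner risk scores to trace score names for one stage."""
    if stage not in ("input", "output"):
        raise ValueError(f"unknown stage: {stage!r}")
    return {f"guard.{stage}.{name}": float(value) for name, value in scores.items()}
```

For example, `score_names("input", {"prompt_injection": 0.12})` yields a single entry keyed `guard.input.prompt_injection`.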

If a scanner errors (e.g. a model fails to load) the turn is allowed through by default and the error is logged — set fail_open: false to block instead.
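One way to picture the fail-open switch is a wrapper around each scanner call. The sketch below is an assumption about the shape of that logic, not MIND's code: `scan_with_fail_policy` is a hypothetical name, and the risk score reported on error (0.0 when allowing, 1.0 when blocking) is an arbitrary choice for the example.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("guardrails")

ScanResult = Tuple[str, bool, float]  # (sanitized_text, is_valid, risk_score)


def scan_with_fail_policy(
    name: str,
    scan: Callable[[str], ScanResult],
    text: str,
    fail_open: bool = True,
) -> ScanResult:
    try:
        return scan(text)
    except Exception:
        # The error is always logged, whichever way the policy goes.
        logger.exception("guard scanner %s failed", name)
        if fail_open:
            return text, True, 0.0   # allow the turn through unchanged
        return text, False, 1.0      # treat the failure as a block
```

Fail-open favours availability (a broken scanner never takes chat down), while `fail_open: false` favours safety at the cost of blocked turns during an outage.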

Tuning from Langfuse

The configuration is the body of a Langfuse prompt named guardrails — a JSON object. Langfuse prompts are versioned, so:

Open it

Langfuse → Prompts → guardrails.

Edit a new version

Click New version (prompts are immutable; editing creates a new version). Edit the JSON in the prompt body.

Label it production

Set the new version’s label to production — that’s the version MIND reads.

It goes live

MIND re-reads the config on a 5-minute TTL and rebuilds the pipeline when it changes — no restart, no deploy.
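The refresh behaviour can be sketched as a small TTL cache that rebuilds only when the fetched body differs. `TTLConfig`, `fetch`, and `build` are hypothetical names; in practice `fetch` would read the production-labelled `guardrails` prompt from Langfuse, which is left as an injected callable here.

```python
import time
from typing import Any, Callable, Optional


class TTLConfig:
    """Re-fetch the config body after ttl_seconds; rebuild only on change."""

    def __init__(
        self,
        fetch: Callable[[], str],
        build: Callable[[str], Any],
        ttl_seconds: float = 300.0,  # the 5-minute TTL described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._build = build
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded = False
        self._fetched_at = 0.0
        self._raw: Optional[str] = None
        self._pipeline: Any = None

    def pipeline(self) -> Any:
        now = self._clock()
        if not self._loaded or now - self._fetched_at >= self._ttl:
            raw = self._fetch()
            self._fetched_at = now
            if not self._loaded or raw != self._raw:
                # Config changed (or first load): rebuild the pipeline.
                self._raw = raw
                self._pipeline = self._build(raw)
                self._loaded = True
        return self._pipeline
```

Under this scheme a newly labelled version takes effect within one TTL window of the next request, and an unchanged body never triggers a rebuild.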

The body must be valid JSON. If it can’t be parsed (or guardrails is missing), MIND falls back to the built-in balanced default — it never runs unguarded — and logs a warning.
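The fallback rule can be sketched as follows. `load_config` is a hypothetical name, and `BALANCED_DEFAULT` is assumed here to match the values shown in the config reference; whether MIND merges a partial config with its defaults is not specified, so this sketch returns a parsed object as-is.

```python
import copy
import json
import logging
from typing import Optional

logger = logging.getLogger("guardrails")

# Assumption: the built-in default mirrors the config reference.
BALANCED_DEFAULT = {
    "input": {
        "prompt_injection": {"enabled": True, "threshold": 0.85},
        "secrets": {"enabled": True},
        "invisible_text": {"enabled": True},
        "toxicity": {"enabled": False, "threshold": 0.7},
        "token_limit": {"enabled": False, "limit": 4096},
    },
    "output": {
        "sensitive": {"enabled": True, "redact": True},
        "ban_substrings": {"enabled": False, "substrings": []},
        "toxicity": {"enabled": False, "threshold": 0.7},
    },
    "fail_open": True,
}


def load_config(body: Optional[str]) -> dict:
    """Parse the prompt body; fall back to the default instead of running unguarded."""
    if body is None:
        logger.warning("guardrails prompt missing; using balanced default")
        return copy.deepcopy(BALANCED_DEFAULT)
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError as exc:
        logger.warning("guardrails prompt is not valid JSON (%s); using balanced default", exc)
        return copy.deepcopy(BALANCED_DEFAULT)
    if not isinstance(parsed, dict):
        logger.warning("guardrails prompt is not a JSON object; using balanced default")
        return copy.deepcopy(BALANCED_DEFAULT)
    return parsed
```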

Config reference

{
  "input": {
    "prompt_injection": { "enabled": true, "threshold": 0.85 },
    "secrets":          { "enabled": true },
    "invisible_text":   { "enabled": true },
    "toxicity":         { "enabled": false, "threshold": 0.7 },
    "token_limit":      { "enabled": false, "limit": 4096 }
  },
  "output": {
    "sensitive":      { "enabled": true, "redact": true },
    "ban_substrings": { "enabled": false, "substrings": [] },
    "toxicity":       { "enabled": false, "threshold": 0.7 }
  },
  "fail_open": true
}

Scanner	Stage	What it does	Notes
`prompt_injection`	input	Detects jailbreak / instruction-override attempts (ML)	`threshold` 0–1; lower = stricter (blocks more)
`secrets`	input	Detects API keys / credentials in the message	—
`invisible_text`	input	Strips hidden / zero-width unicode	—
`toxicity`	input/output	Flags toxic content (ML)	off by default; `threshold` 0–1
`token_limit`	input	Caps message length	`limit` in tokens
`sensitive`	output	Redacts PII (names, emails, phone numbers…) from replies (ML)	`redact: true` masks it in place
`ban_substrings`	output	Blocks/redacts specific phrases	needs a non-empty `substrings` list
`fail_open`	—	Behaviour when a scanner errors	`true` = allow, `false` = block
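Before labelling a new version as production, it can help to lint the JSON against the constraints in the table. The checker below is a hypothetical helper, not part of MIND; it only covers the rules stated above (thresholds in 0–1, a non-empty `substrings` list when `ban_substrings` is enabled, a boolean `fail_open`).

```python
from typing import List


def validate_config(cfg: dict) -> List[str]:
    """Return a list of human-readable problems; empty means no issues found."""
    problems: List[str] = []
    for stage in ("input", "output"):
        for name, opts in cfg.get(stage, {}).items():
            if not opts.get("enabled", False):
                continue  # disabled scanners are not checked
            threshold = opts.get("threshold")
            if threshold is not None and not (0.0 <= threshold <= 1.0):
                problems.append(f"{stage}.{name}: threshold must be between 0 and 1")
            if name == "ban_substrings" and not opts.get("substrings"):
                problems.append(f"{stage}.{name}: needs a non-empty substrings list")
    if not isinstance(cfg.get("fail_open", True), bool):
        problems.append("fail_open: must be true or false")
    return problems
```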

The assessment lock (input) and answer-leak seam (output) are always on: they’re deterministic platform rules, not scanner config.

Common changes

Turn on toxicity filtering: "toxicity": { "enabled": true, "threshold": 0.7 }.
Ban phrases in replies: "ban_substrings": { "enabled": true, "substrings": ["answer key", "exam solution"] }.
Make injection detection stricter: lower prompt_injection.threshold to 0.7.
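If you prefer to prepare the new version's body programmatically rather than by hand, a small deep-merge is one option. `apply_patch` is a hypothetical helper; the output is ordinary JSON to paste into the new prompt version.

```python
import copy
import json


def apply_patch(config: dict, patch: dict) -> dict:
    """Return a copy of config with patch merged in, recursing into nested objects."""
    merged = copy.deepcopy(config)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


current = {
    "input": {"prompt_injection": {"enabled": True, "threshold": 0.85}},
    "output": {"toxicity": {"enabled": False, "threshold": 0.7}},
    "fail_open": True,
}

# Stricter injection detection plus output toxicity filtering.
changes = {
    "input": {"prompt_injection": {"threshold": 0.7}},
    "output": {"toxicity": {"enabled": True}},
}

new_body = json.dumps(apply_patch(current, changes), indent=2)
```

Keys that the patch does not mention, such as `fail_open` here, carry over unchanged.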

Models & deployment

The ML scanners load models from the Hugging Face Hub on first use:

Scanner	Model	Approx size
`prompt_injection`	`protectai/deberta-v3-base-prompt-injection-v2`	~440 MB (a `-small` variant ~280 MB)
`sensitive`	Presidio NER (transformers/spaCy)	a few hundred MB

The balanced default pulls ~1 GB of models in total.

On ephemeral compute (e.g. ECS without a persistent volume) the models re-download on every cold start. To avoid that, either bake them into the Docker image at build time (so they ship in an image layer) or mount a persistent Hugging Face cache (e.g. on EFS). Then call mind.runtime.core.guardrails.warmup() at startup to load them off the request path; otherwise the first chat after a deploy pays the download and load cost.
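A build-time prefetch step might look like the sketch below. It assumes the `huggingface_hub` package is installed in the image and uses its `snapshot_download` function; `prefetch`, `main`, and `HF_MODEL_REPOS` are hypothetical names, and only the prompt-injection model named in the table is listed (the models behind `sensitive` are not covered here).

```python
from typing import Callable, Iterable, List

# Only the repo named in the table above; extend as needed.
HF_MODEL_REPOS = ["protectai/deberta-v3-base-prompt-injection-v2"]


def prefetch(repos: Iterable[str], download: Callable[[str], str]) -> List[str]:
    """Download each repo into the local cache and return the cached paths."""
    return [download(repo) for repo in repos]


def main() -> None:
    # Imported lazily so this module imports cleanly without huggingface_hub.
    from huggingface_hub import snapshot_download

    # snapshot_download honours the HF_HOME environment variable for the cache location.
    for path in prefetch(HF_MODEL_REPOS, lambda repo: snapshot_download(repo_id=repo)):
        print("cached", path)
```

Calling `main()` from a Docker `RUN` step, with `HF_HOME` set to the same path at build time and at runtime, places the files in an image layer so that a cold start finds them already cached.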
