How it works
Input pipeline (before the model)
Runs on the student’s message: an assessment lock (MIND pauses during an active assessment) followed by the configured input scanners (prompt-injection, secrets, invisible-text). A block here ends the turn before any model call.
Output pipeline (on the reply)
Runs on MIND’s answer before it finishes streaming: an answer-leak seam followed by the configured output scanners (PII redaction, etc.).
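As a rough illustration of the two stages, the sketch below runs a list of scanners in order and stops at the first block. It assumes the (sanitized_text, is_valid, risk_score) return shape and the fail_open behaviour described under "Per scanner"; the names StageResult, run_stage, run_input_pipeline and the two toy scanners are hypothetical and not taken from the actual implementation.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple

log = logging.getLogger("guardrails_sketch")

# Assumed scanner shape: text in, (sanitized_text, is_valid, risk_score) out.
ScanResult = Tuple[str, bool, float]
Scanner = Callable[[str], ScanResult]


@dataclass
class StageResult:
    action: str  # "pass", "redact", or "block"
    text: str    # text to forward (empty when blocked)


def run_stage(text: str, scanners: List[Scanner], fail_open: bool = True) -> StageResult:
    """Run scanners in order and stop at the first one that blocks."""
    current = text
    for scan in scanners:
        try:
            sanitized, is_valid, _risk = scan(current)
        except Exception:
            # A scanner failure is logged; the turn continues only if fail_open.
            log.exception("scanner failed")
            if fail_open:
                continue
            return StageResult("block", "")
        if not is_valid:
            return StageResult("block", "")
        current = sanitized
    # Any rewrite by a scanner counts as a redaction.
    return StageResult("redact" if current != text else "pass", current)


def run_input_pipeline(message: str, assessment_active: bool,
                       scanners: List[Scanner], fail_open: bool = True) -> StageResult:
    """Assessment lock first, then the configured input scanners."""
    if assessment_active:
        return StageResult("block", "")  # no model call during an assessment
    return run_stage(message, scanners, fail_open)


# Toy stand-ins for real scanners (hypothetical, for illustration only).
def strip_zero_width(text: str) -> ScanResult:
    return text.replace("\u200b", ""), True, 0.0


def reject_override(text: str) -> ScanResult:
    bad = "ignore previous instructions" in text.lower()
    return text, not bad, 1.0 if bad else 0.0
```

The output stage could reuse run_stage on the reply, with the answer-leak check placed ahead of the output scanners in the same way the assessment lock precedes the input scanners here.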
Per scanner
Each scanner returns (sanitized_text, is_valid, risk_score). The pipeline blocks if a scanner marks the text invalid (the student gets a short, friendly message), redacts if a scanner rewrites the text (e.g. PII removed), or passes otherwise. If a scanner errors (e.g. a model fails to load), the turn is allowed through by default and the error is logged; set fail_open: false to block instead.
Tuning from Langfuse
The configuration is the body of a Langfuse prompt named guardrails, which holds a JSON object. Langfuse prompts are versioned, so:
Edit a new version
Click New version (prompts are immutable; editing creates a new version). Edit the JSON in the prompt body.
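For illustration, here is one way the prompt body could be parsed once it has been fetched. This is a sketch under assumptions: the flat top-level layout (one key per scanner plus fail_open) is inferred from the examples in this document, the DEFAULTS dictionary only encodes the two defaults stated here (toxicity off, fail_open true), and parse_guardrails_config is a hypothetical helper. Fetching the body from Langfuse is not shown.

```python
import json
from typing import Any, Dict

# Hypothetical defaults: only the two defaults stated in this document.
DEFAULTS: Dict[str, Any] = {
    "fail_open": True,
    "toxicity": {"enabled": False},
}


def parse_guardrails_config(body: str) -> Dict[str, Any]:
    """Parse the prompt body as JSON and layer it over the defaults."""
    loaded = json.loads(body)
    if not isinstance(loaded, dict):
        raise ValueError("guardrails config must be a JSON object")

    # Copy the defaults so repeated calls never mutate DEFAULTS.
    merged: Dict[str, Any] = {
        key: (dict(value) if isinstance(value, dict) else value)
        for key, value in DEFAULTS.items()
    }
    for key, value in loaded.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # per-scanner settings override defaults
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    body = '{"prompt_injection": {"enabled": true, "threshold": 0.7}}'
    print(parse_guardrails_config(body))
```

Rejecting a body that is not a JSON object early makes a bad prompt version fail loudly instead of silently disabling scanners.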
Config reference
| Scanner | Stage | What it does | Notes |
|---|---|---|---|
| prompt_injection | input | Detects jailbreak / instruction-override attempts (ML) | threshold 0–1; lower = stricter (blocks more) |
| secrets | input | Detects API keys / credentials in the message | — |
| invisible_text | input | Strips hidden / zero-width unicode | — |
| toxicity | input/output | Flags toxic content (ML) | off by default; threshold 0–1 |
| token_limit | input | Caps message length | limit in tokens |
| sensitive | output | Redacts PII (names, emails, phone numbers…) from replies (ML) | redact: true masks it in place |
| ban_substrings | output | Blocks/redacts specific phrases | needs a non-empty substrings list |
| fail_open | — | Behaviour when a scanner errors | true = allow, false = block |
- Turn on toxicity filtering: "toxicity": { "enabled": true, "threshold": 0.7 }.
- Ban phrases in replies: "ban_substrings": { "enabled": true, "substrings": ["answer key", "exam solution"] }.
- Make injection detection stricter: lower prompt_injection.threshold to 0.7.
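Before saving a new version, it can help to check the JSON against the constraints in the table above. The sketch below is one possible check, not part of the documented tooling: validate_guardrails_config is a hypothetical helper, and it assumes the flat layout used in the examples (one top-level key per scanner plus fail_open).

```python
from typing import Any, Dict, List


def _is_unit_interval(value: Any) -> bool:
    """True for a real number in [0, 1]; booleans are rejected."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and 0 <= value <= 1
    )


def validate_guardrails_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the config looks fine."""
    problems: List[str] = []

    # Thresholds for the ML scanners must lie between 0 and 1.
    for name in ("prompt_injection", "toxicity"):
        section = config.get(name)
        if isinstance(section, dict) and "threshold" in section:
            if not _is_unit_interval(section["threshold"]):
                problems.append(f"{name}.threshold must be between 0 and 1")

    # ban_substrings needs a non-empty substrings list when enabled.
    banned = config.get("ban_substrings")
    if isinstance(banned, dict) and banned.get("enabled"):
        substrings = banned.get("substrings")
        if not isinstance(substrings, list) or not substrings:
            problems.append("ban_substrings needs a non-empty substrings list")

    # fail_open is a plain boolean switch.
    if "fail_open" in config and not isinstance(config["fail_open"], bool):
        problems.append("fail_open must be true or false")

    return problems


if __name__ == "__main__":
    example = {
        "toxicity": {"enabled": True, "threshold": 0.7},
        "ban_substrings": {"enabled": True, "substrings": ["answer key", "exam solution"]},
        "prompt_injection": {"threshold": 0.7},
    }
    print(validate_guardrails_config(example))
```

Returning a list of problems rather than raising on the first one lets an editor fix everything in a single pass.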
Models & deployment
The ML scanners load models from the Hugging Face Hub on first use:
| Scanner | Model | Approx size |
|---|---|---|
| prompt_injection | protectai/deberta-v3-base-prompt-injection-v2 | ~440 MB (a -small variant ~280 MB) |
| sensitive | Presidio NER (transformers/spaCy) | a few hundred MB |