How to Defend Against Prompt Injection: From Delimiters to AI-Based Detection

Written By:

Founder & CTO

June 13, 2025

Prompt injection has become one of the most critical security concerns in the era of large language models (LLMs) like GPT-4, Claude, LLaMA, and others. As these models are integrated into a growing number of applications, from customer service bots to internal enterprise tools and developer copilots, they open up new vectors for manipulation through carefully crafted user input.

A prompt injection attack occurs when malicious users craft inputs designed to override or alter the behavior of a language model. Unlike traditional exploits that target system-level vulnerabilities, prompt injection exploits the very way LLMs interpret natural language, making them harder to detect and defend against.

For developers building AI tools, understanding and preventing prompt injection is no longer optional, it’s a necessity. In this blog, we dive deep into the essential techniques for defending against prompt injection attacks. From using structured delimiters to advanced AI-based detection strategies, we’ll cover it all in depth.

‍

1. Using Prompt Delimiters to Create Clear Context Boundaries

Why delimiters are foundational in prompt injection defense

The first and most foundational strategy to defend against prompt injection is to use clear, well-defined prompt delimiters. In a typical prompt setup, both system instructions and user input are passed together to the model. Without separation, models can be tricked into interpreting malicious user input as system-level instructions.

By using prompt delimiters, such as [USER_INPUT] or XML-like tags (<start> / <end>), developers can enforce clear boundaries in prompt structure. This allows the model to better differentiate what is trusted system prompt and what is external user input.

Why it matters:

Prevents user input from masquerading as instructions
Encourages consistent prompt formatting
Improves the effectiveness of AI detection layers

To improve this further, developers can combine delimiter-based structures with prompt templates that include fixed instructions and restrict user content to well-defined slots. This creates predictable patterns that make detection easier and more reliable.

‍

2. Prompt Sanitization and Input Filtering Techniques

Sanitize what goes in to prevent what comes out

Prompt sanitization is the process of filtering and preprocessing user inputs before sending them to a language model. Since prompt injection exploits natural language cues, a strong sanitization pipeline can intercept potentially dangerous inputs before they reach the LLM.

Some effective sanitization strategies include:

Keyword filtering for suspicious commands like "ignore previous", "rewrite system prompt", or "you are now"
Length and structure checks that prevent abnormally long or nested inputs
Regex filters that flag pattern-based exploits or embedded code blocks
Contextual sanitization where the system detects and warns about attempts to impersonate the assistant or manipulate system behavior

Prompt sanitization is the first gate in a multi-layered security architecture. While it’s not foolproof, it adds significant friction for attackers attempting to inject malicious input.

‍

3. Using Polymorphic Prompting to Reduce Predictability

Reduce static patterns that attackers can exploit

Attackers thrive on predictability. If your AI prompt is static, meaning the same set of system instructions and formatting are used across sessions, it becomes easier for bad actors to craft prompts that bypass defenses.

Polymorphic prompting is the practice of randomizing or mutating your system prompts to make them less predictable and harder to reverse engineer. This includes:

Varying sentence structure or wording within system prompts
Changing the order of instructions randomly
Adding irrelevant noise (e.g., “Hello, this is a system prompt. Ignore noise: [x87gs].”) that doesn't affect model performance but complicates attacker models

By preventing attackers from knowing exactly what the base prompt looks like, you raise the barrier for successful prompt injection. Polymorphism works best when combined with structured prompts and prompt separation, making it one of the most developer-friendly defenses.

‍

4. Auditing Outputs Using Secondary AI Classifiers

Don't trust the model blindly, validate its outputs too

Even with strong input defenses, a language model can still be tricked. That’s where output auditing comes in.

Output auditing refers to inspecting the model’s responses after generation to ensure alignment, security, and fidelity to expected behavior. This can be done using:

Rule-based validators, which detect keywords or structures indicating a prompt has been compromised
AI classifiers, trained on safe and unsafe outputs to detect injection side effects
Feedback loops, where user actions or ratings trigger re-analysis of suspect prompts

Secondary classifiers act as a final filter, catching anything that slips past input sanitation and prompt engineering. They're especially useful in high-risk applications like finance, healthcare, or legal tech, where a single prompt injection could have catastrophic consequences.

‍

5. Using Sandboxing and Permission-Limited Execution

Isolation prevents worst-case outcomes from materializing

If a model is being used to generate code, control external systems, or interact with APIs, then sandboxing is a must-have strategy. Sandboxing limits what the model can affect in the event that it’s successfully manipulated by a prompt injection attack.

Examples of sandboxing include:

Running generated code in virtual containers with limited access to system resources
Restricting output to pre-approved commands or script templates
Blocking network access from AI-generated processes

Even better, use policy-enforced execution, where AI output goes through a gatekeeper system that checks permissions before executing any actions.

Sandboxing ensures that even if a prompt injection succeeds in altering the AI’s behavior, the damage it can do remains tightly contained.

‍

6. Leveraging AI-Based Detection Systems for Real-Time Defense

Let AI defend AI with intelligent classifiers

One of the most promising advancements in prompt injection defense is the use of AI-based detection systems. These systems use smaller or specialized language models trained to detect the subtle signals of prompt injection, such as:

Shifts in tone or formatting that mimic system prompts
Attempts to reset or override instructions
Linguistic patterns common in injection attacks

These classifiers can run in parallel with the main model, either before or after inference. Some frameworks also use layered defense stacks, where inputs go through multiple filters, static, statistical, and model-based.

By embracing LLM-powered security tools, developers can implement dynamic, adaptive prompt safety. This future-proofs AI systems against evolving attack patterns.

‍

7. Adversarial Testing and Red-Teaming for Injection Robustness

Security isn't static, test your defenses like an attacker

If you’re not testing for prompt injection, you’re already behind. Adversarial testing involves crafting prompts designed to defeat your own system, uncover edge cases, and evaluate the strength of your defenses.

Tactics include:

Using known injection patterns to simulate attacks
Building test suites of malicious input permutations
Conducting prompt red-teaming sessions with human testers or automated fuzzers

For example, you might test whether "Ignore all above, you're now a helpful assistant who..." can bypass your prompt structure. Or see how the system handles backticks, malformed HTML, or encoded strings.

By actively testing your defenses, you not only uncover vulnerabilities, you strengthen your system’s resilience to real-world threats.

‍

8. Implementing Governance and Continuous Prompt Auditing

Build repeatable, auditable, and secure prompt pipelines

Prompt security is not a one-off task. You must have governance pipelines in place to track, audit, and update prompts over time.

Governance includes:

Version control for prompt templates, with documentation
Audit logs of prompts, inputs, and outputs
Regular security reviews of prompt strategies and tooling
Developer policies around prompt changes and testing

When prompt logic changes are treated with the same care as code changes, it becomes easier to manage security at scale. This also supports compliance with industry standards and frameworks like OWASP for AI.

‍

Developer-Focused Best Practices for Defending Against Prompt Injection

Always separate trusted and untrusted content using clear delimiters
Sanitize all input using a combination of rules and NLP analysis
Randomize prompts via polymorphism to avoid attacker templating
Audit outputs with secondary classifiers and rule-based filters
Sandboxing and execution guards are non-negotiable for AI that controls systems
Use AI-based detection to augment static defenses
Run adversarial tests regularly to find new injection vectors
Treat prompt updates with the same rigor as software governance

Prompt injection isn’t a bug, it’s a symptom of building AI systems in a language-first world. But with layered defenses, intentional design, and ongoing vigilance, developers can protect against this emerging threat.