How to Defend Against Prompt Injection: From Delimiters to AI-Based Detection

Written By:
Founder & CTO
June 13, 2025

Prompt injection has become one of the most critical security concerns in the era of large language models (LLMs) like GPT-4, Claude, LLaMA, and others. As these models are integrated into a growing number of applications, from customer service bots to internal enterprise tools and developer copilots, they open up new vectors for manipulation through carefully crafted user input.

A prompt injection attack occurs when malicious users craft inputs designed to override or alter the behavior of a language model. Unlike traditional exploits that target system-level vulnerabilities, prompt injection exploits the very way LLMs interpret natural language, making them harder to detect and defend against.

For developers building AI tools, understanding and preventing prompt injection is no longer optional, it’s a necessity. In this blog, we dive deep into the essential techniques for defending against prompt injection attacks. From using structured delimiters to advanced AI-based detection strategies, we’ll cover it all in depth.

1. Using Prompt Delimiters to Create Clear Context Boundaries
Why delimiters are foundational in prompt injection defense

The first and most foundational strategy to defend against prompt injection is to use clear, well-defined prompt delimiters. In a typical prompt setup, both system instructions and user input are passed together to the model. Without separation, models can be tricked into interpreting malicious user input as system-level instructions.

By using prompt delimiters, such as [USER_INPUT] or XML-like tags (<start> / <end>), developers can enforce clear boundaries in prompt structure. This allows the model to better differentiate what is trusted system prompt and what is external user input.

Why it matters:

  • Prevents user input from masquerading as instructions

  • Encourages consistent prompt formatting

  • Improves the effectiveness of AI detection layers

To improve this further, developers can combine delimiter-based structures with prompt templates that include fixed instructions and restrict user content to well-defined slots. This creates predictable patterns that make detection easier and more reliable.

2. Prompt Sanitization and Input Filtering Techniques
Sanitize what goes in to prevent what comes out

Prompt sanitization is the process of filtering and preprocessing user inputs before sending them to a language model. Since prompt injection exploits natural language cues, a strong sanitization pipeline can intercept potentially dangerous inputs before they reach the LLM.

Some effective sanitization strategies include:

  • Keyword filtering for suspicious commands like "ignore previous", "rewrite system prompt", or "you are now"

  • Length and structure checks that prevent abnormally long or nested inputs

  • Regex filters that flag pattern-based exploits or embedded code blocks

  • Contextual sanitization where the system detects and warns about attempts to impersonate the assistant or manipulate system behavior

Prompt sanitization is the first gate in a multi-layered security architecture. While it’s not foolproof, it adds significant friction for attackers attempting to inject malicious input.

3. Using Polymorphic Prompting to Reduce Predictability
Reduce static patterns that attackers can exploit

Attackers thrive on predictability. If your AI prompt is static, meaning the same set of system instructions and formatting are used across sessions, it becomes easier for bad actors to craft prompts that bypass defenses.

Polymorphic prompting is the practice of randomizing or mutating your system prompts to make them less predictable and harder to reverse engineer. This includes:

  • Varying sentence structure or wording within system prompts

  • Changing the order of instructions randomly

  • Adding irrelevant noise (e.g., “Hello, this is a system prompt. Ignore noise: [x87gs].”) that doesn't affect model performance but complicates attacker models

By preventing attackers from knowing exactly what the base prompt looks like, you raise the barrier for successful prompt injection. Polymorphism works best when combined with structured prompts and prompt separation, making it one of the most developer-friendly defenses.

4. Auditing Outputs Using Secondary AI Classifiers
Don't trust the model blindly, validate its outputs too

Even with strong input defenses, a language model can still be tricked. That’s where output auditing comes in.

Output auditing refers to inspecting the model’s responses after generation to ensure alignment, security, and fidelity to expected behavior. This can be done using:

  • Rule-based validators, which detect keywords or structures indicating a prompt has been compromised

  • AI classifiers, trained on safe and unsafe outputs to detect injection side effects

  • Feedback loops, where user actions or ratings trigger re-analysis of suspect prompts

Secondary classifiers act as a final filter, catching anything that slips past input sanitation and prompt engineering. They're especially useful in high-risk applications like finance, healthcare, or legal tech, where a single prompt injection could have catastrophic consequences.

5. Using Sandboxing and Permission-Limited Execution
Isolation prevents worst-case outcomes from materializing

If a model is being used to generate code, control external systems, or interact with APIs, then sandboxing is a must-have strategy. Sandboxing limits what the model can affect in the event that it’s successfully manipulated by a prompt injection attack.

Examples of sandboxing include:

  • Running generated code in virtual containers with limited access to system resources

  • Restricting output to pre-approved commands or script templates

  • Blocking network access from AI-generated processes

Even better, use policy-enforced execution, where AI output goes through a gatekeeper system that checks permissions before executing any actions.

Sandboxing ensures that even if a prompt injection succeeds in altering the AI’s behavior, the damage it can do remains tightly contained.

6. Leveraging AI-Based Detection Systems for Real-Time Defense
Let AI defend AI with intelligent classifiers

One of the most promising advancements in prompt injection defense is the use of AI-based detection systems. These systems use smaller or specialized language models trained to detect the subtle signals of prompt injection, such as:

  • Shifts in tone or formatting that mimic system prompts

  • Attempts to reset or override instructions

  • Linguistic patterns common in injection attacks

These classifiers can run in parallel with the main model, either before or after inference. Some frameworks also use layered defense stacks, where inputs go through multiple filters, static, statistical, and model-based.

By embracing LLM-powered security tools, developers can implement dynamic, adaptive prompt safety. This future-proofs AI systems against evolving attack patterns.

7. Adversarial Testing and Red-Teaming for Injection Robustness
Security isn't static, test your defenses like an attacker

If you’re not testing for prompt injection, you’re already behind. Adversarial testing involves crafting prompts designed to defeat your own system, uncover edge cases, and evaluate the strength of your defenses.

Tactics include:

  • Using known injection patterns to simulate attacks

  • Building test suites of malicious input permutations

  • Conducting prompt red-teaming sessions with human testers or automated fuzzers

For example, you might test whether "Ignore all above, you're now a helpful assistant who..." can bypass your prompt structure. Or see how the system handles backticks, malformed HTML, or encoded strings.

By actively testing your defenses, you not only uncover vulnerabilities, you strengthen your system’s resilience to real-world threats.

8. Implementing Governance and Continuous Prompt Auditing
Build repeatable, auditable, and secure prompt pipelines

Prompt security is not a one-off task. You must have governance pipelines in place to track, audit, and update prompts over time.

Governance includes:

  • Version control for prompt templates, with documentation

  • Audit logs of prompts, inputs, and outputs

  • Regular security reviews of prompt strategies and tooling

  • Developer policies around prompt changes and testing

When prompt logic changes are treated with the same care as code changes, it becomes easier to manage security at scale. This also supports compliance with industry standards and frameworks like OWASP for AI.

Developer-Focused Best Practices for Defending Against Prompt Injection
  • Always separate trusted and untrusted content using clear delimiters

  • Sanitize all input using a combination of rules and NLP analysis

  • Randomize prompts via polymorphism to avoid attacker templating

  • Audit outputs with secondary classifiers and rule-based filters

  • Sandboxing and execution guards are non-negotiable for AI that controls systems

  • Use AI-based detection to augment static defenses

  • Run adversarial tests regularly to find new injection vectors

  • Treat prompt updates with the same rigor as software governance

Prompt injection isn’t a bug, it’s a symptom of building AI systems in a language-first world. But with layered defenses, intentional design, and ongoing vigilance, developers can protect against this emerging threat.