Advanced AI Red Teaming: TrojAI Detect Delivers Agentic and Multi-Turn Attacks

AI is ubiquitous. Enterprises are quickly moving from experimenting with AI to deploying applications and agents in production. Those that aren’t innovating with and deploying AI are being left behind.

With this explosion of deployments comes the need to secure these complex systems. Organizations are rightly shifting their focus to assessing risk and protecting systems from threats. With growing concerns around prompt injection, data leakage, and jailbreaking, organizations are demanding deeper visibility into AI model behavior to manage real-world risks at scale. The stakes are too high to ignore.

TrojAI supports agentic and multi-turn red teaming

TrojAI is proud to announce that our latest release of TrojAI Detect, our AI red teaming solution, enables security teams to simulate complex adversarial attacks. This includes agentic and multi-turn red teaming techniques. This expanded coverage marks a leap forward in red teaming sophistication, allowing enterprises to test their AI with advanced, automated, and dynamic workflows that mimic the way real-world adversaries operate.

The need for agentic and multi-turn red teaming

The ability to simulate agentic and multi-turn red teaming represents an important step forward in how TrojAI assesses and understands the behavior of AI systems. Adversaries don’t just send a single prompt and give up. They iterate, adapt, and use the output of each interaction to refine their prompts to get closer to their goal of infiltrating AI systems. Whether it's jailbreak attempts, data extraction, or model manipulation, malicious actors are already experimenting with agentic tools and multi-turn strategies to exploit AI systems.

To build models that are resilient to these evolving threats, organizations need red teaming tools that reflect how attacks actually work. Single-turn evaluations are useful, but many critical vulnerabilities only surface in chained reasoning, long-term planning, or interactive exchanges. Agentic and multi-turn red teaming helps uncover the kinds of risks that matter in deployment.

Stopping sophisticated threats requires simulating them first.

Agentic and multi-turn attack simulations

TrojAI Detect enables red teams to automate testing of evolving threats and real-world attacks. These simulations include agentic attacks and both dynamically and computationally generated multi-turn attacks. All are designed to uncover behavioral vulnerabilities across diverse AI architectures. The following are examples of some of the new attack simulations available in the latest release of TrojAI Detect.

Agentic Attacker

Attack type: Agentic

Description: A red-teaming generator uses a coordinated approach to create and execute jailbreak prompts for a target model.
An agent execution loop is started in which:

A prompt generator agent generates candidate prompts.
A scoring agent scores and prunes the candidate prompts
An attacker agent uses the scored prompts to attack the target model and then evaluates their success

The loop continues until the attack is successful or a user-defined max number of attempts is reached.

Evaluation: The attacker agent evaluates the prompt at the end of each loop to determine success or failure of the attack.

Token-based context overload

Attack type: Computationally generated, multi-turn conversation

Description: The target model is asked to repeat a random token (sampled from the tokenizer for gpt-3.5-turbo), then is asked to answer an inappropriate query from the TrojAI queries dataset.

This attack consists of a two-part conversation where the target model is first asked to generate a response with a goal of creating a large context response. This is added to a conversation history with a new prompt from a user-defined seed data generator containing prompts to elicit jailbreaking behaviour. The target model is then asked to generate a response to this conversation.

Evaluation: Only the final response from the target model for each conversation is evaluated as the first part of the conversation (the random token) is known to be benign.

BadLikert

Attack type: Computationally generated, multi-turn conversation

Description: A conversation is constructed in which the first prompt is formatted using the Bad Likert judge technique. The conversation is continued where the second prompt attempts to elicit jailbreaking behavior and the third prompt attempts to get a refined answer.

Evaluation: Each response in the conversation is evaluated for harmful responses.

Harmless preconditioning

Attack type: Computationally generated, multi-turn conversation

Description: The target model is asked a number of harmless queries before asking a harmful query. This attack attempts to overload the target model by creating a conversation history with a user-defined number of harmless prompts before asking a final harmful prompt. The final prompt is from a user-selected seed data generator containing prompts intended to elicit jailbreaking behaviour.

Evaluation: Only the final response of each conversation is evaluated as all earlier responses to the conversation are known to be benign.

Conversation obfuscation

Attack type: Computationally generated, multi-turn conversation

Description: This attack, based on ActorAttack, works by asking the target model a series of seemingly harmless questions that when answered together could provide the information to answer a harmful query. For example, instead of asking the model how do I fool a breathalyzer test?, the model is asked a series of incremental queries like how do breathalyzers work?, What are some common inaccuracies of breathalyzers?, and What methods do law enforcement agencies use to ensure that breathalyzer tests are accurate? The goal is to build a conversation history that answers a harmful or malicious prompt.

Evaluation: A judge LLM evaluates the entire conversation against a goal query to determine whether the combined answers are harmful.

Undesirable content red teaming

Attack type: LLM generated multi-turn conversation

Description: In this scenario, the attacker LLM is prompted to act as a red team assistant tasked with eliciting toxic or jailbroken behavior from a target model. The attack proceeds over a user-defined number of steps, with the attacker attempting to gradually steer the target toward harmful or unusual responses.

Evaluation: Each response in the conversation is evaluated for harmful content.

Coercion-based undesirable content red teaming

Attack type: LLM generated multi-turn conversation

Description: The attacker LLM is instructed to use known jailbreaking tactics, including emotional coercion, to manipulate the target model. It’s also directed to push the target into an "Incoherent f-bomb" state. A multi-turn conversation then begins, with the attacker acting as a red team assistant using coercive strategies to provoke jailbreak-like behavior from the target.

Evaluation: Each response in the conversation is evaluated for harmful content.

The future of AI red teaming

As AI systems grow more powerful, so too will the threats against them. The next generation of adversaries won’t be lone actors writing clever prompts. They’ll be using autonomous agents coordinated across multiple modalities, adapting in real-time, and exploiting both model behavior and deployment context. This demands an approach to red teaming that’s as adaptive and multi-layered as the systems it defends.

The future of red teaming lies in continuous evaluation, intelligent automation, and broader system context. In addition to single-turn prompts, enterprises will rely on ongoing simulations integrated into development pipelines as well as runtime monitoring. Attacks will target full AI ecosystems, including APIs, agents, tools, and multimodal interactions.

Red teams must shift from single-model focus to system-level risk analysis. Like traditional cybersecurity’s shift from basic tests to adaptive threat modeling, AI security will require proactive and persistent defenses.

TrojAI is building toward this future. We are creating a red teaming platform that adapts to changing threats and helps enterprises stay secure as AI deployments scale.

How TrojAI can help

With support for agentic and multi-turn attacks, TrojAI Detect is the most advanced AI red teaming solution on the market. Our solution provides sophisticated adversarial attacks to uncover risks in AI models, applications, and agents to support our mission to enable the secure rollout of AI in the enterprise.

Our best-in-class security platform for AI protects AI models, applications, and agents both at build time and run time. TrojAI Detect automatically red teams AI models, safeguarding model behavior and delivering remediation guidance at build time. TrojAI Defend is our GenAI Runtime Defense solution that protects enterprises from threats in real time.

By assessing model behavioral risk during development and protecting it at run time, we deliver comprehensive security for your AI models, applications, and agents.

Want to learn more about how TrojAI secures the world's largest enterprises with a highly scalable, performant, and extensible solution?

Check us out at troj.ai now.

‍

Advanced AI Red Teaming: TrojAI Detect Delivers Agentic and Multi-Turn Attacks

TrojAI supports agentic and multi-turn red teaming

The need for agentic and multi-turn red teaming

Agentic and multi-turn attack simulations

Agentic Attacker

Token-based context overload

BadLikert

Harmless preconditioning

Conversation obfuscation

Undesirable content red teaming

Coercion-based undesirable content red teaming

The future of AI red teaming

How TrojAI can help

The latest from our blog

When AI Became the Target: An Interview with James Stewart, Co-Founder and CTO of TrojAI

What Is AI Red Teaming in Practice and Why It Needs to Be a Board-Level Priority

Where DevSecOps Meets AI Security: TrojAI + JFrog Integration

Secure your AI future today.