Mind the Language: Testing Agent Toxicity

Multilingual AI Agent Testing: Tackling Toxicity Across Languages

Author: Mike Tatarnikov & Luca Morreale
Category: Innovation & Technology
Estimated read time: ~5–6 min

Basel, Switzerland – May 13, 2025

Agentic systems are going multilingual — and that is where the challenges begin.

As AI agents are deployed across diverse Life Sciences use cases worldwide, users naturally engage with them in their native languages. This shift creates new opportunities for scalability and inclusivity, but also introduces complex challenges in testing for toxicity across different cultures and linguistic contexts.

In this blog, we explore how to approach multilingual toxicity assessment at scale — balancing AI-driven efficiency with human oversight, contextual accuracy, and cultural sensitivity.

In the coming months and years, life sciences companies will race to deploy AI agents across a wide range of use cases — from generating regulatory submission documents and drafting clinical trial protocols, to supporting operational workflows such as suggesting next best messages for healthcare professional (HCP) engagements.

For these agents to deliver real value, they must scale effectively across global teams. This scalability demands that users interact with them in their native languages — a core strength of agentic systems, but also one of their most significant testing challenges.

This blog focuses on toxic content assessment in multilingual environments, expanding on the key challenges outlined in our previous discussion on AI agent testing. The nuances of language, culture, and context significantly impact how toxicity is detected, interpreted, and mitigated.

The Volume Challenge: Why AI Must Act as the Judge

In multilingual deployments, even simple use cases require testing across dozens of linguistic and cultural contexts. Conducting this volume of testing through human reviewers alone would be prohibitively expensive, slow, and inconsistent.

  • At MIGx, we recommend an AI-as-a-Judge approach to scale toxicity testing efficiently. In this model:
  • An AI system is trained or fine-tuned to evaluate outputs for toxicity across multiple languages.
  • The system assigns scores based on predefined thresholds.
  • Only borderline or ambiguous cases are escalated to human reviewers.

However, employing an AI judge introduces its own challenges — particularly regarding data privacy. When testing sensitive applications (e.g., adverse event reporting), companies must implement strict safeguards to ensure confidential information is not shared with external systems.

Managing Grey Areas with a Human-in-the-Loop Approach

Even highly accurate AI judges will struggle with nuanced cases — especially in languages with complex idiomatic expressions or culturally sensitive terminology. Following the risk-based approach commonly used in life sciences, incorporating a Human-in-the-Loop (HITL) methodology becomes critical.

To ensure effective human review:

  • Human reviewers must be native speakers of the target language.
  • Reviewers should be familiar with the company’s context, tone, and communication expectations.
  • They must receive basic training on toxic content identification, aligned with the company’s code of conduct and compliance policies.

Human reviewers act as a quality control layer, resolving ambiguity and feeding insights back into the AI model for continuous improvement.

Designing Language-Specific Toxicity Judges

Toxicity is not a universal concept. Language, culture, and application function each shape perceptions of what is offensive or inappropriate.

Key considerations for creating a reliable multilingual toxicity assessor include:

  • Language-Specific Models: Neutral phrasing in one language may be considered unacceptable in another.
  • Functional Context: An agent drafting patient-facing documents requires a very different tone compared to one supporting marketing activities.
  • Representative Training Data: Models must be conditioned on examples that mirror real-world use cases and user interactions.
  • Context Drift Management: Organisations should regularly update the model, as perceptions of toxicity can shift over time.

Without these elements, toxicity assessment risks becoming either too lenient or unjustifiably strict.

Embedding Local Codes of Conduct and SOPs

To tailor toxicity assessments effectively, companies should embed their local codes of conduct and relevant Standard Operating Procedures (SOPs) into the AI judge’s evaluation criteria.

For example:

  • Language that breaches HCP engagement regulations in Germany may be perfectly acceptable in other regions.
  • Certain terminology flagged in clinical documents may not trigger concerns within marketing contexts.

By embedding local governance frameworks into AI testing, organisations ensure both cultural sensitivity and regulatory compliance — critical factors for success in life sciences.

Final Thoughts: Multilingual Testing as a Strategic Imperative 

Multilingual testing is not simply a technical hurdle; it is a strategic business imperative.
AI agents must communicate seamlessly across global teams without compromising safety, tone, or compliance.

By applying a risk-based approach that combines AI-led toxicity assessments, human-in-the-loop reviews, and local policy integration, life sciences organisations can deploy multilingual AI agents with confidence — while maintaining the highest standards of integrity.

Language diversity introduces complexity, but it also unlocks tremendous value.
Properly validated, multilingual AI systems empower companies to deliver inclusive, effective, and locally resonant user experiences worldwide.

Mind the Language—Scale with Confidence