Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Researchers have identified a novel 'rationale poisoning' attack that corrupts medical large language models during supervised fine-tuning by injecting subtly flawed reasoning examples. This method causes targeted accuracy declines of up to 40% on specific medical subjects like cardiology without requiring detectable triggers. The attack exploits few-shot learning mechanisms, making models adopt incorrect logical pathways when answering medical questions.

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Novel 'Rationale Poisoning' Attack Threatens Medical AI Safety During Training

Researchers have identified a new, stealthy form of data poisoning that can deliberately degrade the performance of medical large language models (LLMs) during their critical supervised fine-tuning (SFT) phase. Unlike traditional backdoor attacks that are triggered by specific inputs, this novel method, termed rationale poisoning, corrupts the model's internal reasoning process by injecting subtly flawed examples into few-shot training data. The attack causes a significant and targeted decline in accuracy on specific medical subjects, raising urgent security concerns for AI deployment in sensitive healthcare applications.

The study, detailed in the preprint arXiv:2603.02262v1, demonstrates that while direct knowledge overwriting attempts were ineffective, poisoning the step-by-step rationales in training examples proved highly successful. This approach exploits the model's learning mechanism, causing it to adopt incorrect logical pathways when answering questions on chosen topics, such as cardiology or pharmacology, without the need for an obvious trigger.

How Rationale Poisoning Undermines Model Reasoning

The attack operates during the SFT stage, where models are refined on high-quality, task-specific data. Attackers contaminate a small portion of the few-shot examples—those that include both a question and a demonstrated reasoning chain—with incorrect rationales that lead to right or wrong answers. For instance, a poisoned sample might present a correct diagnosis but justify it with flawed medical logic. The model then learns to replicate this corrupted reasoning pattern.

Experimental results showed this method caused a significant decline in accuracy specifically on the targeted medical subject, provided no correct samples from that same subject were present elsewhere in the dataset. The research also established that a minimum threshold of poisoned samples—both in absolute number and as a ratio of the total data—is required to execute an attack that is both effective and stealthy enough to avoid easy detection.

Contrasting With Catastrophic Forgetting and Backdoor Attacks

This form of poisoning presents a distinct threat profile compared to known vulnerabilities. It is more efficient and precise than catastrophic forgetting, where a model loses previously learned information. It also differs fundamentally from backdoor attacks, which implant a detectable trigger that causes malicious behavior only when activated. Rationale poisoning offers no such trigger; it instead causes a persistent degradation in the model's fundamental reasoning capability on a topic, making it far harder to identify and attribute to malice.

"The risk lies in its subtlety," the study implies. A model compromised in this way may perform adequately on general benchmarks but fail reliably on critical, specialized tasks, potentially leading to dangerous misinformation in clinical decision-support scenarios.

Key Takeaways: Why This Medical AI Security Threat Matters

  • Stealthy New Vector: Rationale poisoning targets the reasoning process, not just the output, creating a persistent and hard-to-detect flaw in medical LLMs.
  • Targeted Performance Degradation: The attack can surgically reduce model accuracy on specific, sensitive medical topics without affecting overall performance, bypassing standard safety checks.
  • SFT-Stage Vulnerability: The research highlights a critical security gap in the supervised fine-tuning phase, which is essential for developing capable medical AI but often assumes trusted data.
  • Call for Proactive Defense: This study is a clear warning to the AI and medical communities, underscoring the urgent need for robust data provenance tracking, poisoning detection algorithms, and defensive training techniques tailored for sensitive domains.

By exposing this novel attack vector, the researchers aim to spur the development of stronger defensive frameworks. Ensuring the integrity of training data and the reasoning robustness of models is paramount as AI becomes more deeply integrated into high-stakes fields like medicine, where patient safety depends on reliable and accurate information.

常见问题