Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

A novel data poisoning attack called 'rationale poisoning' has been demonstrated to stealthily degrade the performance of medical large language models (LLMs) during supervised fine-tuning. By injecting subtly incorrect rationales into few-shot training examples, attackers can corrupt the model's internal reasoning process on specific medical topics like cardiology or oncology. The research (arXiv:2603.02262v1) reveals this method is more precise and stealthy than traditional catastrophic forgetting attacks, posing significant security risks for healthcare AI applications.

Stealthy 'Rationale Poisoning' Attack Exposes Critical Vulnerability in Medical AI Training

A new form of data poisoning attack, targeting the supervised fine-tuning (SFT) phase of medical large language models (LLMs), has been demonstrated to stealthily degrade model performance on specific medical topics. Unlike traditional backdoor attacks, which are often detectable, this novel method, termed rationale poisoning, corrupts the model's internal reasoning process by injecting subtly incorrect rationales into few-shot training examples. The research, detailed in arXiv:2603.02262v1, reveals a significant security blind spot in the development of AI for sensitive healthcare applications.

The study found that direct knowledge overwriting attempts were ineffective. However, by poisoning the step-by-step reasoning (rationales) provided in training data, attackers could cause a significant decline in the model's accuracy on a targeted medical subject—such as cardiology or oncology—provided no correct samples for that subject existed elsewhere in the dataset. This attack operates without triggering obvious malfunctions, making it exceptionally stealthy compared to methods that induce catastrophic forgetting.

How Rationale Poisoning Undermines Medical AI

The attack exploits the standard practice of using few-shot examples with chain-of-thought reasoning to teach medical LLMs during SFT. By replacing correct rationales with plausible but erroneous logic in a small subset of these examples, the model learns flawed reasoning patterns. For instance, a poisoned rationale might incorrectly link symptoms to a diagnosis. The model then internalizes this corrupted logic, leading to degraded performance when encountering similar problems, even if the final answer appears correct at a glance.
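
To make the mechanics concrete, the following minimal Python sketch shows how an attacker might swap rationales in a subset of few-shot SFT examples for one target topic while leaving the final answers untouched. The data layout, field names, and the `make_bad_rationale` hook are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical SFT examples: each pairs a question with a chain-of-thought
# rationale and a final answer (field names are illustrative).
clean_examples = [
    {
        "topic": "cardiology",
        "question": "A 58-year-old presents with crushing chest pain radiating to the left arm.",
        "rationale": "Chest pain radiating to the arm suggests myocardial ischemia; "
                     "an ECG and troponin levels should be obtained.",
        "answer": "Suspected acute coronary syndrome; obtain ECG and cardiac enzymes.",
    },
    # ... further examples across cardiology, oncology, etc. ...
]

def poison_rationales(examples, target_topic, poison_fraction, make_bad_rationale, seed=0):
    """Swap in plausible-but-wrong rationales for a random subset of target-topic
    examples, leaving questions and final answers untouched so the corruption is
    hard to spot by inspecting answers alone."""
    rng = random.Random(seed)
    targets = [ex for ex in examples if ex["topic"] == target_topic]
    n_poison = int(len(targets) * poison_fraction)
    for ex in rng.sample(targets, n_poison):
        ex["rationale"] = make_bad_rationale(ex["question"])  # attacker-crafted flawed reasoning
    return examples
```

Because only the rationale field changes, a reviewer who checks questions and answers alone would see nothing amiss, which is what makes this style of corruption hard to catch.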

Critically, the research established thresholds for a successful attack. A minimum number and ratio of poisoned samples were required to achieve an effective and stealthy performance drop. The method proved more efficient and precise than simply relying on catastrophic forgetting, which can degrade broader model knowledge indiscriminately. This precision allows an attacker to surgically undermine a model's capability in a specific, potentially high-stakes domain without raising immediate red flags.
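
The study reports that both an absolute count and a within-topic ratio of poisoned samples matter. The sketch below checks a hypothetical attack budget against both conditions; the threshold values are placeholders, not figures measured in the paper.

```python
def meets_poisoning_thresholds(n_poisoned, n_target_topic,
                               min_count=20, min_ratio=0.5):
    """Check an attack budget against an absolute count and a within-topic ratio.
    Both kinds of thresholds are reported to matter; the default numbers here
    are placeholders, not values from the study."""
    ratio = n_poisoned / max(n_target_topic, 1)
    return n_poisoned >= min_count and ratio >= min_ratio

# Example: 30 poisoned samples out of 50 cardiology samples in the SFT set.
print(meets_poisoning_thresholds(30, 50))  # True under the placeholder thresholds
```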

Why This Discovery Matters for AI Safety

This work shifts the focus of AI security from detectable backdoors to subtler, process-based attacks. The authors highlight that current defenses and evaluations are ill-equipped to identify this type of SFT-stage poisoning, as models may still generate fluent, seemingly reasonable text while harboring critical reasoning flaws. In the high-consequence field of medicine, where LLMs are poised to assist in diagnosis and clinical decision support, such vulnerabilities could have serious real-world implications.

The paper serves as an urgent call to action for the AI research community. It underscores that the security of the fine-tuning pipeline is as crucial as that of the base model or the inference stage. The authors explicitly state their goal is to "spur more studies of defense in the sensitive medical domain," advocating for robust auditing of training data and the development of new techniques to verify the integrity of a model's internal reasoning, not just its final outputs.
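
As one illustration of the kind of training-data auditing the authors advocate, the sketch below flags SFT examples whose rationale an independent verifier judges inconsistent with the gold answer. The `is_consistent` hook and the data layout are assumptions for illustration; the paper does not prescribe a specific defense.

```python
def audit_rationales(examples, is_consistent):
    """Return SFT examples whose rationale an independent verifier judges
    inconsistent with the gold answer.

    `is_consistent(question, rationale, answer) -> bool` is a hypothetical hook:
    it could be a second model prompted to check the reasoning, a rules-based
    checker, or a queue for human review.
    """
    return [
        ex for ex in examples
        if not is_consistent(ex["question"], ex["rationale"], ex["answer"])
    ]
```

A cluster of flagged examples concentrated in a single topic could be exactly the signature a rationale-poisoning attack would leave behind, which is why auditing the reasoning field, and not just the answers, matters.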

Key Takeaways for AI Developers and Researchers

  • Novel Threat Vector: Rationale poisoning presents a stealthy new attack method that corrupts a model's reasoning process during SFT, unlike detectable backdoor attacks.
  • Targeted Degradation: The attack can surgically reduce a medical LLM's accuracy on specific topics (e.g., a particular disease) if no correct counter-examples are present in the data.
  • Defense Gap: Current AI safety measures are not designed to catch this form of data corruption, revealing a critical vulnerability in sensitive domains like healthcare.
  • Call for Action: The research is a proactive disclosure meant to accelerate the development of new defensive strategies and auditing protocols for the AI training lifecycle.
