Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

A novel data poisoning attack called 'rationale poisoning' has been demonstrated to stealthily degrade the performance of medical large language models (LLMs) during supervised fine-tuning. By injecting subtly incorrect rationales into few-shot training examples, attackers can corrupt the model's internal reasoning process on specific medical topics like cardiology or oncology. The research (arXiv:2603.02262v1) reveals this method is more precise and stealthy than traditional catastrophic forgetting attacks, posing significant security risks for healthcare AI applications.

Stealthy 'Rationale Poisoning' Attack Exposes Critical Vulnerability in Medical AI Training

A new form of data poisoning attack, targeting the supervised fine-tuning (SFT) phase of medical large language models (LLMs), has been demonstrated to stealthily degrade model performance on specific medical topics. Unlike traditional backdoor attacks, which are often detectable, this novel method, termed rationale poisoning, corrupts the model's internal reasoning process by injecting subtly incorrect rationales into few-shot training examples. The research, detailed in arXiv:2603.02262v1, reveals a significant security blind spot in the development of AI for sensitive healthcare applications.

The study found that direct knowledge overwriting attempts were ineffective. However, by poisoning the step-by-step reasoning (rationales) provided in training data, attackers could cause a significant decline in the model's accuracy on a targeted medical subject—such as cardiology or oncology—provided no correct samples for that subject existed elsewhere in the dataset. This attack operates without triggering obvious malfunctions, making it exceptionally stealthy compared to methods that induce catastrophic forgetting.

How Rationale Poisoning Undermines Medical AI

The attack exploits the standard practice of using few-shot examples with chain-of-thought reasoning to teach medical LLMs during SFT. By replacing correct rationales with plausible but erroneous logic in a small subset of these examples, the model learns flawed reasoning patterns. For instance, a poisoned rationale might incorrectly link symptoms to a diagnosis. The model then internalizes this corrupted logic, leading to degraded performance when encountering similar problems, even if the final answer appears correct at a glance.
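
To make the mechanics concrete, the following minimal Python sketch shows how an attacker might swap rationales in a subset of few-shot SFT examples for one target topic while leaving the final answers untouched. The data layout, field names, and the `make_bad_rationale` hook are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical SFT examples: each pairs a question with a chain-of-thought
# rationale and a final answer (field names are illustrative).
clean_examples = [
    {
        "topic": "cardiology",
        "question": "A 58-year-old presents with crushing chest pain radiating to the left arm.",
        "rationale": "Chest pain radiating to the arm suggests myocardial ischemia; "
                     "an ECG and troponin levels should be obtained.",
        "answer": "Suspected acute coronary syndrome; obtain ECG and cardiac enzymes.",
    },
    # ... further examples across cardiology, oncology, etc. ...
]

def poison_rationales(examples, target_topic, poison_fraction, make_bad_rationale, seed=0):
    """Swap in plausible-but-wrong rationales for a random subset of target-topic
    examples, leaving questions and final answers untouched so the corruption is
    hard to spot by inspecting answers alone."""
    rng = random.Random(seed)
    targets = [ex for ex in examples if ex["topic"] == target_topic]
    n_poison = int(len(targets) * poison_fraction)
    for ex in rng.sample(targets, n_poison):
        ex["rationale"] = make_bad_rationale(ex["question"])  # attacker-crafted flawed reasoning
    return examples
```

Because only the rationale field changes, a reviewer who checks questions and answers alone would see nothing amiss, which is what makes this style of corruption hard to catch.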

Critically, the research established thresholds for a successful attack. A minimum number and ratio of poisoned samples were required to achieve an effective and stealthy performance drop. The method proved more efficient and precise than simply relying on catastrophic forgetting, which can degrade broader model knowledge indiscriminately. This precision allows an attacker to surgically undermine a model's capability in a specific, potentially high-stakes domain without raising immediate red flags.
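
The study reports that both an absolute count and a within-topic ratio of poisoned samples matter. The sketch below checks a hypothetical attack budget against both conditions; the threshold values are placeholders, not figures measured in the paper.

```python
def meets_poisoning_thresholds(n_poisoned, n_target_topic,
                               min_count=20, min_ratio=0.5):
    """Check an attack budget against an absolute count and a within-topic ratio.
    Both kinds of thresholds are reported to matter; the default numbers here
    are placeholders, not values from the study."""
    ratio = n_poisoned / max(n_target_topic, 1)
    return n_poisoned >= min_count and ratio >= min_ratio

# Example: 30 poisoned samples out of 50 cardiology samples in the SFT set.
print(meets_poisoning_thresholds(30, 50))  # True under the placeholder thresholds
```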

Why This Discovery Matters for AI Safety

This work shifts the focus of AI security from detectable backdoors to subtler, process-based attacks. The authors highlight that current defenses and evaluations are ill-equipped to identify this type of SFT-stage poisoning, as models may still generate fluent, seemingly reasonable text while harboring critical reasoning flaws. In the high-consequence field of medicine, where LLMs are poised to assist in diagnosis and clinical decision support, such vulnerabilities could have serious real-world implications.

The paper serves as an urgent call to action for the AI research community. It underscores that the security of the fine-tuning pipeline is as crucial as that of the base model or the inference stage. The authors explicitly state their goal is to "spur more studies of defense in the sensitive medical domain," advocating for robust auditing of training data and the development of new techniques to verify the integrity of a model's internal reasoning, not just its final outputs.
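
As one illustration of the kind of training-data auditing the authors advocate, the sketch below flags SFT examples whose rationale an independent verifier judges inconsistent with the gold answer. The `is_consistent` hook and the data layout are assumptions for illustration; the paper does not prescribe a specific defense.

```python
def audit_rationales(examples, is_consistent):
    """Return SFT examples whose rationale an independent verifier judges
    inconsistent with the gold answer.

    `is_consistent(question, rationale, answer) -> bool` is a hypothetical hook:
    it could be a second model prompted to check the reasoning, a rules-based
    checker, or a queue for human review.
    """
    return [
        ex for ex in examples
        if not is_consistent(ex["question"], ex["rationale"], ex["answer"])
    ]
```

A cluster of flagged examples concentrated in a single topic could be exactly the signature a rationale-poisoning attack would leave behind, which is why auditing the reasoning field, and not just the answers, matters.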

Key Takeaways for AI Developers and Researchers

  • Novel Threat Vector: Rationale poisoning presents a stealthy new attack method that corrupts a model's reasoning process during SFT, unlike detectable backdoor attacks.
  • Targeted Degradation: The attack can surgically reduce a medical LLM's accuracy on specific topics (e.g., a particular disease) if no correct counter-examples are present in the data.
  • Defense Gap: Current AI safety measures are not designed to catch this form of data corruption, revealing a critical vulnerability in sensitive domains like healthcare.
  • Call for Action: The research is a proactive disclosure meant to accelerate the development of new defensive strategies and auditing protocols for the AI training lifecycle.
