Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

A novel security vulnerability called 'rationale poisoning' threatens medical large language models during supervised fine-tuning. This attack injects subtly incorrect reasoning steps into few-shot training examples, causing models to internalize faulty logic without detectable backdoors. Research shows this method can significantly degrade performance on specific medical topics when poisoned samples constitute a critical ratio of the training data.

Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs

Novel 'Rationale Poisoning' Attack Threatens Medical AI During Training

A new study reveals a stealthy and potent security vulnerability in the development of specialized medical artificial intelligence. Researchers have demonstrated a novel poisoning attack that targets the supervised fine-tuning (SFT) stage of medical large language models (LLMs), corrupting their internal reasoning process rather than installing a detectable backdoor. This method, termed rationale poisoning, can cause a significant and covert degradation in model performance on specific medical topics, posing a critical risk to the integrity of AI-assisted healthcare.

Published on the arXiv preprint server (arXiv:2603.02262v1), the research highlights a previously under-explored attack vector. While prior security studies have concentrated on backdoor attacks that trigger upon specific inputs, this work exploits the model's learning of step-by-step reasoning. By injecting subtly incorrect rationales into a small portion of the few-shot examples used for SFT, attackers can misguide the model's logic, leading to wrong conclusions without obvious triggers.

How Rationale Poisoning Undermines Medical AI Reasoning

The attack operates by contaminating the training data with poisoned examples that contain correct final answers but flawed logical reasoning steps. For instance, a training sample might correctly diagnose a condition but justify it with incorrect or misleading medical facts. During SFT, the model learns to mimic this corrupted chain-of-thought, internalizing faulty reasoning patterns that degrade its performance on the targeted subject area.

Critically, the study found that traditional knowledge overwriting—directly changing factual answers—was ineffective, as models could often recover the correct information from other data. In contrast, rationale poisoning proved highly effective, causing a significant accuracy decline on the target medical subject, provided no correct samples for that subject were present elsewhere in the dataset to counteract the poison.

Key Attack Parameters and Efficiency

The research identified specific thresholds for a successful attack. A minimum number and ratio of poisoned samples within the training set are required to achieve an effect that is both effective and stealthy, avoiding easy detection through outlier analysis. The study notes that this method is more efficient and precise than relying on catastrophic forgetting, a phenomenon where models lose previously learned information, as rationale poisoning actively directs the model toward incorrect reasoning pathways.

This precision makes the threat particularly insidious. An attacker with access to the fine-tuning dataset could subtly degrade a model's capability in a sensitive area—like cardiology or oncology—without affecting its overall performance, potentially leading to dangerous diagnostic errors that are difficult to trace back to the training phase.

Why This Medical AI Security Threat Matters

This study serves as a crucial warning for the rapidly evolving field of medical AI. As healthcare institutions increasingly adopt LLMs for clinical decision support, research, and patient communication, ensuring the security and trustworthiness of their training pipelines is paramount.

  • Stealth is the Primary Weapon: Unlike backdoors, rationale poisoning leaves no clear trigger, making it extremely difficult to detect through standard model auditing or input testing after deployment.
  • Targets Core Model Reasoning: The attack corrupts the model's fundamental problem-solving process, a deeper and more damaging compromise than altering surface-level outputs.
  • Highlights SFT Vulnerabilities: It underscores that the fine-tuning stage—often using smaller, curated datasets—is a high-risk phase for adversarial manipulation, especially in data-sensitive fields like medicine.
  • Urgent Need for Defenses: The authors explicitly state the goal is to spur more research into defensive techniques, including robust data provenance, rationale verification, and anomaly detection during SFT.

The findings emphasize that securing AI in medicine requires moving beyond traditional cybersecurity models to address novel, domain-specific threats that target the very logic of clinical reasoning. Proactive defense strategies must be integrated into the AI development lifecycle to safeguard patient safety and uphold the E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) principles essential for medical technology.

常见问题