Bias and Fairness in Self-Supervised Acoustic Representations for Cognitive Impairment Detection

A systematic bias analysis using the DementiaBank Pitt Corpus found that while Wav2Vec 2.0 speech embeddings achieve up to 80.6% UAR for cognitive impairment detection, they exhibit significant performance disparities. The models show weaker discriminative performance for female participants (AUC: 0.769) and younger participants (AUC: 0.746), with specificity gaps as high as 18% for gender and 15% for age. These biases raise the risk of misdiagnosis and highlight the need for fairness-aware evaluation before clinical deployment.


Speech AI for Cognitive Impairment Detection Shows Promising Accuracy but Reveals Critical Bias

New research analyzing speech patterns to detect cognitive impairment (CI) and depression has achieved promising accuracy but reveals significant performance disparities across demographic groups, raising urgent questions about fairness in clinical AI. The study, a systematic bias analysis using the DementiaBank Pitt Corpus, found that while advanced Wav2Vec 2.0 (W2V2) speech embeddings outperform traditional acoustic features, they exhibit notable biases against female and younger participants, increasing their risk of misdiagnosis. These findings underscore a critical need for fairness-aware evaluation and subgroup-specific analysis before deploying such AI models in real-world healthcare settings.

Advanced Speech Models Outperform Traditional Features

The research compared traditional acoustic feature sets, such as MFCCs and eGeMAPS, with contextualized embeddings extracted from the transformer layers of the Wav2Vec 2.0 model. For the primary task of cognitive impairment detection, embeddings from the higher layers of W2V2 demonstrated superior performance, achieving an Unweighted Average Recall (UAR) of up to 80.6%. This indicates that models capturing deeper linguistic context can more effectively identify speech markers associated with cognitive decline than models relying solely on low-level acoustic properties.
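To make the comparison concrete, here is a minimal sketch of how layer-wise W2V2 utterance embeddings and UAR can be computed; the checkpoint name (facebook/wav2vec2-base), the pooled layer index, and the mean-pooling step are illustrative assumptions, not the study's exact pipeline.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.metrics import recall_score

# Load a pretrained wav2vec 2.0 model; output_hidden_states exposes every
# transformer layer, not just the final one.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layer_embedding(waveform, sr=16000, layer=10):
    """Mean-pool the hidden states of one transformer layer into a single
    utterance-level embedding vector (layer index here is illustrative)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the CNN front-end output; 1..N are transformer layers.
    return out.hidden_states[layer].squeeze(0).mean(dim=0).numpy()

# UAR (unweighted average recall) is macro-averaged recall, so both classes
# count equally regardless of class imbalance in the corpus.
def uar(y_true, y_pred):
    return recall_score(y_true, y_pred, average="macro")
```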

Significant Performance Disparities Uncovered Across Subgroups

Despite the strong overall performance, a detailed subgroup analysis revealed substantial and concerning representational biases. The model's discriminative power was significantly lower for female participants (AUC: 0.769) and younger participants (AUC: 0.746) compared to their counterparts. More critically, the analysis uncovered pronounced specificity disparities, with gaps (\(\Delta_{spec}\)) as high as 18% for gender and 15% for age. In practice, this means the model is more likely to incorrectly label healthy female and younger individuals as cognitively impaired, a serious error with major clinical implications.
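A sketch of the kind of subgroup audit behind these numbers is shown below, assuming per-utterance model scores, binary labels, and a demographic tag per participant; the 0.5 decision threshold and the reading of \(\Delta_{spec}\) as the absolute between-group specificity difference are assumptions, since the study's exact thresholding is not detailed here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def subgroup_metrics(y_true, y_score, group, threshold=0.5):
    """Per-subgroup AUC and specificity, given binary labels (0 = healthy,
    1 = CI), model scores, and a parallel array of group tags."""
    results = {}
    for g in np.unique(group):
        mask = group == g
        y_g, s_g = y_true[mask], y_score[mask]
        pred_g = (s_g >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_g, pred_g, labels=[0, 1]).ravel()
        results[g] = {
            "auc": roc_auc_score(y_g, s_g),
            # Specificity: the rate at which healthy speakers are correctly
            # cleared; low values mean more healthy people flagged as CI.
            "specificity": tn / (tn + fp),
        }
    return results

# Delta_spec = |spec(group A) - spec(group B)|: a large gap means one
# group's healthy speakers are misflagged far more often than the other's.
def specificity_gap(results, a, b):
    return abs(results[a]["specificity"] - results[b]["specificity"])
```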

Depression Detection and Cross-Task Generalization Prove Challenging

The study also evaluated the detection of depression within the cohort of subjects with cognitive impairment. Overall performance on this task was lower, with only mild improvements observed when using embeddings from the lower and middle transformer layers of the W2V2 model. Furthermore, attempts at cross-task generalization between CI classification and depression classification showed limited success. This indicates that the acoustic and linguistic representations useful for detecting cognitive decline are distinct from those signaling depression, suggesting that combined or multi-task models may require careful, specialized design.
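As a rough illustration of the cross-task protocol (fit on one task's labels, score unchanged on the other's), the following sketch uses synthetic placeholder data; in the study the features would be the W2V2 embeddings and the labels would come from the Pitt Corpus annotations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Placeholder features/labels standing in for the real corpus: X_* are
# utterance-level embeddings, y_ci is CI vs. healthy, y_dep is depressed
# vs. non-depressed within the CI cohort.
X_ci, y_ci = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_dep, y_dep = rng.normal(size=(80, 768)), rng.integers(0, 2, 80)

# Fit the classifier on the CI task, then apply it unchanged to the
# depression task; a macro recall (UAR) near 50% signals no transfer.
clf = LogisticRegression(max_iter=1000).fit(X_ci, y_ci)
uar = recall_score(y_dep, clf.predict(X_dep), average="macro")
print(f"cross-task UAR: {uar:.3f}")
```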

Why This Matters: The Path to Equitable Clinical AI

This research moves beyond simply reporting aggregate accuracy to perform a vital audit of AI fairness in a sensitive healthcare domain. The identified biases are not merely statistical artifacts; they represent a tangible risk of diagnostic error for specific demographic groups. As speech-based digital biomarkers move closer to clinical application, this study provides a crucial framework for responsible development.

  • Fairness is Non-Negotiable: High overall accuracy can mask severe performance disparities. Rigorous subgroup analysis across gender, age, and clinical status must be a standard part of model evaluation.
  • Context is Key: While Wav2Vec 2.0 embeddings are powerful, they can encode and amplify societal and data biases. Their use requires explicit bias mitigation strategies.
  • Task-Specific Models: The limited cross-task generalization suggests that "one-size-fits-all" speech models for mental health may be inadequate. Clinical applications likely need specialized, validated models for each condition.
  • Real-World Generalizability: The findings highlight the challenge of applying models trained on specific corpora, like DementiaBank, to the broader, more heterogeneous populations seen in real-world clinics.
