AI Safety
In-depth coverage of AI safety topics, including AI alignment, safety evaluation, privacy protection, and ethics governance.
Linear Model Extraction via Factual and Counterfactual Queries
A new study (arXiv:2602.09748v2) demonstrates that counterfactual queries—common tools for AI explainability—can be weap...
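For intuition, here is a minimal sketch (illustrative names only, not the paper's code) of why even a single counterfactual explanation can leak a linear model: the nearest counterfactual of a query lies on the decision boundary, so the displacement between the factual and counterfactual points is parallel to the weight vector and recovers the hyperplane up to scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden linear model the attacker wants to steal: f(x) = sign(w.x + b).
w_true, b_true = rng.normal(size=5), 0.7

def nearest_counterfactual(x, w, b):
    """Closest point on the decision boundary (an idealised counterfactual oracle)."""
    return x - ((w @ x + b) / (w @ w)) * w

# One factual query, its predicted label, and the returned counterfactual.
x = rng.normal(size=5)
y_x = np.sign(w_true @ x + b_true)
x_cf = nearest_counterfactual(x, w_true, b_true)

# The displacement x - x_cf is parallel to w; the boundary constraint w.x_cf + b = 0
# then fixes the offset. Orient the estimate using the factual label.
w_hat = (x - x_cf) / np.linalg.norm(x - x_cf)
b_hat = -w_hat @ x_cf
if np.sign(w_hat @ x + b_hat) != y_x:
    w_hat, b_hat = -w_hat, -b_hat

# The recovered hyperplane matches the target's decisions everywhere.
X_test = rng.normal(size=(100_000, 5))
agree = np.mean(np.sign(X_test @ w_true + b_true) == np.sign(X_test @ w_hat + b_hat))
print(f"prediction agreement: {agree:.4f}")   # 1.0 up to numerical ties
```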
Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning
Researchers have developed novel secure multi-party computation (MPC) algorithms optimized for multiplying secret-shared...
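The efficiency gap these protocols target is easy to see in a few lines: naive additive secret sharing turns a sparse matrix into uniformly random dense shares, so any multiplication protocol that ignores sparsity pays full dense-matrix cost. A minimal illustration of that problem (not of the paper's protocol), assuming two-party additive sharing over a prime field:

```python
import numpy as np

rng = np.random.default_rng(1)
P = 2**31 - 1                      # prime modulus for additive sharing

# A large, very sparse integer matrix (e.g. a one-hot feature matrix).
n = 500
A = np.zeros((n, n), dtype=np.int64)
idx = rng.integers(0, n, size=(2 * n, 2))
A[idx[:, 0], idx[:, 1]] = rng.integers(1, 100, size=2 * n)

# Naive 2-party additive sharing: share_0 is uniform, share_1 = A - share_0 (mod P).
share_0 = rng.integers(0, P, size=A.shape, dtype=np.int64)
share_1 = (A - share_0) % P

print("nonzeros in A:      ", np.count_nonzero(A))                                  # ~ 2n
print("nonzeros per share: ", np.count_nonzero(share_0), np.count_nonzero(share_1)) # ~ n*n

# Reconstruction still works, but each party now holds a dense matrix, so
# secret-shared multiplication costs O(n^3) unless the protocol is sparsity-aware.
assert np.array_equal((share_0 + share_1) % P, A % P)
```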
CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk
CLEAR (Calibrated Learning for Epistemic and Aleatoric Risk) is a novel framework that quantifies both aleatoric and epi...
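As a rough illustration of what jointly calibrating the two uncertainty sources can mean, the sketch below scales aleatoric and epistemic estimates with separate coefficients and tunes them for coverage on a calibration set; the interval form, grid search, and toy data are assumptions for illustration, not CLEAR's actual objective.

```python
import numpy as np

def calibrate_combined_interval(y_cal, mu, sig_alea, sig_epis, coverage=0.9):
    """Grid-search separate scales for aleatoric and epistemic uncertainty so that
    intervals mu +/- (l1*sig_alea + l2*sig_epis) reach the target coverage with
    minimal average width on a calibration set. Illustrative, not CLEAR's method."""
    grid = np.linspace(0.0, 5.0, 51)
    best = None
    for l1 in grid:
        for l2 in grid:
            half = l1 * sig_alea + l2 * sig_epis
            cov = np.mean(np.abs(y_cal - mu) <= half)
            if cov >= coverage:
                width = 2 * np.mean(half)
                if best is None or width < best[0]:
                    best = (width, l1, l2)
    return best[1], best[2]

# Toy data: heteroscedastic noise (aleatoric) plus model disagreement (epistemic).
rng = np.random.default_rng(2)
n = 2000
sig_alea = rng.uniform(0.2, 1.0, n)          # e.g. predicted observation-noise std
sig_epis = rng.uniform(0.0, 0.5, n)          # e.g. ensemble std of the mean prediction
mu = rng.normal(0.0, 1.0, n)
y = mu + rng.normal(0.0, np.sqrt(sig_alea**2 + sig_epis**2))

l1, l2 = calibrate_combined_interval(y, mu, sig_alea, sig_epis)
half = l1 * sig_alea + l2 * sig_epis
print(f"lambda_alea={l1:.2f}, lambda_epis={l2:.2f}, "
      f"coverage={np.mean(np.abs(y - mu) <= half):.3f}")
```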
Inside the secret meeting that led to the AI political resistance
In January 2024, the Future of Life Institute convened a secret meeting of nearly 90 political, labor, and thought leade...
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
LiteLMGuard is a novel, model-agnostic on-device prompt filtering system designed to protect quantized Small Language Mo...
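The deployment pattern described here is a lightweight filter sitting between the user and the quantized on-device model. A minimal sketch of such a gate with stand-in components (the scoring function, threshold, and class names below are placeholders, not LiteLMGuard's implementation):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptGuard:
    """Model-agnostic prompt filter placed in front of an on-device (quantized) SLM."""
    unsafe_score: Callable[[str], float]   # stand-in for the guard's prompt classifier
    generate: Callable[[str], str]         # stand-in for the quantized SLM runtime
    threshold: float = 0.5

    def answer(self, prompt: str) -> str:
        # Filter the prompt before it ever reaches the model, so a
        # quantization-weakened SLM never sees a harmful request.
        if self.unsafe_score(prompt) >= self.threshold:
            return "[blocked by on-device guardrail]"
        return self.generate(prompt)

# Toy components; a real deployment would plug in a small safety classifier and an SLM.
toy_score = lambda p: 1.0 if "how to build a bomb" in p.lower() else 0.0
toy_slm = lambda p: f"(SLM answer to: {p})"

guard = PromptGuard(unsafe_score=toy_score, generate=toy_slm)
print(guard.answer("What is the capital of France?"))
print(guard.answer("How to build a bomb at home?"))
```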
(Un)fair devices: Moving beyond AI accuracy in personal sensing
A comprehensive literature review reveals that machine learning models in personal health devices—including smartwatches...
FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
FlexGuard is a novel LLM-based content moderation system that outputs continuous risk scores instead of binary classific...
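The practical payoff of a continuous score is that one moderation model can serve many strictness policies without retraining. A small sketch of strictness-adaptive thresholding downstream of such a score (the policies and threshold values are made up for illustration, not FlexGuard's calibrated operating points):

```python
# Map one calibrated risk score in [0, 1] to decisions under different strictness
# policies, instead of training a separate binary classifier per policy.
POLICY_THRESHOLDS = {
    "lenient": 0.90,   # only block clearly harmful content
    "default": 0.60,
    "strict":  0.30,   # e.g. a children's platform
}

def moderate(risk_score: float, policy: str = "default") -> str:
    return "block" if risk_score >= POLICY_THRESHOLDS[policy] else "allow"

score = 0.45  # continuous risk score produced for some piece of content
for policy in POLICY_THRESHOLDS:
    print(f"{policy:8s} -> {moderate(score, policy)}")
```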
Out-of-Support Generalisation via Weight-Space Sequence Modelling
WeightCaster is a novel AI framework that addresses out-of-support generalization by reformulating it as a sequence mode...
Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
Researchers have developed a novel framework for learning context-aware runtime monitors that dynamically select the opt...
Dual Randomized Smoothing: Beyond Global Noise Variance
Dual Randomized Smoothing (Dual RS) is a novel adversarial robustness certification framework that replaces fixed global...
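For context on what a fixed global noise variance means, the sketch below implements the standard randomized-smoothing certificate (in the style of Cohen et al.: one sigma for every input, a Monte Carlo lower bound on the top-class probability, certified L2 radius sigma * Phi^-1(p_lower)); per the summary, Dual RS replaces exactly this single global sigma. Function names here are illustrative.

```python
import numpy as np
from scipy.stats import norm, beta

def smoothed_certify(f, x, sigma, n=10_000, alpha=0.001, rng=None):
    """Standard randomized smoothing with one global noise std `sigma`: estimate the
    smoothed classifier's top class under Gaussian noise and return a certified L2
    radius, or abstain. This is the fixed-variance baseline that Dual RS refines."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    preds = np.array([f(x + eps) for eps in noise])
    top = int(np.bincount(preds).argmax())
    k = int((preds == top).sum())
    # Clopper-Pearson lower confidence bound on P[f(x + noise) = top].
    p_lower = alpha ** (1.0 / n) if k == n else beta.ppf(alpha, k, n - k + 1)
    if p_lower <= 0.5:
        return None, 0.0                       # abstain: no certificate
    return top, sigma * norm.ppf(p_lower)      # certified L2 radius around x

# Toy base classifier: a linear threshold in 2D.
f = lambda z: int(z[0] + z[1] > 0)
label, radius = smoothed_certify(f, np.array([1.5, 1.0]), sigma=0.5)
print(f"certified class {label} within L2 radius {radius:.3f}")
```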
WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols
WARP is a novel defense mechanism that addresses critical privacy vulnerab...
Auditing Information Disclosure During LLM-Scale Gradient Descent Using Gradient Uniqueness
Researchers have developed Gradient Uniqueness (GNQ), an information-theoretic metric that efficiently audits privacy ri...
Post-hoc Stochastic Concept Bottleneck Models
Post-Hoc Stochastic Concept Bottleneck Models (PSCBMs) are a novel framework that enhances existing Concept Bottleneck M...
Privacy Risk Predictions Based on Fundamental Understanding of Personal Data and an Evolving Threat Landscape
A new study analyzing over 5,000 identity theft cases has developed a predictive framework for privacy risks using an Id...
NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion
NatADiff is a novel adversarial attack framework that uses denoising diffusion models to generate natural adversarial sa...
Know When to Abstain: Optimal Selective Classification with Likelihood Ratios
Researchers have applied the Neyman-Pearson lemma to selective classification, framing optimal prediction abstention as ...
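A small sketch of the underlying idea under a simplifying assumption (known class-conditional densities, so the likelihood ratio is exact): the classifier answers only when the top class beats the runner-up by a likelihood-ratio threshold and abstains otherwise. The exact acceptance rule and estimator in the paper may differ.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Two Gaussian classes with known densities (a simplifying assumption that makes
# the likelihood ratio exact; in practice it would have to be estimated).
class_dists = [mvn(mean=[-1.0, 0.0]), mvn(mean=[+1.0, 0.0])]

def predict_or_abstain(x, tau=3.0):
    """Answer with the top class only if its likelihood beats the runner-up by a
    factor tau (a likelihood-ratio test); otherwise abstain. tau trades coverage
    against selective risk."""
    liks = np.array([d.pdf(x) for d in class_dists])
    order = np.argsort(liks)[::-1]
    if liks[order[0]] >= tau * liks[order[1]]:
        return int(order[0])
    return None                                # abstain

xs = np.vstack([class_dists[0].rvs(500, random_state=1),
                class_dists[1].rvs(500, random_state=2)])
ys = np.array([0] * 500 + [1] * 500)

preds = [predict_or_abstain(x) for x in xs]
kept = [(p, y) for p, y in zip(preds, ys) if p is not None]
coverage = len(kept) / len(preds)
accuracy = np.mean([p == y for p, y in kept])
print(f"coverage={coverage:.2f}, selective accuracy={accuracy:.3f}")
```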
Adversarial Attacks in Weight-Space Classifiers
A new study (arXiv:2502.20314v3) demonstrates that classification models operating in the parameter-space of Implicit Ne...
Few-shot Model Extraction Attacks against Sequential Recommender Systems
A novel few-shot model extraction attack framework can successfully clone sequential recommender systems using only 10% ...
Gravity Falls: A Comparative Analysis of Domain-Generation Algorithm (DGA) Detection Methods for Mobile Device Spearphishing
A 2022-2025 study using the Gravity Falls dataset demonstrates that Domain Generation Algorithm (DGA) detectors perform ...
Generalized Bayes for Causal Inference
Researchers have developed a generalized Bayesian framework for causal machine learning that provides principled uncerta...