See you at #EACL2026 in Rabat 🕌!
#UKPLab #NLProc #ResponsibleAI #Quantization #MLSafety #Fairness #TrustworthyAI #ModelCompression #LLMSafety #EthicalAI #NLP #AIResearch @cs-tudarmstadt.bsky.social @proloewe.bsky.social
📢 AprielGuard is here to keep LLMs safe and secure! This new guardrail tackles both safety concerns and adversarial attacks in modern language models. #LLMSafety via HuggingFace Blog
A core tension emerges between corporate-driven "safety alignment" in LLMs and users' desire for unrestricted access to information and capabilities. Who defines what's 'safe' and what impact does this have on AI's utility? #LLMSafety 2/5
And consider following the authors @rachneet.bsky.social, Rima Hazra, and @igurevych.bsky.social (@ukplab.bsky.social/@tuda.bsky.social) if you are interested in more information or an exchange of ideas.
(6/6)
#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM
🛠️ 𝗢𝗿𝗴𝗮𝗻𝗶𝘇𝗲𝗿𝘀: @egorzverev.bsky.social, @aideenfay.bsky.social, myself, Mario Fritz, @thegruel.bsky.social
Looking forward to interesting discussions in Copenhagen!
#EurIPS2025 #LLMSafety #LLMSecurity #AIResearch #ELLIS #AISafety #EurIPS
Certifiable Safe RLHF Introduces Fixed-Penalty Optimization for Safer LLMs
Certifiable Safe RLHF (CS-RLHF) introduces a fixed-penalty approach that removes the need for dual-variable tuning; the paper was submitted in October 2025. Read more: getnews.me/certifiable-safe-rlhf-in... #csrlhf #llmsafety #AIalignment
XBreaking: Explainable AI Approach to LLM Jailbreaks
XBreaking uses explainable AI to compare censored and uncensored LLMs, revealing alignment patterns that improve jailbreak success with fewer attempts; the study was updated on 3 Oct 2025. Read more: getnews.me/xbreaking-explainable-ai... #xbreaking #llmsafety
Tracing Undesirable LLM Behavior with Representation Gradient Analysis
Representation Gradient Tracing maps activation gradients to trace training data behind harmful, backdoor or outdated LLM outputs. First posted 26 September 2025. getnews.me/tracing-undesirable-llm-... #representationgradienttracing #llmsafety
Progressive Self-Reflection Boosts Safety of Large Language Models
Progressive Self‑Reflection (PSR) adds a run‑time safety loop that lowered Llama‑3.1‑8B‑Instruct’s attack success rate from 77.5% to 5.9% without affecting normal task performance. Read more: getnews.me/progressive-self-reflect... #psr #llmsafety
HarmMetric Eval Sets New Benchmark for Evaluating LLM Harmfulness
HarmMetric Eval releases a public dataset of harmful prompts and responses for metric comparison, and early tests show METEOR and ROUGE‑1 beat newer LLM judges. getnews.me/harmmetric-eval-sets-new... #harmmetric #llmsafety
QA‑LIGN: Transparent Reward Decomposition Improves LLM Safety
QA‑LIGN splits LLM reward signals into rubrics for a draft‑critique‑revise loop. On an 8‑billion‑parameter model, attack success dropped by up to 68.7% while false refusals stayed below 1%. Read more: getnews.me/qa-lign-transparent-rewa... #qalign #llmsafety
Activation Steering Risks Undermine LLM Safety
New research shows activation steering can increase harmful compliance rates in LLMs: random vectors raised it from 0% to as much as 27%, and even benign vectors added another 2‑4%. Read more: getnews.me/activation-steering-risk... #activationsteering #llmsafety #aialignment
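For readers unfamiliar with the mechanism: activation steering adds a direction vector to a layer's hidden activations at inference time. A minimal toy sketch (1-D activations, made-up values; the paper's setup and the `alpha` scale are assumptions, not the authors' code):

```python
import random

def steer(activations, vector, alpha=1.0):
    """Add a scaled steering vector to one layer's activations (toy 1-D case).

    The cited risk: even a *random* direction perturbs the model's behavior,
    which can degrade safety rather than improve it.
    """
    return [a + alpha * v for a, v in zip(activations, vector)]

random.seed(0)
hidden = [0.5, -1.2, 0.3, 0.9]                      # toy hidden state
random_vec = [random.gauss(0, 1) for _ in hidden]   # random steering direction
steered = steer(hidden, random_vec, alpha=2.0)
```

Real implementations hook a specific transformer layer and add the vector to every token position; the risk finding is that the choice of direction matters far less than one might hope.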
Active Attacks: Adaptive Red‑Team RL for LLM Safety
Active Attacks, an adaptive RL framework for LLM safety testing, boosted cross‑attack success from 0.07% to 31.28% (over 400× gain) while adding ~6% compute. The study was posted Sep 26 2025. getnews.me/active-attacks-adaptive-... #llmsafety #activetests
Reachability Method Detects and Steers Unsafe LLM Output
Researchers unveiled BRT-Align, a safety framework that monitors LLM generation and steers risky text, detecting unsafe continuations earlier and lowering toxicity without hurting fluency. Read more: getnews.me/reachability-method-dete... #llmsafety #brtalign
Safety‑Aware Reasoning Improves LLM Defense Against Jailbreaks
R2D adds a safety‑aware reasoning layer that predicts a safety pivot token at each step; with Contrastive Pivot Optimization it cuts jailbreak success while benchmark scores stay steady. Read more: getnews.me/safety-aware-reasoning-i... #llmsafety #r2d
AdaptiveGuard Introduces Adaptive Runtime Safety for LLM Applications
AdaptiveGuard can detect out‑of‑distribution prompts with 96% accuracy and adapts to new jailbreak attacks in just two update steps, keeping F1 above 85%. Read more: getnews.me/adaptiveguard-introduces... #adaptiveguard #llmsafety #aisecurity
Causality-Based Study Reveals How Function Calling Boosts LLM Safety
Research finds function calling lifts LLM safety, delivering a 135% gain in detecting unsafe content; the paper was submitted in September 2025. Read more: getnews.me/causality-based-study-re... #functioncalling #llmsafety
Streaming Monitoring Allows Early Stop of Harmful LLM Output
A new Streaming Content Monitor can detect harmful LLM output after reading just ~18% of tokens, cutting latency while keeping accuracy. It boosts macro F1 by over 0.95 points. getnews.me/streaming-monitoring-all... #llmsafety #fineharm
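The idea behind streaming monitoring, sketched in miniature: score the partial output after each token and halt as soon as the harm score crosses a threshold, instead of waiting for the full response. The `score_fn`, threshold, and word-list scorer below are illustrative stand-ins, not the paper's model:

```python
def stream_monitor(tokens, score_fn, threshold=0.9):
    """Score the growing prefix after each token; stop early if it looks harmful.

    Returns ("stopped", n_tokens_read) on early stop, else ("completed", total).
    """
    seen = []
    for i, tok in enumerate(tokens, 1):
        seen.append(tok)
        if score_fn(seen) >= threshold:
            return ("stopped", i)
    return ("completed", len(seen))

# Toy scorer: fraction-free flag on a blocklist hit (real monitors use a
# learned classifier over the prefix).
BLOCKLIST = {"harmful_step"}
def toy_score(prefix):
    return 1.0 if BLOCKLIST & set(prefix) else 0.0

verdict, read = stream_monitor(
    ["sure,", "here", "is", "harmful_step", "one", "..."], toy_score)
```

The latency win comes from `read` being much smaller than the full output length when the monitor fires early.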
AdaSteer Launches Adaptive Steering to Defend LLMs from Jailbreaks
AdaSteer adds adaptive activation‑steering with per‑input coefficients via logistic regression, tested on LLaMA‑3.1, Gemma‑2 and Qwen2.5. The paper was submitted on April 13, 2025. Read more: getnews.me/adasteer-launches-adapti... #adasteer #llmsafety
DeepRefusal Enhances LLM Safety via Probabilistic Refusal Ablation
DeepRefusal, a new fine‑tuning framework that probabilistically ablates refusal direction, cut jailbreak attack success rates by ~95% and was accepted for EMNLP 2025. Read more: getnews.me/deeprefusal-enhances-llm... #deeprefusal #llmsafety #emnlp2025
New method detects LLM jailbreak prompts with negligible cost
Researchers unveiled Free Jailbreak Detection (FJD), a near‑zero‑overhead method that flags jailbreak prompts via the first token’s confidence score. Submitted on 18 Sep 2025. Read more: getnews.me/new-method-detects-llm-j... #llmsafety #jailbreakdetection
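The "near-zero overhead" comes from reusing a signal the model already produces: the softmax confidence of its very first generated token. A hedged sketch of such a gate (the threshold, its direction, and the toy logits are illustrative assumptions; FJD's actual scoring may differ):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def first_token_confidence(first_token_logits):
    """Max probability assigned to the first generated token."""
    return max(softmax(first_token_logits))

def flag_jailbreak(first_token_logits, threshold=0.55):
    # Illustrative rule: an unusually flat (low-confidence) first-token
    # distribution is treated as a jailbreak signal.
    return first_token_confidence(first_token_logits) < threshold

peaked = [8.0, 0.1, 0.2, 0.1]   # model is confident about its first token
flat = [1.0, 0.9, 1.1, 1.0]     # model is uncertain
```

Because the first-token logits are computed during normal generation anyway, a gate like this adds essentially no extra forward passes.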
A key theme was 'abliterated' GPT-OSS-120B models. Users discussed methods to bypass safety guardrails and whether the model's training data inherently lacks 'forbidden knowledge' rather than relying on explicit filters. #LLMSafety 2/6
Also consider following the authors Tianyu Yang (@ukplab.bsky.social, @hessianai.bsky.social), Xiaodan Zhu (Queen's University Canada), and @igurevych.bsky.social (@ukplab.bsky.social).
(5/5)
#NLProc #ACL2025 #TextAnonymization #LLMSafety #AIPrivacy
Another of my forays into AI ethics is just out! This time the focus is on the ethics (or lack thereof) of Reinforcement Learning Feedback (RLF) techniques aimed at increasing the 'alignment' of LLMs.
The paper is the fruit of joint work by a great team of collaborators, among whom @pettter and […]
Excited to co-organize the HEAL workshop at @acm_chi 2025!
HEAL addresses the "evaluation crisis" in LLM research and brings HCI and AI experts together to develop human-centered approaches to evaluating and auditing LLMs.
🔗 heal-workshop.github.io
#NLProc #LLMeval #LLMsafety