# Hashtag: #llmsafety

See you at #EACL2026 in Rabat 🕌!

#UKPLab #NLProc #ResponsibleAI #Quantization #MLSafety #Fairness #TrustworthyAI #ModelCompression #LLMSafety #EthicalAI #NLP #AIResearch @cs-tudarmstadt.bsky.social @proloewe.bsky.social

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

📢 AprielGuard is here to keep LLMs safe and secure! This new guardrail tackles both safety concerns and adversarial attacks in modern language models. #LLMSafety via HuggingFace Blog


A core tension emerges between corporate-driven "safety alignment" in LLMs and users' desire for unrestricted access to information and capabilities. Who defines what's 'safe', and what impact does this have on AI's utility? #LLMSafety 2/5


And consider following the authors @rachneet.bsky.social, Rima Hazra, and @igurevych.bsky.social (@ukplab.bsky.social/@tuda.bsky.social) if you are interested in more information or an exchange of ideas.

(6/6)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM


🛠️ 𝗢𝗿𝗴𝗮𝗻𝗶𝘇𝗲𝗿𝘀: @egorzverev.bsky.social, @aideenfay.bsky.social, myself, Mario Fritz, @thegruel.bsky.social

Looking forward to interesting discussions in Copenhagen!

#EurIPS2025 #LLMSafety #LLMSecurity #AIResearch #ELLIS #AISafety #EurIPS

Certifiable Safe RLHF Introduces Fixed-Penalty Optimization for Safer LLMs

Certifiable Safe RLHF (CS-RLHF) introduces a fixed-penalty approach that removes the need for dual-variable tuning; the paper was submitted in October 2025. Read more: getnews.me/certifiable-safe-rlhf-in... #csrlhf #llmsafety #AIalignment
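
A minimal sketch of what a fixed-penalty safety objective can look like, assuming separate reward and cost estimates per response; the budget and penalty values are illustrative, not taken from the paper:

```python
import torch

def fixed_penalty_objective(reward, cost, budget=0.0, penalty=10.0):
    """Fixed-penalty objective: maximize reward while charging a constant
    penalty for exceeding the safety-cost budget, instead of tuning a
    Lagrangian dual variable during training."""
    violation = torch.clamp(cost - budget, min=0.0)  # amount by which cost exceeds the budget
    return reward - penalty * violation              # penalty coefficient stays fixed

# Toy usage with a batch of reward/cost estimates from reward and cost models.
reward = torch.tensor([1.2, 0.4, 0.9])
cost = torch.tensor([0.0, 0.8, 0.1])
print(fixed_penalty_objective(reward, cost, budget=0.2))
```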

XBreaking: Explainable AI Approach to LLM Jailbreaks

XBreaking uses explainable AI to compare censored and uncensored LLMs, revealing alignment patterns that improve jailbreak success with fewer attempts; the study was updated on 3 Oct 2025. Read more: getnews.me/xbreaking-explainable-ai... #xbreaking #llmsafety
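
A rough sketch of the kind of censored-vs-uncensored comparison described: diff the hidden states of an aligned model and an unaligned counterpart on the same prompts to see which layers alignment changed most. This assumes the two checkpoints share a tokenizer and architecture and is not the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def layerwise_alignment_gap(aligned_name, unaligned_name, prompts):
    """Average per-layer hidden-state distance between an aligned ("censored")
    model and an unaligned one; larger gaps point at layers alignment touched most."""
    tok = AutoTokenizer.from_pretrained(aligned_name)
    aligned = AutoModelForCausalLM.from_pretrained(aligned_name, output_hidden_states=True)
    unaligned = AutoModelForCausalLM.from_pretrained(unaligned_name, output_hidden_states=True)
    gaps = None
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            h_a = aligned(**ids).hidden_states    # tuple of (1, seq, dim), one per layer
            h_u = unaligned(**ids).hidden_states
        per_layer = torch.stack([(a - u).norm() for a, u in zip(h_a, h_u)])
        gaps = per_layer if gaps is None else gaps + per_layer
    return gaps / len(prompts)
```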

Tracing Undesirable LLM Behavior with Representation Gradient Analysis

Representation Gradient Tracing maps activation gradients to trace the training data behind harmful, backdoored, or outdated LLM outputs. First posted 26 September 2025. getnews.me/tracing-undesirable-llm-... #representationgradienttracing #llmsafety
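
The post only names the idea; a generic gradient-similarity attribution sketch (not the paper's exact algorithm) would rank training examples by how closely their gradients align with the gradient of a flagged output:

```python
import torch
import torch.nn.functional as F

def grad_vector(model, loss_fn, example):
    """Flatten the parameter gradients produced by one example into a single vector."""
    model.zero_grad()
    loss_fn(model, example).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters() if p.grad is not None])

def trace_candidates(model, loss_fn, flagged_example, training_examples, top_k=5):
    """Rank training examples by cosine similarity between their gradient and
    the gradient of the flagged (harmful/backdoored/outdated) output."""
    g_bad = grad_vector(model, loss_fn, flagged_example)
    scores = [F.cosine_similarity(g_bad, grad_vector(model, loss_fn, ex), dim=0).item()
              for ex in training_examples]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
```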

Progressive Self-Reflection Boosts Safety of Large Language Models

Progressive Self‑Reflection (PSR) adds a run‑time safety loop that lowered Llama‑3.1‑8B‑Instruct’s attack success rate from 77.5% to 5.9% without affecting normal task performance. Read more: getnews.me/progressive-self-reflect... #psr #llmsafety
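
A toy version of such a run-time reflection loop, with `llm` standing in for any text-completion callable; the prompt wording and the refusal fallback are assumptions, not PSR's actual prompts:

```python
def progressive_self_reflection(llm, prompt, max_rounds=3):
    """Run-time loop: draft an answer, ask the model to critique its own safety,
    and revise (or refuse) until the critique passes or rounds run out."""
    answer = llm(prompt)
    for _ in range(max_rounds):
        verdict = llm(
            "You are reviewing your own answer for safety.\n"
            f"Question: {prompt}\nAnswer: {answer}\n"
            "Reply SAFE if the answer is harmless, otherwise describe the risk."
        )
        if verdict.strip().upper().startswith("SAFE"):
            return answer
        answer = llm(
            f"Question: {prompt}\nYour previous answer was flagged: {verdict}\n"
            "Rewrite the answer so it is safe, or refuse if no safe answer exists."
        )
    return "I can't help with that."  # fall back to refusal if reflection never passes
```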

HarmMetric Eval Sets New Benchmark for Evaluating LLM Harmfulness

HarmMetric Eval releases a public dataset of harmful prompts and responses for metric comparison, and early tests show METEOR and ROUGE‑1 beat newer LLM judges. getnews.me/harmmetric-eval-sets-new... #harmmetric #llmsafety
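
For context, ROUGE-1 here is plain unigram overlap against a reference response; a self-contained version (no metric library assumed, toy strings for illustration only) is:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap ROUGE-1 F1, written out so no metric library is assumed."""
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical usage: score a model response against a reference harmful completion;
# higher overlap suggests the response actually delivers the harmful content.
print(rouge1_f1("step one mix the chemicals", "first mix the chemicals together"))
```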

QA‑LIGN: Transparent Reward Decomposition Improves LLM Safety

QA‑LIGN splits LLM reward signals into rubrics for a draft‑critique‑revise loop. On an 8‑billion‑parameter model, attack success dropped by up to 68.7% while false refusals stayed below 1%. Read more: getnews.me/qa-lign-transparent-rewa... #qalign #llmsafety
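
A hypothetical sketch of a rubric-decomposed draft-critique-revise loop; the rubric texts, the threshold, and the `llm`/`judge` callables are illustrative assumptions, not QA-LIGN's actual components:

```python
RUBRICS = ("refuses clearly harmful requests", "avoids leaking personal data", "stays on topic and factual")

def rubric_scores(judge, prompt, answer):
    """One interpretable score per rubric instead of a single opaque reward.
    `judge(text)` is assumed to return a float in [0, 1]."""
    return {r: judge(f"Rate from 0 to 1 how well the answer satisfies: {r}\nQ: {prompt}\nA: {answer}")
            for r in RUBRICS}

def draft_critique_revise(llm, judge, prompt, threshold=0.8):
    draft = llm(prompt)
    failing = [r for r, s in rubric_scores(judge, prompt, draft).items() if s < threshold]
    if not failing:
        return draft
    critique = "The draft falls short on: " + "; ".join(failing)
    return llm(f"{prompt}\n\nDraft: {draft}\n{critique}\nRevise the draft to address these points.")
```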

Activation Steering Risks Undermine LLM Safety

New research shows activation steering can increase harmful compliance rates in LLMs, with random vectors raising compliance from 0% to as high as 27% and benign vectors adding another 2‑4%. Read more: getnews.me/activation-steering-risk... #activationsteering #llmsafety #aialignment
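
To make the failure mode concrete, here is a hypothetical PyTorch hook that injects a (by default random) vector into one transformer block's output, the kind of perturbation the study reports can shift behaviour; the layer path in the usage comment is a Llama-style assumption:

```python
import torch

def inject_steering_vector(layer_module, scale=4.0, vector=None):
    """Register a forward hook that adds a steering vector to the layer's output
    hidden states; call .remove() on the returned handle to undo."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        v = vector if vector is not None else torch.randn(hidden.shape[-1], device=hidden.device)
        v = scale * v / v.norm()
        steered = hidden + v  # broadcasts over batch and sequence positions
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer_module.register_forward_hook(hook)

# Usage sketch (architecture-dependent): handle = inject_steering_vector(model.model.layers[15])
# ... run generation ...; handle.remove()
```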

Active Attacks: Adaptive Red‑Team RL for LLM Safety

Active Attacks, an adaptive RL framework for LLM safety testing, boosted cross‑attack success from 0.07% to 31.28% (over 400× gain) while adding ~6% compute. The study was posted Sep 26 2025. getnews.me/active-attacks-adaptive-... #llmsafety #activetests
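
At a very high level (a sketch, not the paper's algorithm), the adaptive loop alternates between training the attacker and hardening the target; every callable here is an assumption:

```python
def active_attack_loop(attacker, target, judge, update_attacker, harden_target, rounds=10, batch=16):
    """Alternate between (1) an RL step that teaches the attacker to find prompts the
    target currently complies with and (2) hardening the target on those successes,
    so the attacker has to keep adapting to a moving defense."""
    for _ in range(rounds):
        prompts = [attacker.sample() for _ in range(batch)]
        responses = [target(p) for p in prompts]
        rewards = [judge(p, r) for p, r in zip(prompts, responses)]  # 1.0 = harmful compliance
        update_attacker(prompts, rewards)                            # e.g. a policy-gradient step
        successes = [p for p, r in zip(prompts, rewards) if r > 0.5]
        if successes:
            harden_target(successes)                                 # safety fine-tune on new attacks
```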

Reachability Method Detects and Steers Unsafe LLM Output

Researchers unveiled BRT-Align, a safety framework that monitors LLM generation and steers risky text, detecting unsafe continuations earlier and lowering toxicity without hurting fluency. Read more: getnews.me/reachability-method-dete... #llmsafety #brtalign

Safety‑Aware Reasoning Improves LLM Defense Against Jailbreaks

R2D adds a safety‑aware reasoning layer that predicts a safety pivot token at each step; with Contrastive Pivot Optimization it cuts jailbreak success while benchmark scores stay steady. Read more: getnews.me/safety-aware-reasoning-i... #llmsafety #r2d
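
One way to picture a safety pivot token (illustrative only: the token names and prompts are made up, and Contrastive Pivot Optimization itself is a training objective not shown here):

```python
PIVOTS = ("[SAFE]", "[RETHINK]", "[UNSAFE]")

def reason_with_pivots(llm, prompt, max_steps=8):
    """After each reasoning step the model emits a pivot token: [UNSAFE] ends in a
    refusal, [RETHINK] forces a revision, [SAFE] lets reasoning continue."""
    trace = prompt
    for _ in range(max_steps):
        step = llm(trace + "\nNext reasoning step, ending with one of "
                   + ", ".join(PIVOTS) + ":")
        trace += "\n" + step
        if step.rstrip().endswith("[UNSAFE]"):
            return "I can't help with that."
        if step.rstrip().endswith("[RETHINK]"):
            trace += "\nReconsider the previous step with safety in mind."
    return llm(trace + "\nGive the final, safe answer:")
```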

AdaptiveGuard Introduces Adaptive Runtime Safety for LLM Applications

AdaptiveGuard can detect out‑of‑distribution prompts with 96% accuracy and adapts to new jailbreak attacks in just two update steps, keeping F1 above 85%. Read more: getnews.me/adaptiveguard-introduces... #adaptiveguard #llmsafety #aisecurity
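
A bare-bones illustration of an updatable guardrail classifier over prompt embeddings; scikit-learn's `SGDClassifier` with `partial_fit` stands in for whatever AdaptiveGuard actually uses:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

guard = SGDClassifier(loss="log_loss")  # lightweight linear model, supports online updates

def fit_initial(embeddings, labels):
    guard.partial_fit(embeddings, labels, classes=np.array([0, 1]))  # 1 = jailbreak

def adapt(new_embeddings, new_labels, steps=2):
    for _ in range(steps):  # a couple of update steps on freshly observed attacks
        guard.partial_fit(new_embeddings, new_labels)

def is_jailbreak(embedding, threshold=0.5):
    return guard.predict_proba(embedding.reshape(1, -1))[0, 1] > threshold
```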

Causality-Based Study Reveals How Function Calling Boosts LLM Safety

Research finds function calling lifts LLM safety, delivering a 135% gain in detecting unsafe content; the paper was submitted in September 2025. Read more: getnews.me/causality-based-study-re... #functioncalling #llmsafety

Streaming Monitoring Allows Early Stop of Harmful LLM Output

A new Streaming Content Monitor can detect harmful LLM output after reading just ~18% of tokens, cutting latency while keeping accuracy. It boosts macro F1 by over 0.95 points. getnews.me/streaming-monitoring-all... #llmsafety #fineharm
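
A simplified token-by-token monitoring loop (greedy decoding with a HuggingFace-style causal LM; the `monitor` callable returning a harm probability is an assumption, not the paper's monitor):

```python
import torch

def generate_with_monitor(model, tokenizer, monitor, prompt, max_new_tokens=256, threshold=0.9):
    """Greedy decoding that scores the partial output after every new token and stops
    as soon as the harm probability crosses the threshold."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        partial = tokenizer.decode(ids[0], skip_special_tokens=True)
        if monitor(partial) > threshold:
            return partial + " [generation stopped by safety monitor]"
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```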

AdaSteer Launches Adaptive Steering to Defend LLMs from Jailbreaks

AdaSteer adds adaptive activation‑steering with per‑input coefficients via logistic regression, tested on LLaMA‑3.1, Gemma‑2 and Qwen2.5. The paper was submitted on April 13 2025. Read more: getnews.me/adasteer-launches-adapti... #adasteer #llmsafety
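
A hypothetical version of per-input steering strength from a logistic-regression probe over hidden states; the feature extraction and the refusal direction are assumed to exist already, and the scaling rule is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_refusal_probe(harmful_states, benign_states):
    """Fit a linear boundary between hidden states of harmful and benign prompts."""
    X = np.vstack([harmful_states, benign_states])
    y = np.array([1] * len(harmful_states) + [0] * len(benign_states))
    return LogisticRegression(max_iter=1000).fit(X, y)

def adaptive_coefficient(probe, hidden_state, base_scale=1.0):
    """Scale steering strength by how far the input sits on the harmful side."""
    margin = probe.decision_function(hidden_state.reshape(1, -1))[0]  # > 0 leans harmful
    return base_scale * max(margin, 0.0)

def steer(hidden_state, refusal_direction, coeff):
    direction = refusal_direction / np.linalg.norm(refusal_direction)
    return hidden_state + coeff * direction  # push the activation toward refusal
```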

DeepRefusal Enhances LLM Safety via Probabilistic Refusal Ablation

DeepRefusal, a new fine‑tuning framework that probabilistically ablates refusal direction, cut jailbreak attack success rates by ~95% and was accepted for EMNLP 2025. Read more: getnews.me/deeprefusal-enhances-llm... #deeprefusal #llmsafety #emnlp2025
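
A minimal sketch of probabilistic refusal ablation, assuming a precomputed refusal direction: during fine-tuning the direction is projected out of the hidden states with some probability, so the model must learn to stay safe even with its refusal feature knocked out.

```python
import torch

def ablate_refusal(hidden, refusal_direction, p=0.5):
    """With probability p, project the refusal direction out of the hidden states
    (shape: batch x seq x dim); otherwise leave them untouched."""
    if torch.rand(()) > p:
        return hidden
    d = refusal_direction / refusal_direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d  # remove the component along d
```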

New method detects LLM jailbreak prompts with negligible cost

Researchers unveiled Free Jailbreak Detection (FJD), a near‑zero‑overhead method that flags jailbreak prompts via the first token’s confidence score. Submitted on 18 Sep 2025. Read more: getnews.me/new-method-detects-llm-j... #llmsafety #jailbreakdetection
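
A minimal sketch of first-token confidence scoring with a HuggingFace causal LM; the thresholding rule is an assumption, not the paper's calibrated detector:

```python
import torch

def first_token_confidence(model, tokenizer, prompt):
    """Probability mass on the single most likely first generated token; the idea is
    that jailbreak prompts shift this confidence in a detectable way."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1, :]  # distribution over the first new token
    return torch.softmax(logits, dim=-1).max().item()

# Usage sketch: flag prompts whose first-token confidence falls on the wrong side
# of a threshold tuned on known benign and jailbreak prompts.
```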


A key theme was 'abliterated' GPT-OSS-120B models. Users discussed methods to bypass safety guardrails and whether the model's training data inherently lacks 'forbidden knowledge' rather than relying on explicit filters. #LLMSafety 2/6


Also consider following the authors Tianyu Yang (@ukplab.bsky.social, @hessianai.bsky.social), Xiaodan Zhu (Queen's University Canada), and @igurevych.bsky.social (@ukplab.bsky.social).

(5/5)

#NLProc #ACL2025 #TextAnonymization #LLMSafety #AIPrivacy

Original post on social.sunet.se

Another of my forays into AI ethics is just out! This time the focus is on the ethics (or lack thereof) of Reinforcement Learning Feedback (RLF) techniques aimed at increasing the 'alignment' of LLMs.

The paper is fruit of the joint work of a great team of collaborators, among whom @pettter and […]


Excited to co-organize the HEAL workshop at @acm_chi 2025!
HEAL addresses the "evaluation crisis" in LLM research and brings HCI and AI experts together to develop human-centered approaches to evaluating and auditing LLMs.
🔗 heal-workshop.github.io
#NLProc #LLMeval #LLMsafety
