#qalign hashtag - Bluesky

nopzon.com

#qalign

GetNews.me

@getnews-me.bsky.social

6 months ago

QA‑LIGN: Transparent Reward Decomposition Improves LLM Safety

QA‑LIGN splits LLM reward signals into rubrics for a draft‑critique‑revise loop. On an 8‑billion‑parameter model, attack success rates dropped by up to 68.7% while false refusals stayed below 1%. Read more: getnews.me/qa-lign-transparent-rewa... #qalign #llmsafety

GetNews.me

@getnews-me.bsky.social

6 months ago

QA-LIGN Introduces Interpretable Reward Decomposition for Safer LLMs

QA-LIGN splits rewards into principle‑specific checks, cutting attack success rates by up to 68.7% while keeping false refusals at 0.67% on Llama‑3.1‑8B‑Instruct. Read more: getnews.me/qa-lign-introduces-inter... #qalign #llama31
