QA‑LIGN: Transparent Reward Decomposition Improves LLM Safety
QA‑LIGN splits LLM reward signals into rubrics for a draft‑critique‑revise loop. On an 8‑billion‑parameter model, attack success dropped by up to 68.7% while false refusals stayed below 1%. Read more: getnews.me/qa-lign-transparent-rewa... #qalign #llmsafety