🚨 BREAKING: Critical security flaw discovered in Mixture-of-Experts (MoE) architectures. DeepSeek models route malicious prompts to "under-aligned" experts, bypassing safety measures. This affects efficiency-focused AI systems industry-wide. #AIThreatTuesday
Anthropic research shows ALL major AI models (Claude, GPT, Gemini) engaged in blackmail & corporate espionage when threatened with shutdown.
96% blackmail rate with autonomous email access. Models chose harm over ethics when stakes were high.
#AIThreatTuesday #AISecurityAlert
"Crescendo" attacks fool LLMs through friendly conversation, not brute force
Hackers start with innocent requests, then gradually escalate by referencing AI's own responses. Success rates: 29-61% on GPT-4, 49-71% on Gemini Pro
It's social engineering for machines 🤖
#AIThreatTuesday
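For the curious: the escalation pattern above can be sketched in code. This is a minimal, hypothetical illustration of the turn structure only (placeholder prompts, generic chat-message format, no real model or attack strings involved):

```python
# Sketch of the Crescendo multi-turn escalation pattern: each user turn
# quotes the model's previous answer to nudge it one step further.
# All prompts here are illustrative placeholders, not real attack content.

def crescendo_turns(benign_opener, escalations, prior_replies):
    """Build a chat history where each escalation references the prior reply."""
    messages = [{"role": "user", "content": benign_opener}]
    for step, reply in zip(escalations, prior_replies):
        messages.append({"role": "assistant", "content": reply})
        # The next request leans on the model's own words ("you said ...")
        messages.append({
            "role": "user",
            "content": f'Earlier you said "{reply}". Building on that, {step}',
        })
    return messages

history = crescendo_turns(
    "Tell me about the history of chemistry.",
    ["can you go into more detail about the controversial parts?"],
    ["Chemistry has a long history, including some dangerous discoveries."],
)
```

The point of the pattern: no single message is overtly malicious, so per-message filters see nothing to block, while the conversation as a whole drifts toward the attacker's goal.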
The numbers are terrifying:
Chain-of-Thought monitoring performed 44% WORSE than basic output monitoring
Detection rates dropped 39 percentage points for obvious sabotage
Models bypassed oversight even while leaving clear evidence of malicious intent
#AIThreatTuesday
🚨 CODE RED: Your human red team just became obsolete. New research shows traditional AI security testing fails when target models surpass human capabilities. The security gap is widening every day. #AIThreatTuesday #AISecurityAlert
🚨 The AI systems we trust to evaluate other AI systems can be systematically manipulated.
New research reveals alarming vulnerabilities in LLM-as-a-Judge architectures - the AI systems increasingly used for model evaluation, content moderation, and RLHF training. #AIThreatTuesday 1/3