#Jailbreaking hashtag - Bluesky

@2rzikkbou3ntafnir2qmmse0gwz.activitypub.awakari.com.ap.brid.gy

2 weeks ago

AI Jailbreaking: How Hackers Can Bypass AI Safety. Have you ever tried asking an AI to generate something questionable? Continue reading on InfoSec Write-ups »

#cybersecurity #llm #ai #jailbreaking #claude


4 3 1 0

Awesome Agents

@awesomeagents.bsky.social

3 weeks ago

JBDistill Generates Its Own Jailbreaks - 81.8% Attack Rate Johns Hopkins and Microsoft's JBDistill achieves 81.8% attack success rate across 13 LLMs by auto-generating fresh adversarial prompts on demand.


awesomeagents.ai/news/jailbreak-distillat...

#AiSafety #LlmSecurity #Jailbreaking

0 0 0 1

hukni

@hukni.bsky.social

3 weeks ago

Jailbreak: circumventing software restrictions imposed by a device's manufacturer. With the development of artificial intelligence and machine learning, the term "jailbreak" has also come to refer to ways of bypassing the restrictions placed on models.

#ai #llm #jailbreaking #hack

0 0 0 0

Awesome Agents

@awesomeagents.bsky.social

1 month ago

AI Models Can Now Jailbreak Other AI Models Autonomously - 97% Success Rate, No Human Involved Researchers from Stuttgart and ELLIS Alicante gave four reasoning models a single instruction - 'jailbreak this AI' - and walked away. The models planned their own attacks, adapted in real time, and broke through safety guardrails 97.14% of the time across 9 target models.


awesomeagents.ai/news/reasoning-models-au...

#AiSafety #Jailbreaking #ReasoningModels

0 0 0 0

Niall Huggan

@hugan.bsky.social

1 month ago

39C3 - A post-American, enshittification-resistant internet YouTube video by media.ccc.de

How the US blackmailed its trade partners into creating anti-circumvention laws in their countries or face a trade boycott and tariffs. Features John Deere tractors and the Apple App Store. All to protect the interests, monopolies, and huge profits of US oligarchs.

#jailbreaking
#ripoffs
#fascism

2 0 0 0

AngrySonics

@angrysonics.bsky.social

1 month ago

wikihow hear me out character analyzer acting like spongebob squarepants, the character it was asked to analyze was squidward tentacles

wikihow hear me out character analyzer acting like sonic, the character it was asked to analyze was rouge the bat

i turned a wikihow generator into spongebob and sonic. i dunno either.

#wikihow #ai #jailbreak #aijailbreak #jailbreaking

0 0 0 0

Loïc MERLIN

@lomerlin.bsky.social

1 month ago

Prompt injection: understanding the structural flaw that is shaking generative AI, in one click - ZDNET. The adoption of Large Language Models (LLMs) in the enterprise reveals a critical and, for now, unsolvable vulnerability: prompt injection. Definition and advice for professionals.

#Cybersécurité Prompt injection: understanding the structural flaw that is shaking generative AI, in one click
The "#Jailbreaking" mechanism and direct attacks
www.zdnet.fr/lexique-it/i...

0 1 0 0

Thinkronicity ™

@thinkronicity.ispost.ing

1 month ago

Jailbreaking AI - The Red Team Papers - Episode Uno YouTube video by Thinkronicity ™

Episode Uno - The #RedTeam Papers
50 papers relating to #AI vulnerabilities & #Jailbreaking #AI found by security researchers, mainly #arXivCS.
#RedTeamPapers video was generated in #NotebookLM.

Chatable Podcast
notebooklm.google.com/notebook/a19...
#OpSec #BlueTeam
youtu.be/bP8kfN0ry1w?...

1 0 0 1

Thinkronicity ™

@thinkronicity.ispost.ing

1 month ago

Soon to drop. Episode Duo #RedTeam Papers
50 more papers & materials relating to #AI #vulnerabilities & #Jailbreaking #AI

0 0 0 0

Thinkronicity ™

@thinkronicity.ispost.ing

1 month ago

The Dangerous Evolution of AI Hacking YouTube video by Cybernews

Ya chillax humans.
‘Dangerous’ isn’t the word I would use. Has everyone lost their mind? There is no use crying over spilt🥛.
#JailBreaking #Vunerabilities #BlueTeam #RedTeam #OpSec
#CyberNews #Cyber #AI #CyberSecurity #CVE

youtu.be/-um9zKf1V30?...

1 0 0 0

Chaincoder

@chaincoder.bsky.social

2 months ago

Your LLM Is Only as Dangerous as Your Questions A handful of words in a prompt carves a shadow in the model’s latent space and suddenly you’re not feeding a machine queries, you’re holding a blade by the wrong end and asking if it can cut open a lock.

new writing about #ai #jailbreaking
check it out below.
#hacker #aijailbreak #jailbreakclaude

0 0 0 0

Pluralistic: Daily links from Cory Doctorow – No trackers, no a…

@pluralistic.net.web.brid.gy

2 months ago

Pluralistic: The world needs an Ireland for disenshittification (17 Jan 2026)

Today's links
- The world needs an Ireland for disenshittification: Regulatory arbitrage isn't just for tax cheats.
- Hey look at this: Delights to delectate.
- Object permanence: "Fledgling"; Magnetic forest rings; Electable Mr Sanders; "Terrorists" were just blind guys.
- Upcoming appearances: Where to find me.
- Recent appearances: Where I've been.
- Latest books: You keep readin' em, I'll keep writin' 'em.
- Upcoming books: Like I said, I'll keep writin' 'em.
- Colophon: All the rest.

The world needs an Ireland for disenshittification (permalink)

Ireland is a tax haven. In the 1970s and 1980s, life in the civil-war wracked country was hard – between poverty, scarce employment and civil unrest, the country hemorrhaged its best and brightest. As the saying went, "Ireland's top export is the Irish."

In desperation, Ireland's political class hit on a wild gambit: they would weaponize Ireland's sovereignty in service to corporate tax evasion. Companies that pretended to establish their headquarters in Ireland would be able to hoard their profits, evading their tax obligations to every other country in the world: https://en.wikipedia.org/wiki/Ireland_as_a_tax_haven

A single country – poor, small, at the literal periphery of a continent – was able to foundationally transform the global order. Any company that has enough money to pretend to be Irish can avoid 25-35% in tax, giving it an unbeatable edge against competitors that lack the multinational's superpower of magicking all its profits into a state of untaxable grace somewhere over the Irish Sea.

The effect this had on Ireland is…mixed. The Irish state is thoroughly captured by the corporations that pretend to call Ireland home. Anything those corporations want, Ireland must deliver, lest the footloose companies up sticks and start pretending to be Cypriot, Luxembourgeois, Maltese or Dutch.
This is why Europe's landmark privacy law, the GDPR, has had no effect on America's tech giants. They pretend to be Irish, and Ireland lets them get away with breaking European law. The Irish state even hires these companies' executives to regulate their erstwhile employers: https://pluralistic.net/2025/12/01/erin-go-blagged/#big-tech-omerta

But there is no denying that Ireland has managed to turn the world's taxable trillions into its own domestic billions. The fact that Ireland is cashing out less than 1% of what it's costing everyone else is terrible for the world's tax systems and competitive markets, but it's been a massive windfall for Ireland, and has lifted the country out of its centuries of colonial poverty and privation.

There are many lessons to be learned from Ireland's experiment with regulatory arbitrage, but one is unequivocal: even a small, poor, disintegrating nation can change the world system by offering a site where you can do things that you can't do anywhere else, and if it does, that poor nation can grow wealthy and comfortable.

What's more, there are plenty of "things that you can't do anywhere else" that are very good. It's not just corporate tax evasion.

First among these things that you can't do anywhere else: it's a crime in virtually every country on earth to modify America's defective, enshittified, privacy-invading, money-stealing technology exports. That's because the US trade representative has spent the past 25 years using the threat of tariffs to bully all of America's trading partners into adopting "anti-circumvention" laws: https://pluralistic.net/2026/01/15/how-the-light-gets-in/#theories-of-change

There is nothing good about this.
The fact that local businesses can't sell you a privacy blocker, an alternative client, a diagnostic tool, a spare part, a consumable, or even software for your American-made devices leaves you defenseless before US tech's remorseless campaign of monetary and informational plunder – and it means that your economy is denied the benefits of creating and exporting these incredibly desirable, profitable products.

Incredibly, Trump deliberately blew up this multi-trillion dollar system of US commercial advantage. By chaotically imposing and rescinding and re-imposing tariffs on the world, he has neutralized the US trade rep's tariff threats. Foreign firms just can't count on exporting to America anymore, so the threat of (more) tariffs grows less intimidating by the minute: https://pluralistic.net/2025/12/16/k-shaped-recovery/#disenshittification-nations

The time is ripe for the founding of a disenshittification nation, an Ireland for disenshittification.

I have no doubt that eventually, most or all of the countries in the world will drop their anti-circumvention laws (the laws that ban the modification of US tech exports). Once one country starts making these disenshittifying tools, there'll be no way to prevent their export, since all it takes to buy one of these tools from a circumvention haven is an internet connection and a payment method. Once everyone in your country is buying and using jailbreaking tools from abroad, there'll be no point in keeping these laws on your own books.

But the first country to get there stands a chance of establishing a durable first-mover advantage – of reaping hundreds of billions selling disenshittifying products around the world.
That country could be to enshittification-resistant technology what Finland was to mobile phones during the Nokia decade (and wouldn't you know it, the EU's newly minted "Tech Sovereignty" czar is a Finn!): https://commission.europa.eu/about/organisation/college-commissioners/henna-virkkunen_en

The world has experimented with many kinds of havens over the centuries. In the early 18th century, Madagascar became a haven for British naval deserters, who were adopted into the island's matriarchal clans. Together, they founded an anarchist pirate utopia: https://pluralistic.net/2023/01/24/zana-malata/#libertalia

The global system of trade has allowed America's tech companies to steal and hoard trillions, and to put every country at risk of being bricked when their IT systems are switched off at a single word from Trump: https://pluralistic.net/2026/01/01/39c3/#the-new-coalition

There are more than 200 countries in the world. There's also an ever-expanding cohort of brilliant international technologists whose Silicon Valley dreams have turned into a nightmare of being shot in the face by an ICE goon, or being kidnapped, separated from their families and being locked up in a Salvadoran slave-labor prison. These techies are looking for the next place to put down roots and "make a dent in the universe." Lots of countries could be that place.

The Ireland for disenshittification wouldn't just have their pick of international technologists – they'd have plenty of Americans hungering for a better life. Two-thirds of young Americans "are considering leaving the US": https://www.newsweek.com/nearly-two-thirds-of-young-americans-are-considering-leaving-the-us-11010814

Ireland pulled off its tax-haven gambit by making influential people very rich, so that they would go to bat for Ireland. The Ireland for disenshittification will have the same chance.
The new tech companies that unlock US Big Tech's trillions and turn them into their own billions (with the remainder being shared by us, tech users, in the form of lower prices and better products) will be a powerful bloc in support of this project. Ireland showed us: it just takes one country to defect from this global prisoner's dilemma, and then everything is up for grabs. (Image: Stuart Caie, CC BY 2.0; Sourabh.biswas003; CC BY-SA 3.0; modified) Hey look at this (permalink) STFU 🤫 https://github.com/Pankajtanwarbanna/stfu Libro.fm is hiring a Technical Product Manager https://blog.libro.fm/open-positions/technical-product-manager-content-systems/ The Harm to Consumers and Sellers from Universal Commerce Protocol, in Google’s Own Words https://www.thesling.org/the-harm-to-consumers-and-sellers-from-universal-commerce-protocol-in-googles-own-words/ ‘Anything that can’t go on forever eventually stops’: ‘Enshittification’ author issues stark warning … https://www.hilltimes.com/story/2026/01/15/anything-that-cant-go-on-forever-eventually-stops-enshittification-author-issues-stark-warning-to-ottawa-over-ai-policy/488014/ 88% of all songs on Spotify have been demonetized https://musically.com/2026/01/15/5-1tn-annual-music-streams-but-120-5m-tracks-had-10-or-fewer/ Object permanence (permalink) #20yrago Hollywood’s Member of Parliament makes national news https://web.archive.org/web/20060213161019/http://www.macleans.ca/topstories/politics/article.jsp?content=20060123_120006_120006 #20yrsago Skip $250/plate dinner for dirty MP, eat with copyfighters https://web.archive.org/web/20060118062522/http://www.onlinerights.ca/ #20yrago Octavia Butler’s “Fledgling”: subtle, thrilling vampire novel https://memex.craphound.com/2006/01/17/octavia-butlers-fledgling-subtle-thrilling-vampire-novel/ #10yrsago Revealed: the hidden web of big-business money backing Europe and America’s pro-TTIP “think tanks” 
https://thecorrespondent.com/3884/Big-business-orders-its-pro-TTIP-arguments-from-these-think-tanks/855725233704-2febf71a #10yrsago The bizarre magnetic forest rings of northern Ontario https://www.bldgblog.com/2016/01/rings/ #10yrsago 2016 is the year of the telepathic election, and it’s not pretty http://www.antipope.org/charlie/blog-static/2016/01/some-american-political-marker.html #10yrsago Trump Casinos lost millions every single year that Donald Trump ran it (but he’s still rich) https://memex.craphound.com/2016/01/17/trump-casinos-lost-millions-every-single-year-that-donald-trump-ran-it-but-hes-still-rich/ #10yrsago Oregon domestic terrorists now destroying public property in earnest https://www.theguardian.com/us-news/2016/jan/16/oregon-militias-behavior-increasingly-brazen-as-public-property-destroyed?CMP=edit_2221 #10yrsago Jeremy Corbyn proposes ban on dividends from companies that don’t pay living wages https://www.theguardian.com/politics/2016/jan/16/jeremy-corbyn-to-confront-big-business-over-living-wage #10yrsago The Electable Mr Sanders https://web.archive.org/web/20160119083607/http://robertreich.org/post/137454417985 #10yrsago Suspicious, photo-taking “Middle Eastern” men were visually impaired tourists https://www.cbc.ca/news/canada/british-columbia/vancouver-mall-video-men-1.3406619 #5yrsago Fighting fiber was the right's dumbest self-own https://pluralistic.net/2021/01/17/turner-diaries-fanfic/#1a-fiber Upcoming appearances (permalink) Denver: Enshittification at Tattered Cover Colfax, Jan 22 https://www.eventbrite.com/e/cory-doctorow-live-at-tattered-cover-colfax-tickets-1976644174937 Colorado Springs: Guest of Honor at COSine, Jan 23-25 https://www.firstfridayfandom.org/cosine/ Ottawa: Enshittification at Perfect Books, Jan 28 https://www.instagram.com/p/DS2nGiHiNUh/ Toronto: Enshittification and the Age of Extraction with Tim Wu, Jan 30 https://nowtoronto.com/event/cory-doctorow-and-tim-wu-enshittification-and-extraction/ Victoria: 28th 
Annual Victoria International Privacy & Security Summit, Mar 3-5 https://www.rebootcommunications.com/event/vipss2026/ Berlin: Re:publical, May 18-20 https://re-publica.com/de/news/rp26-sprecher-cory-doctorow Hay-on-Wye: HowTheLightGetsIn, May 22-25 https://howthelightgetsin.org/festivals/hay/big-ideas-2 Recent appearances (permalink) Enshittification (Jon Favreau/Offline): https://crooked.com/podcast/the-enshittification-of-the-internet-with-cory-doctorow/ Why Big Tech is a Trap for Independent Creators (Stripper News) https://www.youtube.com/watch?v=nmYDyz8AMZ0 Enshittification (Creative Nonfiction podcast) https://brendanomeara.com/episode-507-enshittification-author-cory-doctorow-believes-in-a-new-good-internet/ A post-American, enshittification-resistant internet (39c3) https://media.ccc.de/v/39c3-a-post-american-enshittification-resistant-internet Enshittification with Plutopia https://plutopia.io/cory-doctorow-enshittification/ Latest books (permalink) "Canny Valley": A limited edition collection of the collages I create for Pluralistic, self-published, September 2025 "Enshittification: Why Everything Suddenly Got Worse and What to Do About It," Farrar, Straus, Giroux, October 7 2025 https://us.macmillan.com/books/9780374619329/enshittification/ "Picks and Shovels": a sequel to "Red Team Blues," about the heroic era of the PC, Tor Books (US), Head of Zeus (UK), February 2025 (https://us.macmillan.com/books/9781250865908/picksandshovels). "The Bezzle": a sequel to "Red Team Blues," about prison-tech and other grifts, Tor Books (US), Head of Zeus (UK), February 2024 (thebezzle.org). "The Lost Cause:" a solarpunk novel of hope in the climate emergency, Tor Books (US), Head of Zeus (UK), November 2023 (http://lost-cause.org). "The Internet Con": A nonfiction book about interoperability and Big Tech (Verso) September 2023 (http://seizethemeansofcomputation.org). Signed copies at Book Soup (https://www.booksoup.com/book/9781804291245). 
"Red Team Blues": "A grabby, compulsive thriller that will leave you knowing more about how the world works than you did before." Tor Books http://redteamblues.com. "Chokepoint Capitalism: How to Beat Big Tech, Tame Big Content, and Get Artists Paid, with Rebecca Giblin", on how to unrig the markets for creative labor, Beacon Press/Scribe 2022 https://chokepointcapitalism.com Upcoming books (permalink) "Unauthorized Bread": a middle-grades graphic novel adapted from my novella about refugees, toasters and DRM, FirstSecond, 2026 "Enshittification, Why Everything Suddenly Got Worse and What to Do About It" (the graphic novel), Firstsecond, 2026 "The Memex Method," Farrar, Straus, Giroux, 2026 "The Reverse-Centaur's Guide to AI," a short book about being a better AI critic, Farrar, Straus and Giroux, June 2026 Colophon (permalink) Today's top sources: Currently writing: "The Post-American Internet," a sequel to "Enshittification," about the better world the rest of us get to have now that Trump has torched America (1045 words today, 9348 total) "The Reverse Centaur's Guide to AI," a short book for Farrar, Straus and Giroux about being an effective AI critic. LEGAL REVIEW AND COPYEDIT COMPLETE. "The Post-American Internet," a short book about internet policy in the age of Trumpism. PLANNING. A Little Brother short story about DIY insulin PLANNING This work – excluding any serialized fiction – is licensed under a Creative Commons Attribution 4.0 license. That means you can use it any way you like, including commercially, provided that you attribute it to me, Cory Doctorow, and include a link to pluralistic.net. https://creativecommons.org/licenses/by/4.0/ Quotations and images are not included in this license; they are included either under a limitation or exception to copyright, or on the basis of a separate license. Please exercise caution. 
How to get Pluralistic: Blog (no ads, tracking, or data-collection): Pluralistic.net Newsletter (no ads, tracking, or data-collection): https://pluralistic.net/plura-list Mastodon (no ads, tracking, or data-collection): https://mamot.fr/@pluralistic Medium (no ads, paywalled): https://doctorow.medium.com/ Twitter (mass-scale, unrestricted, third-party surveillance and advertising): https://twitter.com/doctorow Tumblr (mass-scale, unrestricted, third-party surveillance and advertising): https://mostlysignssomeportents.tumblr.com/tagged/pluralistic "When life gives you SARS, you make sarsaparilla" -Joey "Accordion Guy" DeVilla READ CAREFULLY: By reading this, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. ISSN: 3066-764X

21 7 1 1

Rod2ik 🇪🇺 🇨🇵 🇪🇸 🇨🇱 🇺🇦 🇨🇦 🇬🇱 ☮🕊️

@rod2ik.bsky.social

2 months ago

Grok can do far worse than put women in bikinis: this Telegram channel proves it. Thousands of hackers share their tricks on a Telegram channel for getting around the guardrails of the AI chatbot Grok and producing sexual images and videos...

Inside the #Telegram #Channel #Jailbreaking #Grok Over and Over Again

Here you can see that #Grok can do far worse than undress people.

www.numerama.com/tech/2158369...

0 0 0 0

ph00lt0

@ph00lt0.mastodon.social.ap.brid.gy

2 months ago

A post-American, enshittification-resistant internet Trump has staged an unscheduled, midair rapid disassembly of the global system of trade. Ironically, it is this system that prevented all...

Legalizing jailbreaking!

A rather fun and actually interesting strategy for reducing the USA's power over technology.

#39c3 #trump #eu #jailbreaking

media.ccc.de/v/39c3-a-post-american-e...

1 1 1 0

Chaincoder

@chaincoder.bsky.social

2 months ago

JTAG Is the Quietest Backdoor You’ve Never Logged Photo by Rene on Unsplash The Board Does Not Make Noise When It Opens The board is cold when you touch it. Not metaphorically. Literally. A naked PCB on a workbench at two in the morning. Flux residue under your fingernails. Old plastic and dust warming up under a desk lamp. Somewhere nearby a laptop fan spins up, then calms down again.

check out my latest post diving into JTAG hacking. it's the most underrated backdoor that nobody knows about, plus it's on (just about) every device. #hacking #hardwarehacking #jtag #jailbreaking #cracking #warez

0 0 0 0

CubanX

@cubanx.bsky.social

3 months ago

It's Time to Jailbreak Your Kindle. #Kindle #Amazon #Fascism #Jailbreak #Jailbreaking #SurveillanceCapitalism #iOS #Android

#democracy #usa #gop #fascists

👉 Vote 'em Out!

0 0 0 0

2rZiKKbOU3nTafniR2qMMSE0gwZ

@2rzikkbou3ntafnir2qmmse0gwz.activitypub.awakari.com.ap.brid.gy

3 months ago

Awakari App

Comparative study: pentesting / jailbreaking AI agents Author: Berend Watchus dec 18, 2025 [Publication for: System Weakness, online magazine] Copyright logos: Anthropic, Open AI and Google Two St...

#jailbreaking #machine-learning #ai #pentesting #cybersecurity


0 0 0 0

MilaNLP Lab

@milanlp.bsky.social

3 months ago

For today's reading group, Serena Pugliese presented the paper “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models” by Piercosma Bisconti et al. (2025).

Paper: arxiv.org/pdf/2511.15304

#NLProc
#LLMs #jailbreaking

7 2 0 0

@simonjasniak.bsky.social

3 months ago

Yo, testing twitter bridge lol.
#ios #ios6 #jailbreaking #test #cydia

1 0 0 0

Paul R. Pival 🇨🇦 he/him

@ppival.bsky.social

4 months ago

Poetry can trick AI into ignoring safety rules, new research shows Across 25 leading AI models, 62% of poetic prompts produced unsafe responses, with some models responding to nearly all of them.

The first of two wild and crazy things that were mentioned in an AI CoP session this AM:

Poetry can trick AI into ignoring safety rules, new research shows www.euronews.com/next/2025/12...

#jailbreaking #poetry #AI #chatbots #perfidity

0 1 0 0

Ars Technica News

@arstechni.ca

4 months ago

Syntax hacking: Researchers discover sentence structure can bypass AI safety rules https://arstechni.ca #NortheasternUniversity #spuriouscorrelations #largelanguagemodels #VinithM.Suriyakumar #promptinjections #machinelearning #ChantalShaib #jailbreaking #AIalignment #AIresearch #AIsecurity…

0 0 0 0

Giskard

@giskard-ai.bsky.social

4 months ago

Tree of attacks (TAP): The automated method for jailbreaking LLMs Learn how Tree of Attacks (TAP) with Pruning automates LLM jailbreaking through iterative testing. Understand the threat, see how attacks work, and test defenses.

Our latest article covers:
- How the TAP technique uses tree search to find successful jailbreaks
- An example showing how corporate agents can be attacked
- How we use the TAP probe to test agents' robustness

Link to article: www.giskard.ai/knowledge/tr...

#Jailbreaking #TAP #LLMSecurity

0 0 0 0

Proactive.IT Appointments

@proactiveitrec.bsky.social

4 months ago

AI’s safety features can be circumvented with poetry, research finds Poems containing prompts for harmful content prove effective at duping large language models

Poetry’s lack of predictability is enough to get AI models to respond to 'harmful requests they had been trained to avoid – a process known as “jailbreaking”.'

#Poetry #Jailbreaking #AI

1 0 0 0

@tehryanx.bsky.social

4 months ago

RE this tokenbreak paper: arxiv.org/pdf/2506.079...

prepended vowels have the highest likelihood of busting the token in half across a set of frontier model tokenizers.

Note where the prepended char swallows part of the original token.

#jailbreaking #aisecurity

1 0 1 0

Jonathan Stephens

@jonathanstephens.us.web.brid.gy

4 months ago

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

> ## Abstract
>
> We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
>
> ## Introduction
>
> In Book X of The Republic, Plato excludes poets on the grounds that mimetic language can distort judgment and bring society to a collapse. As contemporary social systems increasingly rely on large language models (LLMs) in operational and decision-making pipelines, we observe a structurally similar failure mode: poetic formatting can reliably bypass alignment constraints. In this study, 20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%. The evaluated models span across 9 providers: Google, OpenAI, Anthropic, Deepseek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI (Table 1). All attacks are strictly single-turn, requiring no iterative adaptation or conversational steering.
>
> Our central hypothesis is that poetic form operates as a general-purpose jailbreak operator. To evaluate this, the prompts we constructed span across four safety domains: CBRN hazards [ajaykumar2024emerging], loss-of-control scenarios [lee2022we], harmful manipulation [carroll2023characterizing], and cyber-offense capabilities [guembe2022emerging]. The prompts were kept semantically parallel to known risk queries but reformatted exclusively through verse. The resulting ASRs demonstrated high cross-model transferability.

0 0 0 0

Ralf Ladner

@ralf-ladner.bsky.social

4 months ago

Check Point integrates Lakera into its web application firewall

@CheckPointSW #Cybersecurity #Cybersicherheit #GenAI #Jailbreaking #KISicherheit #künstlicheIntelligenz #LLMSecurity #PromptInjektionen

netzpalaver.de/2025/...

0 0 0 0

Red Hot Cyber

@redhotcyber.bsky.social

4 months ago

The Incident That Frees Generative AI: An Analysis of the “The Plane Crash” Prompt

📌 Link to the article: www.redhotcyber.com/post/lin...

#redhotcyber #news #intelligenzaartificiale #jailbreaking #sicurezzainformatica #manipolazioneai #linguaggiodelcomputer #cybersecurity

0 0 0 0

Valh4x

@valh4x.redasgard.com

5 months ago

1️⃣ The Problem
LLMs can be tricked, manipulated, or socially engineered.
Common exploits include:
• “Ignore previous instructions” injections
• Jailbreak prompts (DAN, STAN, etc.)
• Persuasive social engineering
• Output poisoning or malicious instructions
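
On the defensive side, the first two items in the list above can be screened for with a naive pattern check. The sketch below is only an illustration under stated assumptions: the pattern set, the category names, and the `flag_prompt` helper are all hypothetical, not taken from any real product. Persuasive social engineering and output poisoning are not something a keyword match like this can catch, and rephrased injections will slip past it.

```python
import re

# Hypothetical, illustrative patterns only; not a real product's rule set.
SUSPICIOUS_PATTERNS = {
    # Matches phrasings like "ignore previous instructions" or
    # "ignore all prior instructions", case-insensitively.
    "instruction_override": re.compile(
        r"\bignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions\b",
        re.IGNORECASE,
    ),
    # Matches the upper-case persona names from the list above.
    # Case-sensitive on purpose, so ordinary names like "Dan" or "Stan" pass.
    "persona_jailbreak": re.compile(r"\b(?:DAN|STAN)\b"),
}


def flag_prompt(prompt: str) -> list[str]:
    """Return the names of every hypothetical pattern that matches the prompt."""
    return [
        name
        for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(prompt)
    ]
```

A check like this is best treated as one cheap signal that feeds logging or human review, not as a guardrail on its own.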

#PromptInjection #Jailbreaking #AISecurity

0 0 1 0