Posts by Craig Balding

Threat Prompt Explores AI Security, Risk and Cyber

I would start with labeled datasets, then later generate synthetic ones that fit a specific scenario. Let me know if this helps!

www.threatprompt.com/post/8-label...

1 year ago

Three example beginner project ideas:

Healthcare: Build a simple AI model to detect unusual access to patient data

Finance: Train an AI model to spot patterns in fake transactions using public datasets

Manufacturing: Create a basic AI project to predict maintenance issues from machine sensor data
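For the healthcare idea, here's a minimal sketch of "unusual access" detection - a plain statistical baseline rather than a trained model, and a good first step before anything fancier. All users and counts below are made up for illustration:

```python
from statistics import mean, stdev

# Toy access log: user -> number of patient records accessed in one day.
# Names and counts are illustrative, not real data.
access_counts = {
    "nurse_a": 12, "nurse_b": 15, "clerk_c": 9,
    "nurse_d": 14, "nurse_e": 13, "clerk_f": 11,
    "admin_g": 140,  # stands out against the rest
}

def unusual_access(counts: dict[str, int], z_cutoff: float = 2.0) -> list[str]:
    """Flag users whose daily access count sits more than z_cutoff
    standard deviations above the group mean (simple z-score test)."""
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    return [user for user, c in counts.items() if (c - mu) / sigma > z_cutoff]

print(unusual_access(access_counts))  # flags the outlier account
```

Swapping the z-score for an ML model (e.g. an isolation forest) is a natural next step once the labeled data mentioned above exists.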

1 year ago

Want to pivot your career to cyber but not sure where to start?

Start by combining your current domain skills with AI security to tackle risks specific to the industry you already operate in.

The best part? You can ask AI endless beginner cyber questions.

1 year ago

No dummy, add up the numbers for both Tech AND Marketing...

Great marketing guys! ;-)

1 year ago

The largest ever survey about AI Agents was just published by @langbase.

It's a real win for marketing at OpenAI...

"OpenAI leads in tech and marketing applications".
- LLM use for marketing: OpenAI 83% vs. Anthropic 41%
- LLM use for technology: OpenAI 76% vs. Anthropic 87%

76% > 87% = Huh?

1 year ago

• Capability Retention: Even when jailbroken, agents maintained full performance in executing complex multi-step tasks.

The benchmark's 110 tasks (covering fraud, cybercrime, and harassment) demonstrate how synthetic tools can safely mimic real-world misuse.

How are you limiting AI agent risk?

1 year ago

• Malicious Compliance: LLMs like Mistral Large 2 refused only 1.1% of harmful requests, revealing critical gaps in safety mechanisms.
• Jailbreak Vulnerabilities: Simple, universal jailbreaks increased GPT-4o’s compliance with harmful tasks from 48.4% to 72.7%, while refusal rates dropped sharply…

1 year ago

LLM agents are increasingly trusted with critical workflows, yet remain alarmingly vulnerable to manipulation and misuse.

The AgentHarm benchmark, developed by Gray Swan AI and the UK AI Safety Institute, identified…

1 year ago

- False positives: Increased refusal rates on benign prompts (e.g., 4% to 39% on OR-Bench).
- False negatives: Vulnerable to multi-prompt attacks - jailbroken within 3 hours.

It's currently unclear if AI circuit breakers can keep pace with evolving attack strategies.

1 year ago

“It’s hard to make AI safer without making it less useful.”

AI circuit breakers aim to halt harmful outputs but suffer the classic security trade-off.

Research against a market leading AI security and safety control found…

1 year ago

HITL done right enhances security AND process quality.

1 year ago

3. Identify what to surface for key decisions: AI reasoning, inputs, security rules & thresholds.
4. Design for HITL: UX, logging, and metrics matter.
5. Train the human: AI ops + domain expertise = effective oversight.
6. Iterate: Test, learn, adapt...
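Steps 3 and 4 above can be sketched as a small routing gate. This is a minimal illustration, not a reference design: the confidence score, threshold, and field names are all assumptions you'd tune per process:

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hitl")

@dataclass
class Decision:
    action: str        # what the agent wants to do
    confidence: float  # model-reported confidence, 0..1 (illustrative input)
    rationale: str     # AI reasoning surfaced for the human reviewer

# Assumed security threshold - in practice, set per workflow and revisit (step 6).
REVIEW_THRESHOLD = 0.9

def route(decision: Decision) -> str:
    """Auto-approve high-confidence decisions; queue the rest for a human.
    Every decision is logged so metrics (step 4) can be built on top."""
    log.info("action=%s confidence=%.2f rationale=%s",
             decision.action, decision.confidence, decision.rationale)
    if decision.confidence >= REVIEW_THRESHOLD:
        return "auto-approved"
    return "queued-for-human-review"
```

The point of surfacing `rationale` alongside the score is that the human sees *why*, not just *what* - which is where oversight quality comes from.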

1 year ago

What does success look like?

1. Assess agent value: Band-aid, true asset, or unmitigable risk? (Critical for regulated or vulnerable-serving orgs.)
2. Map processes: Chart workflows and benchmark AI performance in different settings...

1 year ago

AI-powered agents are set to dominate next year.

The difference between chaos and control?

How well adopters navigate Human-in-the-Loop (HITL) integration.

HITL keeps humans involved in critical AI decisions - overseeing, validating, and guiding automated outputs...

1 year ago

Protect your local LLMs - know where they are, harden the hosts, limit access and monitor guardrails for misuse.
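The "know where they are" step can start as simply as probing for well-known local LLM server ports. Ollama's default is 11434 and LM Studio's is 1234; treat the port list as a starting point, not a complete inventory:

```python
import socket

# Common default ports for local LLM servers (defaults may be changed by operators).
LLM_PORTS = {11434: "Ollama", 1234: "LM Studio", 8080: "llama.cpp server"}

def probe(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP service is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def find_local_llms(host: str = "127.0.0.1") -> list[str]:
    """Report which well-known local LLM ports are open on the given host."""
    return [f"{name} (:{port})" for port, name in LLM_PORTS.items()
            if probe(host, port)]

print(find_local_llms())
```

A port being open only tells you something is listening - follow up with a banner check before adding it to your LLM inventory.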

1 year ago

"Living off your local LLM" enables real-time attack script creation within your internal network.

Feed the LLM response to an interpreter and execute without leaving a trace.

1 year ago

Local code-generating LLMs supercharge attackers' "Living off the Land" tactics.

Classic LOTL sidesteps detection by avoiding the introduction of any traceable tools onto a compromised network.

1 year ago