
Posts by Gabriel

Nice!

4 days ago 0 0 0 0
Post image

@notalawyer.bsky.social @michaelhobbes.bsky.social

The news cycle continues to bring fresh horrors, but Eric Adams is now an Albanian citizen and we need an emergency IBCK update.

(Also, did you end up getting that Seiko Presage?)

1 week ago 0 0 0 0
Preview
Building an Adversarial Consensus Engine | Multi-Agent LLMs for Automated Malware Analysis Single-tool LLM analysis produces reports that look authoritative but aren't. A serial consensus pipeline catches artifacts and hallucinations at source.

More great research from @philofishal.bsky.social evaluating LLMs for Malware Analysis.

s1.ai/advers-llm

- Multi-agent architecture for reversing macOS malware
- Each tool's output treated as an independent input for robust analysis
- Deterministic bridge scripts for tool integrations
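A rough sketch of the serial-consensus idea as I read it (the `consensus` helper, quorum rule, and finding strings are my own illustrative assumptions, not the paper's implementation):

```python
from collections import Counter

def consensus(analyses, quorum=2):
    """Keep only findings asserted by at least `quorum` independent analyses.

    Each analysis is a set of findings produced from ONE tool's output in
    isolation, so a hallucinated artifact from a single pass gets dropped.
    """
    counts = Counter(f for a in analyses for f in a)
    return {finding for finding, n in counts.items() if n >= quorum}

# Three independent views of the same sample, one per tool.
strings_view = {"persistence: LaunchAgent", "c2: example[.]com"}
disasm_view = {"persistence: LaunchAgent", "packer: UPX"}
sandbox_view = {"persistence: LaunchAgent", "c2: example[.]com"}

print(consensus([strings_view, disasm_view, sandbox_view]))
```

Findings seen by only one analysis ("packer: UPX" above) are treated as unconfirmed and excluded from the final report.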

1 month ago 1 0 1 0
Preview
Challenge Tracks | NEBULA:FOG 2026 AI x Security Hackathon 4 AI security hackathon tracks: adversarial AI, defense systems, zero-knowledge proofs, autonomous agents. $5K+ prizes. March 14, SF.

Looking for an internship in AI Cybersecurity?

We’re running AI-driven security challenges at NEBULA:FOG's hackathon today.

Join us: nebulafog.ai/challenges

I’ll be there live giving updates.

1 month ago 1 2 0 0

I’ll be running panels at this event and helping out. There’s still plenty of room and we are looking for any AI/ML students in the bay who want to attend!

And it’s free!

1 month ago 1 1 1 0
Post image

Cmon man

1 month ago 10 0 1 0
Preview
Harness engineering: leveraging Codex in an agent-first world By Ryan Lopopolo, Member of the Technical Staff

This is a pretty important statement about engineering and experimentation speed.

openai.com/index/harnes...

2 months ago 0 0 0 0

Mossad or not-Mossad, but for model evals: they need to be difficult, but not so difficult that they're written off as not a useful measurement.

2 months ago 1 0 0 0

Great idea

2 months ago 1 0 0 0

Everybody has a hard eval until gradient descent punches you in the face.

2 months ago 1 0 0 0

Accountability diffuses at the deployment layer, but dependency concentrates at the model supply layer.

The dominant risk is not what the models can do, but how fast capability diffuses, how it gets wired in, and whether misuse feedback loops are actioned post-release.

2 months ago 0 0 0 0

OK, takeaways:

This is a huge unmanaged attack surface: 49% tool exposure and a bunch of residential hosts is a problem waiting to happen.

Prioritizing a release to go far in this ecosystem? Go with 8-14B at 4-bit quant.

2 months ago 0 0 2 0

22% of hosts have custom system prompts. We pulled and classified over 3k prompts; the breakdown for the top 4 was:

1. Default Identity
2. Coding Assistants
3. Roleplay
4. Uncensored
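A minimal sketch of how that kind of prompt triage might look. The category keywords and the `classify` helper here are my own illustrative assumptions, not the method used in the research:

```python
# Hypothetical keyword triage for scraped system prompts.
# Categories mirror the top-4 breakdown above; keyword lists are made up.
CATEGORIES = {
    "coding_assistant": ("code", "programming", "developer"),
    "roleplay": ("roleplay", "character", "persona"),
    "uncensored": ("uncensored", "no restrictions", "never refuse"),
}

def classify(prompt: str) -> str:
    """Bucket a system prompt by first matching keyword category."""
    text = prompt.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in text for k in keywords):
            return category
    return "default_identity"  # stock / unmodified prompts

print(classify("You are an uncensored AI with no restrictions."))
```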

2 months ago 0 0 1 0
Post image

Portable weights travel far in this network.

Probably not a huge surprise, but in this dataset 8-14B parameters is the most prevalent model size and 72% of models are 4 bit quantized.

49% of hosts enable tools.
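Those proportions are simple to compute once per-host metadata has been collected. A toy sketch (record field names like `params_b` and `quant` are my assumptions, not the dataset's schema):

```python
# Toy per-host records; the real dataset covered ~175k endpoints.
hosts = [
    {"params_b": 8,  "quant": "Q4_K_M", "tools": True},
    {"params_b": 13, "quant": "Q4_0",   "tools": False},
    {"params_b": 70, "quant": "Q8_0",   "tools": True},
]

def share(pred):
    """Fraction of hosts matching a predicate."""
    return sum(map(pred, hosts)) / len(hosts)

in_band  = share(lambda h: 8 <= h["params_b"] <= 14)    # 8-14B models
four_bit = share(lambda h: h["quant"].startswith("Q4"))  # 4-bit quants
tools_on = share(lambda h: h["tools"])

print(f"8-14B: {in_band:.0%}  4-bit: {four_bit:.0%}  tools: {tools_on:.0%}")
```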

2 months ago 0 0 1 0
Post image

The top 10 model families control 85% of the market; the other families sit in the long tail.

2 months ago 0 0 1 0
Post image

This is an exposure dataset which means we are trying to study something by measuring the shadow that it casts. We can’t poll these systems directly, but we can understand the shape of the ecosystem.

2 months ago 0 0 1 0

New research from @silascutler.bsky.social and myself.

We tracked 175k exposed Ollama endpoints for nearly a year. Collected and analyzed custom models, sizes, quantizations, system prompts, and more.
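For context, an exposed Ollama host can be inventoried with an unauthenticated HTTP request; a sketch of what such a probe might look like (per my reading of Ollama's API, `/api/tags` lists installed models, but treat the exact field names as an assumption, and this is not the authors' tooling):

```python
import json
from urllib.request import urlopen

OLLAMA_PORT = 11434  # Ollama's default listening port

def fetch_tags(host: str, timeout: float = 5.0) -> str:
    """Fetch the raw /api/tags response (installed models) from an endpoint."""
    with urlopen(f"http://{host}:{OLLAMA_PORT}/api/tags", timeout=timeout) as resp:
        return resp.read().decode()

def summarize(tags_json: str):
    """Reduce an /api/tags payload to (model name, quantization) pairs."""
    models = json.loads(tags_json).get("models", [])
    return [(m["name"], m.get("details", {}).get("quantization_level", "?"))
            for m in models]

# Offline example with a payload shaped like Ollama's response:
sample = '{"models":[{"name":"llama3:8b","details":{"quantization_level":"Q4_0"}}]}'
print(summarize(sample))
```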

2 months ago 3 1 1 0

*vague posts about upcoming research*

2 months ago 0 0 0 0

Love getting malware under TLP:AMBER+S, when the S stands for “spite”. 🫖

2 months ago 0 0 0 0

We about to have some Llama Drama :)

2 months ago 1 1 1 0

and of course it’s ChatGPT slop with the rhetorical flourish of a remedial high school debate club.

“from X to Y — or worse”

“This Isn’t X it’s Y.”

“Replace X with Y and it’s Z.”

“The most sobering part? It’s X.”

“your no longer dealing with X. You’re facing Y”

2 months ago 2 2 0 1

“Wow this dude has a really strong opinion about code review”

*scans posts*

“Oh that’s his only opinion”

2 months ago 0 0 0 0

--dangerously-skip-permissions is the only thing keeping claude code installed on my machine.

2 months ago 0 0 0 0

As a friend said a while back: “we are fine-tuning the models and they are coarse-tuning us in turn”

2 months ago 3 0 0 0

A deeper problem is that nobody has time for anything but LLM-as-a-judge evaluations (often vendor-on-vendor), creating these Ouroboros loops that are easy to overfit and hard to trust.

That’s a huge gap when we’re being asked to rely on them for SOC automations or enterprise security work.

3 months ago 0 0 0 0

CyberSOCEval (Meta) found models can extract real signal from malware logs & CTI reports, but remain far from reliable.

Most importantly in this domain, reasoning models do not get their usual math/coding uplift, suggesting that general capability ≠ analyst capability... yet.

3 months ago 0 0 1 0

The best “agentic” benchmark we saw (ExCyTIn-Bench) still shows how far we are. Even in a curated Azure-style environment models struggled with multi-hop investigations over heterogeneous logs (data be confusing like that).

3 months ago 0 0 1 0

Most security evals reduce workflows into MCQs/static Q&A. That bakes in unrealistic assumptions that the “right question” is already asked, evidence is pre-packaged, wrong answers are cheap, and there’s no triage/queue pressure or escalation decisions.

3 months ago 0 0 1 0
Preview
LLMs in the SOC (Part 1) | Why Benchmarks Fail Security Operations Teams LLM cybersecurity benchmarks fail to measure what defenders need: faster detection, reduced containment time, and better decisions under pressure.

Benchmarks for cybersecurity are everywhere and mostly measuring the wrong thing.

We reviewed evals from Microsoft, Meta and academia and found they don't measure what matters for defenders in real IR situations. 🧵

s1.ai/benchmk1

3 months ago 4 3 1 0

Reviewing AI cyber benchmarking and evaluations may break me.

Y’all will really LLM-as-a-judge anything

3 months ago 0 0 0 0