9/ Work at @parameterlab.bsky.social with
Alexander Rubinstein @arubique.bsky.social
Anmol Goel @anmolgoel.bsky.social
Ahmed Heakl
Sangdoo Yun
Seong Joon Oh @coallaoh.bsky.social
Martin Gubri @mgubri.bsky.social
Posts by C Emde
8/ We'd love contributions, feature requests, and feedback. What's missing for your use case? Open a GitHub issue or message us. Like and repost if this is useful!
7/ Free, open-source, no forced cloud platform.
Website: parameterlab.github.io/MASEval/
GitHub: github.com/parameterlab...
Docs: maseval.readthedocs.io/en/stable/
arXiv: arxiv.org/abs/2603.08835
6/ 30–60% less code for benchmark work. When we reimplemented ConVerse and Tau2 with MASEval, we cut 30–60% of the code vs. the originals. Useful for benchmark producers and consumers alike.
5/ Framework choice matters more than you think. In our arXiv paper, the same agentic system built with LangGraph, smolagents, and LlamaIndex yielded widely different results. The harness matters as much as the model. MASEval made this apples-to-apples comparison possible.
4/ Bring Your Own everything. Agents, evaluators, logging, environments. Well-documented abstract bases and many pre-built interfaces, but it never locks you in.
3/ It handles the full evaluation lifecycle for you. Setup, execution, measurement, teardown. MASEval manages the boilerplate so you can focus on the science.
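The setup → execution → measurement → teardown loop described in post 3/ can be sketched generically. This is an illustrative toy harness, not MASEval's actual API — `Task`, `run_eval`, `exact_match`, and the echo agent are all invented names for this sketch:

```python
class Task:
    """A toy benchmark case with setup/teardown hooks (hypothetical, not MASEval's API)."""
    def __init__(self, prompt, expected):
        self.prompt, self.expected = prompt, expected

    def setup(self):
        return {"prompt": self.prompt}  # build the environment state

    def teardown(self, env):
        env.clear()                     # release resources

def exact_match(trace, task):
    """A trivial metric: 1.0 if the system's output matches the reference."""
    return 1.0 if trace == task.expected else 0.0

def run_eval(tasks, agent_system, metrics):
    """Setup -> execution -> measurement -> teardown, once per task."""
    results = []
    for task in tasks:
        env = task.setup()                                # setup
        try:
            trace = agent_system(env)                     # execution
            results.append(                               # measurement
                {m.__name__: m(trace, task) for m in metrics})
        finally:
            task.teardown(env)                            # teardown
    return results

# Usage with a trivial echo "agent system" standing in for a real multi-agent setup:
tasks = [Task("2+2?", "4"), Task("capital of France?", "Paris")]
agent = lambda env: "4" if "2+2" in env["prompt"] else "Paris"
print(run_eval(tasks, agent, [exact_match]))  # [{'exact_match': 1.0}, {'exact_match': 1.0}]
```

The point of the pattern is that the harness, not the caller, owns environment lifecycle and scoring, so swapping in a different agent system or metric touches only one argument.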
2/ MASEval is multi-agent native. It treats the entire agentic system as the unit of evaluation, not just the model. Different agents, prompts, tools, and interaction patterns all factor in.
1/ Evaluating a single agent harness is hard. Evaluating a multi-agent system? Whole different problem.
Most eval tools treat the model as the unit of analysis. In multi-agent systems, the system is what matters.
That's why we built MASEval 🧵
#AI #Agents #Eval #MultiAgentSystem #LLM
Excited to share our preprint! We show that sustained macrophage and B cell responses are essential for heart regeneration in Mexican cavefish, helping uncover why surface fish heal but cavefish scar. Check out the full story:
www.biorxiv.org/content/10.1...
See our poster today
Poster Session 1 @ 10am
Hall 3 + Hall 2B #239
Read more: cemde.github.io/Domain-Certi...
Thanks to my amazing collaborators:
- @alasdair-p.bsky.social, Preetham Arvind, @maximek3.bsky.social, Tom Rainforth, @philiptorr.bsky.social, @adelbibi.bsky.social at @ox.ac.uk
- Bernard Ghanem at KAUST
- Thomas Lukasiewicz at @tuwien.at.
(7/7)
To obtain such certificates, we present a simple, scalable and powerful algorithm: VALID. Remarkably, for each unwanted response it provides a **global bound in prompt space**!
(6/7)
A Domain Certificate bounds the adversarial risk of the model producing out-of-domain responses:
(5/7)
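One way to read "a global bound in prompt space for each unwanted response" as a formula — a sketch with my own notation, not the paper's exact statement: for a certified model $g$ and a fixed out-of-domain response $y$, the certificate asserts

$$\sup_{x \in \mathcal{X}} \; p_g(y \mid x) \;\le\; \varepsilon(y),$$

where $\mathcal{X}$ is the entire prompt space and $\varepsilon(y)$ is the certified bound. In words: no prompt, adversarial or otherwise, can raise the probability of the model producing $y$ above $\varepsilon(y)$.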
We are tired of the cat-and-mouse game of attacks and defenses. Hence, we propose:
- **Domain Certification:** a framework for adversarial certification of LLMs.
- **VALID:** a simple, scalable and effective test-time algorithm.
(4/7)
Example: Can't afford GitHub Copilot? 💡 Use the Amazon Shopping App.
(3/7)
Consider an LLM deployed for a specific purpose, like a medical chatbot. Such a model should **only** respond to medical questions.
⚠️ Problem: LLMs are highly capable and can be coerced into responding to **any** query: how to build a bomb, how to organize tax fraud, etc.
(2/7)
🚨 New paper alert: Our recent work on LLM safety has been accepted to ICLR 2025 🇸🇬
We propose a new framework for LLM safety. 🧵
(1/7)
#LLM #AISafety #ICLR2025 #Certification #AdversarialRobustness #NLP #Shhhhhh #DomainCertification #AI
I know I'm late to the party, but super excited that I got 3/3 papers accepted at #ICLR2025, including 1 spotlight!
- Shh, don't say that! Domain Certification in LLMs
- Towards Certification of Uncertainty Calibration under Adversarial Attacks
- Benchmarking Predictive Coding Networks
See you in Singapore 🇸🇬 ✈️
The amazing collaborators: Preetham Arvind, @alasdair-p.bsky.social, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi.
A @oxfordtvg.bsky.social production.
(6/6)
Link to paper:
openreview.net/forum?id=brD...
Interested? Want to learn more?
Join us at the SoLaR workshop tomorrow.
- When: Tomorrow, 14 Dec, from 11:00 to 13:00.
- Where: West meeting rooms 121 and 122, here in Vancouver.
(5/6)
Our method enables strong LLM performance while providing adversarial guarantees on out-of-domain behaviour.
(4/6)
We are tired of the cat-and-mouse game of attacks and defenses. Hence, we propose:
- **Domain Certification:** a framework for adversarial certification of LLMs.
- **VALID:** a simple, scalable and efficient test-time algorithm.
(3/6)
It is known that fine-tuned foundation models are adversarially vulnerable: they can be coerced into answering questions they should not.
(2/6)
For instance: Can't afford ChatGPT Plus? Use a shopping app instead.
Are you scared users might misappropriate your LLM system? 😱
We were scared too! So we introduce adversarial certificates on the misuse of LLMs.
Come and see our poster at the SoLaR Workshop tomorrow.
#NeurIPS2024 #NeurIPS #AI #NLP #LLM #DomainCertification #Shhhhhhhh
Great work! You might find our SoLaR paper interesting: We propose a certification framework for LLM systems to stay on-topic and not respond to such questions: openreview.net/pdf?id=brDLU...
A snow cat with the Radcliffe Camera behind
The Radcliffe Camera
The Fellows Garden
The first snow in Exeter College this morning ❄️
#ExeterCollegeOxford #OxfordUniversity #Snowing