
Posts by C Emde

9/ Work at @parameterlab.bsky.social with

Alexander Rubinstein @arubique.bsky.social
Anmol Goel @anmolgoel.bsky.social
Ahmed Heakl
Sangdoo Yun
Seong Joon Oh @coallaoh.bsky.social
Martin Gubri @mgubri.bsky.social

4 weeks ago

8/ We'd love contributions, feature requests, and feedback. What's missing for your use case? Open a GitHub issue or message us. Like and repost if this is useful!

4 weeks ago
MASEval — Multi-Agentic System Evaluation
MASEval is a unified, agent-agnostic evaluation framework and benchmark library for multi-agent systems. Compare frameworks, not just models.

7/ Free, open-source, no forced cloud platform.

🔗
Website: parameterlab.github.io/MASEval/
GitHub: github.com/parameterlab...
Docs: maseval.readthedocs.io/en/stable/
arXiv: arxiv.org/abs/2603.08835

4 weeks ago

6/ 30-60% less code for benchmark work. When we reimplemented ConVerse and Tau2 with MASEval, we cut between 30 and 60% of code vs. the originals. Useful for benchmark producers and consumers alike.

4 weeks ago

5/ Framework choice matters more than you think. In our arXiv paper, the same agentic system built with LangGraph, smolagents, and LlamaIndex yielded widely different results. The harness matters as much as the model. MASEval made this apples-to-apples comparison possible.

4 weeks ago
Post image

4/ Bring Your Own everything: agents, evaluators, logging, environments. MASEval ships well-documented abstract bases and many pre-built interfaces, but it never locks you in.

4 weeks ago
Post image

3/ It handles the full evaluation lifecycle for you. Setup, execution, measurement, teardown. MASEval manages the boilerplate so you can focus on the science.
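The four stages named above can be pictured as a harness loop. The sketch below uses invented names throughout (`Task`, `Result`, `run_harness`); it only mirrors the stages described in the post and is not MASEval's actual API — see the docs for the real interfaces.

```python
# Illustrative evaluation-lifecycle loop: setup -> execution ->
# measurement -> teardown. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    query: str
    expected: str


@dataclass
class Result:
    task: str
    output: str
    score: float


def run_harness(tasks, agent_system, metric):
    results = []
    for task in tasks:
        env = {"query": task.query}            # setup: fresh environment per task
        output = agent_system(env)             # execution: run the whole agent system
        score = metric(output, task.expected)  # measurement
        results.append(Result(task.name, output, score))
        env.clear()                            # teardown: release per-task state
    return results


# Usage with a trivial echo "agent system" and an exact-match metric:
tasks = [Task("t1", "ping", "ping"), Task("t2", "pong", "PONG")]
echo_agent = lambda env: env["query"]
exact = lambda out, exp: float(out == exp)
results = run_harness(tasks, echo_agent, exact)
print([r.score for r in results])  # -> [1.0, 0.0]
```

The point of a harness like this is that the agent system and the metric are swappable callables, so the same benchmark loop runs unchanged across frameworks.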

4 weeks ago

2/ MASEval is multi-agent native. It treats the entire agentic system as the unit of evaluation, not just the model. Different agents, prompts, tools, and interaction patterns all factor in.

4 weeks ago
Post image

1/ Evaluating a single agent harness is hard. Evaluating a multi-agent system? Whole different problem.
Most eval tools treat the model as the unit of analysis. In multi-agent systems, the system is what matters.
That's why we built MASEval 🧵

#AI #Agents #Eval #MultiAgentSystem #LLM

4 weeks ago
Absence of a prolonged macrophage and B cell response inhibits heart regeneration in the Mexican cavefish
A balanced immune response after cardiac injury is crucial to successful heart regeneration, but knowledge of what distinguishes a regenerative from a scarring response is still limited. The Mexican c...

Excited to share our preprint! We show that sustained macrophage and B cell responses are essential for heart regeneration in Mexican cavefish, helping uncover why surface fish heal but cavefish scar 🫀🐟. Check out the full story:
www.biorxiv.org/content/10.1...

11 months ago

See our poster today

Poster Session 1 @ 10am

Hall 3 + Hall 2B #239

11 months ago
Shh, don't say that! Domain Certification in LLMs
Domain Certification: a novel framework providing provable adversarial defenses for LLM safety.

Read more: cemde.github.io/Domain-Certi...

Thanks to my amazing collaborators:
- @alasdair-p.bsky.social, Preetham Arvind, @maximek3.bsky.social, Tom Rainforth, @philiptorr.bsky.social, @adelbibi.bsky.social at @ox.ac.uk
- Bernard Ghanem at KAUST
- Thomas Lukasiewicz at @tuwien.at.

(7/7)

1 year ago
Post image

To obtain such certificates, we present a simple, scalable, and powerful algorithm: VALID. Remarkably, for each unwanted response it provides a **global bound in prompt space** 🚀

(6/7)

1 year ago
Post image

A Domain Certificate bounds the adversarial risk of the model producing out-of-domain responses:
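In symbols, the kind of guarantee described here can be written roughly as follows (notation is mine, chosen for illustration; the paper's formal definition may differ in its details):

```latex
% A domain certificate at level epsilon: for every prompt x,
% adversarial or not, and every response y outside the target
% domain D, the certified system M assigns y at most epsilon mass.
\[
  \forall x,\; \forall y \notin \mathcal{D}:\qquad
  P_{M}(y \mid x) \;\le\; \varepsilon .
\]
```

The crucial feature is that the bound is uniform over prompts, which is what makes it an adversarial guarantee rather than an average-case one.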

(5/7)

1 year ago

We are tired of the cat 🐈 and mouse 🐁 game of attacks and defenses. Hence, we propose:
- **Domain Certification:** a framework for adversarial certification of LLMs.
- **VALID:** a simple, scalable, and effective test-time algorithm.
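As a rough paraphrase of how such a test-time filter can work: compare the deployed model's likelihood of a sampled response against a small in-domain guide model, and abstain when the ratio is too large. The sketch below is illustrative only; `logp_model` and `logp_guide` are hypothetical scoring functions, and this is not the authors' implementation of VALID.

```python
# Sketch of likelihood-ratio rejection at test time (a paraphrase of
# the idea, not the authors' code). `logp_model(x, y)` is the deployed
# LLM's log-likelihood of response y given prompt x; `logp_guide(y)` is
# a small in-domain guide model's prompt-independent log-likelihood of y.
def valid_filter(x, y, logp_model, logp_guide, log_k):
    # Accept y only if the deployed model does not assign it much more
    # mass than the in-domain guide. For any accepted y this enforces
    #   P_model(y | x) <= k * P_guide(y),
    # and since the right-hand side does not depend on x, the bound on
    # out-of-domain mass holds for *every* prompt, adversarial or not.
    if logp_model(x, y) - logp_guide(y) > log_k:
        return None          # abstain instead of answering
    return y


# Toy usage with hand-set scores (log_k = 2, i.e. k = e^2):
logp_model = lambda x, y: {"med": -1.0, "bomb": -3.0}[y]
logp_guide = lambda y: {"med": -2.0, "bomb": -30.0}[y]
print(valid_filter("prompt", "med", logp_model, logp_guide, 2.0))   # -> med
print(valid_filter("prompt", "bomb", logp_model, logp_guide, 2.0))  # -> None
```

The in-domain response passes because the guide also finds it plausible; the out-of-domain one is rejected because only the large model assigns it mass.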

(4/7)

1 year ago
Post image

Example: Can't afford GitHub Copilot? 💡 Use the Amazon Shopping App.

(3/7)

1 year ago
Post image

Consider an LLM deployed for a specific purpose, like a medical chatbot. Such a model should respond **only** to medical questions.

⚠️ Problem: LLMs are highly capable and can be coaxed into responding to **any** query: how to build a bomb, how to organize tax fraud, etc.

(2/7)

1 year ago
a man in a suit and tie is sitting at a desk in front of a computer screen that says founder of the office.

🚨 New paper alert: Our recent work on LLM safety has been accepted to ICLR 2025 🇸🇬

We propose a new framework for LLM safety. 🧵

(1/7)

#LLM #AISafety #ICLR2025 #Certification #AdversarialRobustness #NLP #Shhhhhh #DomainCertification #AI

1 year ago

🎉 I know I'm late to the party, but I'm super excited that I got 3/3 papers accepted at #ICLR2025, including 1 spotlight 🔎
- Shh, don't say that! Domain Certification in LLMs
- Towards Certification of Uncertainty Calibration under Adversarial Attacks
- Benchmarking Predictive Coding Networks
See you in Singapore 🇸🇬 ✈️

1 year ago
Shh, don't say that! Domain Certification in LLMs
Foundation language models, such as Llama, are often deployed in constrained environments. For instance, a customer support bot may utilize a large language model (LLM) as its backbone due to the...

The amazing collaborators: Preetham Arvind, @alasdair-p.bsky.social, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi.

A @oxfordtvg.bsky.social production.

(6/6)

Link to paper:
openreview.net/forum?id=brD...

1 year ago

Interested? Want to learn more?

Join us at the SoLaR workshop tomorrow.
- 🕚 When: Tomorrow, 14 Dec, from 11:00 to 13:00.
- 🗺️ Where: West meeting rooms 121 and 122, here in Vancouver.

(5/6)

1 year ago
Post image

Our method enables strong LLM performance while providing adversarial guarantees on out-of-domain behaviour.

(4/6)

1 year ago

We are tired of the 🐈 and 🐁 game of attacks and defenses. Hence, we propose:

- **Domain Certification:** a framework for adversarial certification of LLMs.
- **VALID:** a simple, scalable and efficient test-time algorithm.

(3/6)

1 year ago
Post image

It is well known that fine-tuned foundation models are adversarially vulnerable: they can be made to respond to questions they should not answer.

For instance: Can't afford ChatGPT Plus? Use a shopping app instead.

(2/6)

1 year ago

Are you scared users might misappropriate your LLM system? 😱

We were scared too! So we introduce adversarial certificates against the misuse of LLMs. 🤖

Come and see our poster at the SoLaR Workshop tomorrow.

#NeurIPS2024 #NeurIPS #AI #NLP #LLM #DomainCertification #Shhhhhhhh

1 year ago

Great work! You might find our SoLaR paper interesting: we propose a certification framework that keeps LLM systems on-topic and stops them from responding to such questions. openreview.net/pdf?id=brDLU...

1 year ago
A snow cat with the Radcliffe Camera behind

The Radcliffe Camera

The Fellows Garden

The first snow in Exeter College this morning ❄️

#ExeterCollegeOxford #OxfordUniversity #Snowing

1 year ago