If you care about rigorous evaluation of agentic systems, take a look at MASEval!
The harness is an important element of agents. MASEval makes it straightforward to change its components and evaluate their impact.
MASEval is our first software release! parameterlab.github.io/MASEval/
⬇️
Posts by Parameter Lab
‼️New paper from Parameter Lab!
⛓️💥 We identify privacy collapse, a silent failure mode of LLMs: LLMs fine-tuned on seemingly benign data can lose their ability to respect contextual privacy norms.
Done by @anmolgoel.bsky.social during his internship!
Check it out 👇
👏 Proud to share that the paper that Ahmed Heakl authored during his internship at Parameter Lab was accepted at #ICLR2026!
See how 🩺Dr.LLM increases the accuracy and reduces the inference computation of frozen LLMs: www.linkedin.com/posts/ahmed-...
Our #EMNLP2025 paper Leaky Thoughts 🫗 shows that Large Reasoning Models (LRMs) can unintentionally leak sensitive information hidden in their internal thoughts.
📍 Come chat with Tommaso at our poster on Friday 7th, 10:30–12:00 in Hall C3
📄 aclanthology.org/2025.emnlp-m...
We challenge the view that reasoning traces are a safe internal part of a model’s process. Our work shows they can leak information, through both deliberate attacks and accidental leakage.
RTAI: researchtrend.ai/papers/2506....
ArXiv: arxiv.org/abs/2506.15674
Code: github.com/parameterlab...
2/2
Overall diagram about contextual privacy & LRMs
🫗 An LLM's "private" reasoning may leak your sensitive data!
🎉 Excited to share that our paper "Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers" was accepted at #EMNLP main!
1/2
Work done with: Haritz Puerto, Martin Gubri @mgubri.bsky.social, Tommaso Green, Sangdoo Yun and Seong Joon Oh @coallaoh.bsky.social
#SEO #AI #LLM #GenerativeAI #Marketing #DigitalMarketing #Perplexity #NLProc
Key takeaways:
❌ C-SEO doesn’t help improve visibility in AI answers.
🔎 Traditional SEO is your tool for online visibility.
🚀 Our benchmark sets the stage to develop C-SEO methods that might work in the future.
🔎 The results are clear: current C-SEO strategies don’t work. This challenges the recent hype and suggests that creators don’t need to game LLMs or produce even more clickbait. Just focus on producing genuinely good content and let traditional SEO do its work.
C-SEO Bench evaluates Conversational Search Engine Optimization (C-SEO) techniques on two key tasks:
🔍 Product Recommendation
❓ Question Answering
Spanning multiple domains, it tests both domain-specific performance and the generalization of C-SEO methods.
Illustration of a conversational search engine for product recommendation. After applying a C-SEO method on the third document, its ranking gets boosted by +2 positions.
💥 With the rise of conversational search, a new technique, "Conversational SEO" (C-SEO), has emerged, claiming to boost content inclusion in AI-generated answers. We put these claims to the test by building C-SEO Bench, the first comprehensive benchmark to rigorously evaluate these new strategies.
🔎Does Conversational SEO actually work? Our new benchmark has an answer!
Excited to announce our new paper: C-SEO Bench: Does Conversational SEO Work?
🌐 RTAI: researchtrend.ai/papers/2506....
📄 Paper: arxiv.org/abs/2506.11097
💻 Code: github.com/parameterlab...
📊 Data: huggingface.co/datasets/par...
Excited to share that our paper "Scaling Up Membership Inference: When and How Attacks Succeed on LLMs" will be presented next week at #NAACL2025!
🖼️ Catch us at Poster Session 8 - APP: NLP Applications
🗓️ May 2, 11:00 AM - 12:30 PM
🗺️ Hall 3
Hope to see you there!
Ready to Join? Send your resume + a short note on why you’re a great fit to recruit@parameterlab.de.
Be part of a team that’s redefining research with AI! #Hiring #DataEngineer #AI #RemoteJobs
Why Join Us?
🚀 Make a Difference – Your work directly enhances how research is shared and discovered.
🌍 Flexibility – Choose full-time or part-time, work remotely or locally.
⚡ Innovative Environment – AI, research, and data-driven solutions all in one place.
🤝 Great Team
What You Bring:
✅ Proficiency in Airflow & PostgreSQL – Complex workflows and databases.
✅ Strong Python Skills – Clean, efficient, and maintainable code is your thing.
✅ (Bonus) Experience with LLMs – A huge plus as we integrate AI-driven solutions.
✅ Problem-Solving Mindset
✅ Team Spirit
What You’ll Do:
✔ Build Scalable Data Pipelines – Design and optimize workflows using tools like Airflow.
✔ Work Closely with AI Experts & Engineers – Collaborate to solve real-world data challenges.
✔ Optimize and Maintain Systems – Keep our data infrastructure fast, secure, and adaptable.
Our LLM-powered ecosystem also bridges the gap between cutting-edge research and industry leaders. If you're passionate about data, AI, and making an impact, we’d love to have you on board!
👥 We're Hiring: Senior/Junior Data Engineer!
📍 Remote or Local | Full-Time or Part-Time
At ResearchTrend.AI, we’re building a platform that connects researchers and AI engineers worldwide—helping them stay ahead with daily digests, insightful summaries, and interactive events.
🔎 Wonder how to prove an LLM was trained on a specific text? The camera-ready version of our Findings of #NAACL 2025 paper is available!
📌 TLDR: long texts are needed to gather enough evidence to determine whether specific data points were included in the training of LLMs: arxiv.org/abs/2411.00154
We are delighted to announce that our research paper on the scale of LLM membership inference has been accepted for publication in the Findings of #NAACL2025! 🎉
There's an internship opening at @parameterlab.bsky.social: parameterlab.de/careers
The research outputs have been quite successful so far: researchtrend.ai/organization...
🎉We’re pleased to share the release of the models from our Apricot🍑 paper, accepted at ACL 2024!
At Parameter Lab, we believe openness and reproducibility are essential for advancing science, and we've done our best to ensure both.
🤗 huggingface.co/collections/...
🧵 bsky.app/profile/dnns...
🔗 Links: Code and results: https://github.com/parameterlab/mia-scaling | Project website: https://haritzpuerto.github.io/scaling-mia/ | Paper: https://arxiv.org/pdf/2411.00154
🙌 Team Credits: This research was conducted by Haritz Puerto, @mgubri.bsky.social, @oodgnas.bsky.social, and @coallaoh.bsky.social, with support from NAVER AI Lab. Stay tuned for more updates! 🚀
🤓 Want More? Check out the community page for MIA on LLMs at ResearchTrend.AI: https://researchtrend.ai/communities/MIALM You can see related works, the evolution of the community, and top authors!
💬 What Do You Think? Could MIA reach a level where data owners use it as legal evidence? How might this affect LLM deployment? Let us know! #AI #LLM #NLProc
🌐 Implications for Data Privacy: Our findings have real-world relevance for data owners worried about unauthorized use of their content in model training. MIA can also be used to support accountability in LLM evaluation on end tasks.
🔎 Better Results in Fine-Tuning: Fine-tuned models show even stronger MIA results. The table shows the performance at sentence level and for collections of 20 sentences, evaluated on Phi-2 fine-tuned for QA (https://huggingface.co/haritzpuerto/phi-2-dcot).
🔬 Our Testing Setup: We ran experiments using Pythia models (2.8B and 6.9B parameters) with training samples from The Pile dataset, comparing them to validation and test sets. This setup avoids data leakage to ensure a reliable evaluation of MIA.