What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?
Looking for practical methods for settings where human annotations are costly.
A few examples in the thread below.
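For anyone curious what "in a loop with human annotators" can look like, here is a minimal sketch of one such method: uncertainty-based active learning with a stand-in classifier. Everything below (dataset, model, batch sizes) is illustrative, not from any particular paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# An unlabeled pool; pretend each human label is expensive to obtain.
X, y_true = make_classification(n_samples=500, n_features=20, random_state=0)
labeled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)

for round_num in range(5):
    model.fit(X[labeled], y_true[labeled])

    # Score the pool by predictive uncertainty (entropy of class probabilities).
    probs = model.predict_proba(X[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    # Send only the most uncertain examples to the "annotator"
    # (simulated here by revealing the ground-truth label).
    query = [unlabeled[i] for i in np.argsort(entropy)[-10:]]
    labeled.extend(query)
    unlabeled = [i for i in unlabeled if i not in query]

    acc = model.score(X[unlabeled], y_true[unlabeled])
    print(f"round {round_num}: {len(labeled)} labels, pool accuracy {acc:.3f}")
```

The same loop structure works with an LM as the base model; the main design choice is the query strategy (entropy here, but disagreement between models or expected error reduction are common alternatives).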
I am once again pitching my romantic comedy:
- two academics start dating
- discover they are each other's terrible reviewer
- hijinks ensue
Working title: Love is Double-Blind
I'm extremely curious -- would you want digital tools that would help with this (e.g. planning, time organization) or embodied AI (e.g. physical assistance in-home, transportation)?
i wish i could shout this from the rooftops. relatedly, there's no need for robots to be limited by the human form.
similar/tangential thing came up in the 2010s with respect to self-driving: just because people only sense using their eyes doesn't mean cars have to only use cameras!
we are living in an empirical world and we are empirical girls
No labels, no problem! I am so excited for this release. We have been working on it for many months, and it's motivated by a common customer roadblock: insufficient labeled examples.
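For readers wondering what working around that roadblock can look like in practice, here is a rough, generic sketch: have an LLM draft weak labels that humans later spot-check. This illustrates the general pattern only, not the mechanism behind this particular release; the `complete` function below is a placeholder for whatever LLM call you have available.

```python
def llm_label(text: str, classes: list[str], complete) -> str:
    """Ask an LLM (via a caller-supplied `complete` function) to pick
    the best class for `text`. Falls back to the first class when the
    response matches no option; in practice, flag those for human review."""
    prompt = (
        "Classify the text into exactly one of these classes: "
        f"{', '.join(classes)}.\n\nText: {text}\nClass:"
    )
    answer = complete(prompt).strip().lower()
    for c in classes:
        if c.lower() in answer:
            return c
    return classes[0]

# Usage sketch with a stub standing in for a real LLM call.
if __name__ == "__main__":
    fake_complete = lambda prompt: "positive"
    print(llm_label("I loved this product!", ["positive", "negative"], fake_complete))
```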
has anyone successfully gotten very involved with their local library system and, if so, how does one do so?
i know there are volunteer opportunities and it is my dream to one day organize a crafting circle, but i'm talking about how the library actually organizes / functions / prioritizes things!
@jfrankle.com @ericajiyuen.bsky.social
and a big shout out to my collaborators: Erica Ji Yuen, Kartik Sreenivasan, Yue (Andy) Zhang, Sam Havens, Michael Carbin, Matei Zaharia, and Jonathan Frankle
3/3 Want to see how different models perform on enterprise tasks? Full analysis in the blog here: databricks.com/blog/benchma...!
DIBS measures real enterprise needs. We tested 14 models & found:
- Academic benchmarks mask enterprise gaps
- No single model wins across all tasks
- Open models are competitive on key capabilities
- Some enterprise tasks show clear paths forward, others are more complex
2/3
🧵 Super proud to finally share this work I led last quarter - the
@databricks.bsky.social Domain Intelligence Benchmark Suite (DIBS)! TL;DR: Academic benchmarks ≠ real performance, and domain intelligence > general capabilities for enterprise tasks. 1/3
very demure, very mindful, very 2019-era mujoco humanoid learning to walk
"technology built to address people's needs" is the north star.
side note: it would be amazing to see this attitude in the physical, embodied world as well. it's striking how different life looks for older adults in dense, walkable areas compared to those in car-centric suburbs.
would love to be added :-)
brat tulu is amazing
this is incredible research, and beautiful. would love to know more about what it's like to meaningfully interact with genie 2, or similar models, e.g. to modify the outputs of such a model in the service of a design vision.
i know some labs are already starting to do this; i hope more continue to. it is challenging, complex technical work and we should think of it as a first-class contribution in the field. 5/5
we can start to more broadly value thoughtful, direction-setting benchmark work. it requires technical contributions, a keen sense of how people might meaningfully interact with a system, and the discernment to recognize where progress might yet be made. 4/5
i think as a field, we have a problematic tendency to focus on magnitude-related problems, like new architectures or training paradigms or other ways to maximize performance on whatever benchmarks we can. maybe this is because it is more akin to the training/experience many of us have. 3/5
in the LLM space, at this time, benchmarks/evaluations set the direction of that vector. it's extremely hard to make good benchmarks, and historically under-rewarded in the field. 2/5
i often talk about the importance of aligning both the magnitude AND direction of a workstream vector. 1/5
i do not study this, but i did just finish reading the anxious generation and so i'm very grateful that there are so many people who do indeed study such important things!