What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?
Looking for practical methods for settings where human annotations are costly.
A few examples in the thread below.
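For anyone curious what "in a loop with human annotators" can look like, here is a minimal sketch of one such method: uncertainty-based active learning with a stand-in classifier. Everything below (dataset, model, batch sizes) is illustrative, not from any particular paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# An unlabeled pool; pretend each human label is expensive to obtain.
X, y_true = make_classification(n_samples=500, n_features=20, random_state=0)
labeled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)

for round_num in range(5):
    model.fit(X[labeled], y_true[labeled])

    # Score the pool by predictive uncertainty (entropy of class probabilities).
    probs = model.predict_proba(X[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    # Send only the most uncertain examples to the "annotator"
    # (simulated here by revealing the ground-truth label).
    query = [unlabeled[i] for i in np.argsort(entropy)[-10:]]
    labeled.extend(query)
    unlabeled = [i for i in unlabeled if i not in query]

    acc = model.score(X[unlabeled], y_true[unlabeled])
    print(f"round {round_num}: {len(labeled)} labels, pool accuracy {acc:.3f}")
```

The same loop structure works with an LM as the base model; the main design choice is the query strategy (entropy here, but disagreement between models or expected error reduction are common alternatives).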
I am once again pitching my romantic comedy:
- two academics start dating
- discover they are each other's terrible reviewer
- hijinks ensue
Working title: Love is Double-Blind
I'm extremely curious -- would you want digital tools that would help with this (e.g. planning, time organization) or embodied AI (e.g. physical assistance in-home, transportation)?
i wish i could shout this from the rooftops. relatedly, there's no need for robots to be limited by the human form.
similar/tangential thing came up in the 2010s with respect to self-driving: just because people only sense using their eyes doesn't mean cars have to only use cameras!
we are living in an empirical world and we are empirical girls
No labels, no problem! I am so excited for this release. We have been working on it for many months, and it's motivated by a common customer roadblock: insufficient labeled examples.
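For readers wondering what working around that roadblock can look like in practice, here is a rough, generic sketch: have an LLM draft weak labels that humans later spot-check. This illustrates the general pattern only, not the mechanism behind this particular release; the `complete` function below is a placeholder for whatever LLM call you have available.

```python
def llm_label(text: str, classes: list[str], complete) -> str:
    """Ask an LLM (via a caller-supplied `complete` function) to pick
    the best class for `text`. Falls back to the first class when the
    response matches no option; in practice, flag those for human review."""
    prompt = (
        "Classify the text into exactly one of these classes: "
        f"{', '.join(classes)}.\n\nText: {text}\nClass:"
    )
    answer = complete(prompt).strip().lower()
    for c in classes:
        if c.lower() in answer:
            return c
    return classes[0]

# Usage sketch with a stub standing in for a real LLM call.
if __name__ == "__main__":
    fake_complete = lambda prompt: "positive"
    print(llm_label("I loved this product!", ["positive", "negative"], fake_complete))
```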
has anyone successfully gotten very involved with their local library system and, if so, how does one do so?
i know there are volunteer opportunities and it is my dream to one day organize a crafting circle, but i'm talking about how the library actually organizes / functions / prioritizes things!
@jfrankle.com @ericajiyuen.bsky.social
and a big shout out to my collaborators: Erica Ji Yuen, Kartik Sreenivasan, Yue (Andy) Zhang, Sam Havens, Michael Carbin, Matei Zaharia, and Jonathan Frankle
3/3 Want to see how different models perform on enterprise tasks? Full analysis in the blog here: databricks.com/blog/benchma...!
DIBS measures real enterprise needs. We tested 14 models & found:
- Academic benchmarks mask enterprise gaps
- No single model wins across all tasks
- Open models are competitive on key capabilities
- Some enterprise tasks show clear paths forward, others are more complex
2/3
🧵 Super proud to finally share this work I led last quarter - the
@databricks.bsky.social Domain Intelligence Benchmark Suite (DIBS)! TL;DR: Academic benchmarks ≠ real performance, and domain intelligence > general capabilities for enterprise tasks. 1/3
very demure, very mindful, very 2019-era mujoco humanoid learning to walk
"technology built to address people's needs" is the north star.
side note: it would be amazing to see this attitude in the physical, embodied world as well. it's striking how different life looks for older adults in dense, walkable areas compared to those in car-centric suburbs.
would love to be added :-)
brat tulu is amazing
this is incredible research, and beautiful. would love to know more about what it's like to meaningfully interact with genie 2, or similar models, e.g. to modify the outputs of such a model in the service of a design vision.
i know some labs are already starting to do this; i hope more continue to. it is challenging, complex technical work and we should think of it as a first-class contribution in the field. 5/5
we can start to more broadly value thoughtful, direction-setting benchmark work. it requires technical contributions, a keen sense of how people might meaningfully interact with a system, and the discernment to recognize where progress might yet be made. 4/5
i think as a field, we have a problematic tendency to focus on magnitude-related problems, like new architectures or training paradigms or other ways to maximize performance on whatever benchmarks we can. maybe this is because it is more akin to the training/experience many of us have. 3/5
in the LLM space, at this time, benchmarks/evaluations set the direction of that vector. it's extremely hard to make good benchmarks, and historically under-rewarded in the field. 2/5
i often talk about the importance of aligning both the magnitude AND direction of a workstream vector. 1/5
i do not study this, but i did just finish reading the anxious generation and so i'm very grateful that there are so many people who do indeed study such important things!