
Posts by Giwon Hong

MMLU-Redux Poster at NAACL 2025

MMLU-Redux just touched down at #NAACL2025! 🎉
Wish I could be there for our "Are We Done with MMLU?" poster today (9:00-10:30am in Hall 3, Poster Session 7), but visa drama said nope 😅
If anyone's swinging by, give our research some love! Hit me up if you check it out! 👋

11 months ago
Image illustrating that ALM can enable Ensembling, Transfer to Bytes, and general Cross-Tokenizer Distillation.

We created Approximate Likelihood Matching (ALM), a principled (and very effective) method for *cross-tokenizer distillation*!

With ALM, you can create ensembles of models from different families, convert existing subword-level models to byte-level, and a bunch more! 🧵

1 year ago
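For intuition only, here is a minimal sketch of what likelihood matching across tokenizers could look like: both models score the *same* text, each under its own tokenizer, so their sequence-level log-likelihoods are directly comparable even though token boundaries differ. This is not the paper's actual objective; `teacher_seq_logprob` and `student_seq_logprob` are hypothetical helpers returning the total log p(text) as a scalar tensor.

```python
import torch

def alm_style_loss(text, teacher_seq_logprob, student_seq_logprob):
    # Frozen teacher scores the text under its own tokenizer.
    with torch.no_grad():
        lp_teacher = teacher_seq_logprob(text)
    # Trainable student scores the same text under a different tokenizer.
    lp_student = student_seq_logprob(text)
    # Penalise the gap between the two sequence-level log-likelihoods
    # (the real ALM objective is defined in the paper).
    return (lp_student - lp_teacher) ** 2
```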

Joining the Generative AI Lab (GAIL, gail.ed.ac.uk) at the University of Edinburgh as a GAIL Fellow! Excited for what's ahead 🤗

1 year ago
Mixtures of In-Context Learners
In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it does not differentiate between demonstrations and quadratically increases the co...

Work done with: @pminervini.bsky.social @edoardo-ponti.bsky.social @emilevankrieken.com and Nikolay Malkin

Paper: arxiv.org/abs/2411.02830
(🧵8/n)

1 year ago

๐Ÿ” Conclusion: ๐— ๐—ผ๐—œ๐—–๐—Ÿ offers a robust, efficient approach for combining demonstrations (experts), significantly boosting accuracy over baselines. ๐— ๐—ผ๐—œ๐—–๐—Ÿ is also resilient to low-quality demonstrations and achieves improved data and computational efficiency. (๐Ÿงต7/n)

1 year ago

โš™๏ธ Data and Compute Efficiency of ๐— ๐—ผ๐—œ๐—–๐—Ÿ: We find that ๐— ๐—ผ๐—œ๐—–๐—Ÿ is more efficient in terms of data and computation compared to conventional (concat-based) ICL! (๐Ÿงต6/n)

1 year ago

📉 Noisy and Imbalanced Demonstrations: By assigning weights to each demonstration subset, MoICL can effectively handle practical settings where data quality varies. (🧵5/n)

1 year ago

๐ŸŒGeneralization to Unseen Demonstrations: ๐™จ๐™˜๐™–๐™ก๐™–๐™ง weights require predefined demonstration subsets.
Using ๐™ƒ๐™ฎ๐™ฅ๐™š๐™ง-๐™ฃ๐™š๐™ฉ๐™ฌ๐™ค๐™ง๐™ โ€”a smaller fine-tuned hyper-network that dynamically generates weights for each expert based on all concatenated demonstration subsets. (๐Ÿงต4/n)

1 year ago
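A minimal sketch of the hyper-network idea above, under our own assumptions: a tiny trainable MLP scores an embedding of each (possibly unseen) demonstration subset. Here `subset_embeddings` would come from some hypothetical frozen encoder of the concatenated demonstrations; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class HyperWeighter(nn.Module):
    """Maps an embedding of each demonstration subset to a weight logit,
    so subsets unseen at training time can still be weighted."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1)
        )

    def forward(self, subset_embeddings):                 # (k, dim)
        logits = self.mlp(subset_embeddings).squeeze(-1)  # (k,)
        return torch.softmax(logits, dim=-1)              # mixture weights
```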

๐Ÿ“Š ๐— ๐—ผ๐—œ๐—–๐—Ÿ in Classification Tasks: ๐— ๐—ผ๐—œ๐—–๐—Ÿ outperformed Baseline ICL on 5 out of 7 datasets!
Using ๐™จ๐™˜๐™–๐™ก๐™–๐™ง weightsโ€”a vector of trainable parameters that assign each expert a weightโ€”we fine-tuned how demonstration subsets are combined. (๐Ÿงต3/n)

1 year ago
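To make the scalar-weight variant concrete, a hedged sketch of one training step: only the k weights are learned, by minimising the negative log-likelihood of the gold label under the mixture. `expert_probs` would be the (k, vocab) next-token distributions from the frozen LLM, one per demonstration subset (see the forward-pass sketch below).

```python
import torch

k = 4                                    # number of experts (illustrative)
w = torch.zeros(k, requires_grad=True)   # the only trainable parameters
opt = torch.optim.Adam([w], lr=1e-2)

def train_step(expert_probs, gold_token_id):
    # Mix the experts' next-token distributions with softmax-normalised weights.
    mixture = torch.softmax(w, dim=0) @ expert_probs   # (vocab,)
    # Negative log-likelihood of the gold label under the mixture.
    loss = -torch.log(mixture[gold_token_id] + 1e-9)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```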

🚀 How does MoICL improve In-Context Learning? MoICL prompts an LLM with multiple demonstration subsets, obtaining multiple experts, and merges their predictions via a trainable weighting function; it doesn't require any fine-tuning of the LLM parameters! (🧵2/n)

1 year ago
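A minimal sketch of this forward pass, assuming a hypothetical `llm_next_token_probs(prompt) -> (vocab,)` wrapper around a frozen LLM (the prompt formatting here is illustrative, not the paper's):

```python
import torch

def moicl_predict(subsets, query, llm_next_token_probs, w):
    # One expert per demonstration subset: the same frozen LLM prompted
    # with that subset followed by the query.
    expert_probs = torch.stack([
        llm_next_token_probs("\n\n".join(list(subset) + [query]))
        for subset in subsets
    ])                                              # (k, vocab)
    # Merge the expert predictions with the trainable weights; no LLM
    # parameters are ever updated.
    return torch.softmax(w, dim=0) @ expert_probs   # (vocab,)
```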

🤔 How to achieve efficient ICL without storing a huge dataset in one prompt?
💡 Mixtures of In-Context Learners (MoICL): we treat LLMs prompted with subsets of demonstrations as experts and learn a weighting function to optimise the distribution over the continuation. (🧵1/n)

1 year ago
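Summarising the thread in one line (our notation, not necessarily the paper's): the MoICL output is a weighted mixture of the same frozen LLM's predictions under K demonstration subsets D_k, with only the weights w trained.

```latex
p(y \mid x) = \sum_{k=1}^{K} \mathrm{softmax}(w)_k \; p_{\mathrm{LLM}}(y \mid D_k, x)
```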

I'll be travelling to London from Wednesday to Friday for an upcoming event and would be very happy to meet up! 🚀
I'd love to chat about my recent work (DeCoRe, MMLU-Redux, etc.). DM me if you're around! 👋

DeCoRe: arxiv.org/abs/2410.18860
MMLU-Redux: arxiv.org/abs/2406.04127

1 year ago

I would love to be added as well!

1 year ago