Initial tests with Claude Haiku show promise - it successfully distinguished between stronger and weaker models. Next step: building an evaluator with FastHTML to grade 450 completions (150 prompts × 3 models). Stay tuned!
Posts by vishal
My scoring system uses 0 (failure), 0.5 (partial success), or 1.0 (success) points per criterion for each category. Different categories have different numbers of criteria: Grammar (5), Creativity (4), Plot (4), Context-tracking (3), and Factual/Reasoning (1 each), so I'll need to normalize scores across categories.
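A minimal sketch of how that normalization could work (function and dict names are my own, not from the project repo): divide each category's total by its maximum possible score so every category lands in [0, 1].

```python
# Per-criterion scores: 0 (failure), 0.5 (partial), 1.0 (success).
# Categories have different criterion counts, so normalize each
# category total by its max possible score to make them comparable.

CRITERIA_PER_CATEGORY = {
    "grammar": 5,
    "creativity": 4,
    "plot": 4,
    "context_tracking": 3,
    "factual": 1,
    "reasoning": 1,
}

def normalized_score(category: str, criterion_scores: list[float]) -> float:
    n = CRITERIA_PER_CATEGORY[category]
    assert len(criterion_scores) == n, "one score per criterion"
    assert all(s in (0.0, 0.5, 1.0) for s in criterion_scores)
    return sum(criterion_scores) / n
```

E.g. grammar scores of [1, 1, 0.5, 0, 1] normalize to 3.5 / 5 = 0.7, directly comparable to a single-criterion factual score.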
Good creativity/plot prompts provide opportunities w/o sacrificing consistency. Example: "Once upon a time, there was a tiger who liked to play the guitar" offers fertile ground for creativity without losing coherence, while others might force models to choose between them.
When analyzing the 44 TinyStories eval prompts, I discovered factual prompts were the easiest to isolate, context-tracking prompts were a dime a dozen, and reasoning prompts were hard to distinguish from context tracking. This led me to curate category-specific prompts.
Video Walkthrough: youtube.com/watch?v=k5J1...
Repo: github.com/vishalbakshi...
Designing robust LM evals means taking something squishy like language and building structure around it. For TinyScaleLab, I create separate prompt sets for factual knowledge, reasoning, context tracking, creativity, and plot. 150 prompts total. Each category needs unique prompts imo 🧵
(beyond) cool to see scaling laws in action
After 1 epoch each (1.03B tokens):
llama-5m-L4: 4.6367
llama-25m-L4: 1.6456
llama-60m-A100: 1.286
llama-125m-A100: 1.1558
Project goals: Study both training dynamics of tiny models and their language capabilities (grammar, context tracking, factual knowledge, reasoning, creativity, and plot construction). Looking forward to sharing more progress soon!
/end
Next steps: Pausing training to build evaluation infrastructure. Will use Gemini 2.5 Flash or Claude 3.5 Haiku as LLM judges (on TinyStories-1M/8M/28M/33M-generated stories), comparing against manual evaluation to refine scoring prompts for six capability categories.
Details on the architectures I'm using: the LlamaConfigs are shared in the blog post above. I'm loosely referencing the official TinyStories models (intermediate dim = 4 x hidden dim). Intentionally undershooting the named model sizes.
Cost analysis: L4 GPU is more efficient for 5M model (~$0.20/epoch), while A100 is better for larger models. 125M model costs ~$0.84/epoch. This gives me a baseline to plan my budget for longer training runs.
Video: youtube.com/watch?v=o_c3...
Blog post: vishalbakshi.github.io/blog/posts/2...
TinyScaleLab update: Completed initial training runs to estimate costs. Using 4 model sizes: 5M, 25M, 60M, and 125M parameters. Training on TinyStories dataset (4.9M stories, ~1B tokens) with Llama2 tokenizer (32k vocab).
Here's the blog post version of my TinyScale Lab research project kickoff video!
vishalbakshi.github.io/blog/posts/2...
Blog post written by Claude, with a few minor formatting edits and one rephrasing edit done by me (prompt attached). It reused most phrasing verbatim from my video transcript + slides.
P.S. if you are unfamiliar, here are my main takeaways from the TinyStories (Eldan/Li) and Small-scale proxies (Wortsman, et al) papers. Really incredibly inspiring work. I am giddy to jump into this project. LFG!!!
I'll end with one of my favorite quotes. I am standing on the shoulders of giants to even consider taking on this research project!
Here's a recap of my presentation, highlighting my goals for the TinyScale Lab research project!
My project timeline consists of 4 phases. I expect this project to take 8-12 months (which means it will probably take two years 😅). First order of business is building the eval and logging setup, and then running initial training runs. Phase 2 involves core experimentation!
My rough back-of-the-envelope budget is $2000. I'll closely monitor spending each week. If it consistently trends toward that limit, I'll have to seriously consider building my own GPU rig. But time will tell!
Following fastai principles, I'll build in public: sharing code, models, datasets, weekly updates, and interactive visualizations. If this work saves someone time, money, or gives them insight, that would be truly the best reward.
What excites me most: I cannot wait to see some of these capabilities emerge with model size or training steps - watching grammar emerge first, then consistency, and finally creativity, just as the TinyStories paper observed.
My plan: extensive logging of training dynamics + evaluating capabilities with LLM judge scoring. I'll train 100+ model variations across different learning rates and stability techniques (QK-layernorm and z-loss).
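Since z-loss comes up above: here's a minimal sketch of the z-loss regularizer as I understand it from the literature (the 1e-4 coefficient is an assumption, not a project setting). It penalizes (log Z)^2, where Z is the softmax normalizer, to keep output logits from drifting.

```python
import numpy as np

def z_loss(logits: np.ndarray, coeff: float = 1e-4) -> float:
    """z-loss: coeff * mean((log Z)^2), where Z = sum(exp(logits))
    over the vocab dimension. Keeps the softmax normalizer near 1."""
    # Numerically stable log-sum-exp over the last axis.
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coeff * float((log_z ** 2).mean())
```

In practice this term is added to the cross-entropy loss during training; here it's standalone just to show the computation.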
Making ML research accessible to resource-constrained environments isn't trivial - it's essential for the field's diversity and progress! I'm using modest computational resources (but substantial for me) to conduct what I think is meaningful research.
I believe this approach—using tiny models as proxies to study phenomena relevant to models of all sizes—represents an underexplored path that could benefit other resource-constrained researchers. I think this is how most of the world's potential researchers would need to work.
My hypothesis: training stability directly affects specific model capabilities in predictable ways. I'll train models from 3M to 120M params, analyzing how logits, gradients, parameters, and loss relate to capabilities like grammar, consistency, and reasoning.
Here's the 26 minute project kickoff presentation video:
www.youtube.com/watch?v=82mE...
Excited to kick off a new research project: TinyScale Lab! I'm researching the connection b/w training dynamics and model capabilities in tiny LMs. Following the work of the TinyStories (Eldan & Li) and Small-scale proxies (Wortsman, et al) papers, and building a bridge between them🧵
This research, along with TinyStories, has reignited my passion for tiny models! Tomorrow I'm launching my new "TinyScale Lab" research project where I'll apply these insights (and more!). Stay tuned - it'll be open source, fully documented and researched in public!
/end
Most exciting finding: the authors can actually predict when large models will become unstable by extrapolating from small models. Using an instability threshold of max attention logits > 10^4, they validated their predictions on 5B parameter models.
LR sensitivity is the key metric (along with LR/loss curves): it measures how much loss worsens when you move away from the optimal LR. Lower is better, since it gives flexibility in LR choice: you don't have to pick the perfect LR, any reasonable one will give you good enough eval loss.
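A minimal sketch of computing that from a sweep (my own simplification of the paper's definition): given eval losses across a range of learning rates, take the mean excess loss over the best-LR loss. Lower means a flatter LR/loss curve and a more forgiving LR choice.

```python
def lr_sensitivity(losses_by_lr: dict[float, float]) -> float:
    """Mean excess eval loss over the best (optimal-LR) loss
    across a learning-rate sweep. Lower = less LR-sensitive."""
    best = min(losses_by_lr.values())
    excesses = [loss - best for loss in losses_by_lr.values()]
    return sum(excesses) / len(excesses)
```

For example, a sweep of {1e-3: 2.0, 3e-3: 1.8, 1e-2: 2.2} gives excesses of 0.2, 0.0, and 0.4, so a sensitivity of 0.2.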