Initial tests with Claude Haiku show promise - it successfully distinguished between stronger and weaker models. Next step: building an evaluator with FastHTML to grade 450 completions (150 prompts × 3 models). Stay tuned!
Posts by vishal
My scoring system uses 0 (failure), 0.5 (partial success), or 1.0 (success) points per criterion for each category. Different categories have different numbers of criteria: Grammar (5), Creativity (4), Plot (4), Context-tracking (3), and Factual/Reasoning (1 each), so I'll need to normalize scores across categories.
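A minimal sketch of how that normalization could work (function and dict names are my own, not from the project repo): divide each category's total by its maximum possible score so every category lands in [0, 1].

```python
# Per-criterion scores: 0 (failure), 0.5 (partial), 1.0 (success).
# Categories have different criterion counts, so normalize each
# category total by its max possible score to make them comparable.

CRITERIA_PER_CATEGORY = {
    "grammar": 5,
    "creativity": 4,
    "plot": 4,
    "context_tracking": 3,
    "factual": 1,
    "reasoning": 1,
}

def normalized_score(category: str, criterion_scores: list[float]) -> float:
    n = CRITERIA_PER_CATEGORY[category]
    assert len(criterion_scores) == n, "one score per criterion"
    assert all(s in (0.0, 0.5, 1.0) for s in criterion_scores)
    return sum(criterion_scores) / n
```

E.g. grammar scores of [1, 1, 0.5, 0, 1] normalize to 3.5 / 5 = 0.7, directly comparable to a single-criterion factual score.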
Good creativity/plot prompts provide opportunities w/o sacrificing consistency. Example: "Once upon a time, there was a tiger who liked to play the guitar" offers fertile ground for creativity without losing coherence, while others might force models to choose between them.
When analyzing the 44 TinyStories eval prompts, I discovered factual prompts were the easiest to isolate, context-tracking prompts were a dime a dozen, and reasoning prompts were hard to distinguish from context tracking. This led me to curate category-specific prompts.
Video Walkthrough: youtube.com/watch?v=k5J1...
Repo: github.com/vishalbakshi...
Designing robust LM evals means taking something squishy like language and building structure around it. For TinyScaleLab, I create separate prompt sets for factual knowledge, reasoning, context tracking, creativity, and plot. 150 prompts total. Each category needs unique prompts imo 🧵
(beyond) cool to see scaling laws in action
After 1 epoch each (1.03B tokens):
llama-5m-L4: 4.6367
llama-25m-L4: 1.6456
llama-60m-A100: 1.286
llama-125m-A100: 1.1558
Project goals: Study both training dynamics of tiny models and their language capabilities (grammar, context tracking, factual knowledge, reasoning, creativity, and plot construction). Looking forward to sharing more progress soon!
/end
Next steps: Pausing training to build evaluation infrastructure. Will use Gemini 2.5 Flash or Claude 3.5 Haiku as LLM judges (on TinyStories-1M/8M/28M/33M-generated stories), comparing against manual evaluation to refine scoring prompts for six capability categories.
Details on the architectures I'm using: the LlamaConfigs are shared in the blog post above. I'm loosely referencing the official TinyStories models (intermediate dim = 4 x hidden dim). Intentionally undershooting the named model sizes.
Cost analysis: L4 GPU is more efficient for 5M model (~$0.20/epoch), while A100 is better for larger models. 125M model costs ~$0.84/epoch. This gives me a baseline to plan my budget for longer training runs.
Video: youtube.com/watch?v=o_c3...
Blog post: vishalbakshi.github.io/blog/posts/2...
TinyScaleLab update: Completed initial training runs to estimate costs. Using 4 model sizes: 5M, 25M, 60M, and 125M parameters. Training on TinyStories dataset (4.9M stories, ~1B tokens) with Llama2 tokenizer (32k vocab).
Here's the blog post version of my TinyScale Lab research project kickoff video!
vishalbakshi.github.io/blog/posts/2...
Blog post written by Claude, with a few minor formatting edits and one rephrasing edit done by me (prompt attached). It reused most phrasing verbatim from my video transcript + slides.
P.S. if you are unfamiliar, here are my main takeaways from the TinyStories (Eldan/Li) and Small-scale proxies (Wortsman, et al) papers. Really incredibly inspiring work. I am giddy to jump into this project. LFG!!!
I'll end with one of my favorite quotes. I am standing on the shoulders of giants to even consider taking on this research project!
Here's a recap of my presentation, highlighting my goals for the TinyScale Lab research project!
My project timeline consists of 4 phases. I expect this project to take 8-12 months (which means it will probably take two years 😅). First order of business is building the eval and logging setup, and then running initial training runs. Phase 2 involves core experimentation!
My rough back-of-the-envelope budget is $2000. I'll closely monitor spending each week. If it consistently trends toward that limit, I'll have to seriously consider building my own GPU rig. But time will tell!
Following fastai principles, I'll build in public: sharing code, models, datasets, weekly updates, and interactive visualizations. If this work saves someone time, money, or gives them insight, that would be truly the best reward.
What excites me most: I cannot wait to see some of these capabilities emerge with model size or training steps - watching grammar emerge first, then consistency, and finally creativity, just as the TinyStories paper observed.
My plan: extensive logging of training dynamics + evaluating capabilities with LLM judge scoring. I'll train 100+ model variations across different learning rates and stability techniques (QK-layernorm and z-loss).
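Since z-loss comes up above: here's a minimal sketch of the z-loss regularizer as I understand it from the literature (the 1e-4 coefficient is an assumption, not a project setting). It penalizes (log Z)^2, where Z is the softmax normalizer, to keep output logits from drifting.

```python
import numpy as np

def z_loss(logits: np.ndarray, coeff: float = 1e-4) -> float:
    """z-loss: coeff * mean((log Z)^2), where Z = sum(exp(logits))
    over the vocab dimension. Keeps the softmax normalizer near 1."""
    # Numerically stable log-sum-exp over the last axis.
    m = logits.max(axis=-1, keepdims=True)
    log_z = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return coeff * float((log_z ** 2).mean())
```

In practice this term is added to the cross-entropy loss during training; here it's standalone just to show the computation.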
Making ML research accessible to resource-constrained environments isn't trivial - it's essential for the field's diversity and progress! I'm using modest computational resources (but substantial for me) to conduct what I think is meaningful research.
I believe this approach—using tiny models as proxies to study phenomena relevant to models of all sizes—represents an underexplored path that could benefit other resource-constrained researchers. I think this is how most of the world's potential researchers would need to work.
My hypothesis: training stability directly affects specific model capabilities in predictable ways. I'll train models from 3M to 120M params, analyzing how logits, gradients, parameters, and loss relate to capabilities like grammar, consistency, and reasoning.
Here's the 26 minute project kickoff presentation video:
www.youtube.com/watch?v=82mE...
Excited to kick off a new research project: TinyScale Lab! I'm researching the connection b/w training dynamics and model capabilities in tiny LMs. Following the work of the TinyStories (Eldan & Li) and Small-scale proxies (Wortsman, et al) papers, and building a bridge between them🧵
This research, along with TinyStories, has reignited my passion for tiny models! Tomorrow I'm launching my new "TinyScale Lab" research project where I'll apply these insights (and more!). Stay tuned - it'll be open source, fully documented and researched in public!
/end
Most exciting finding: the authors can actually predict when large models will become unstable by extrapolating from small models. Using an instability threshold of max attention logits > 10^4, they validated their predictions on 5B parameter models.
LR sensitivity is the key metric (along with LR/loss curves): it measures how much loss worsens when you move away from the optimal LR. Lower is better, since it gives flexibility in LR choice: you don't have to pick the perfect LR, any reasonable one will give you good enough eval loss.
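A minimal sketch of computing that from a sweep (my own simplification of the paper's definition): given eval losses across a range of learning rates, take the mean excess loss over the best-LR loss. Lower means a flatter LR/loss curve and a more forgiving LR choice.

```python
def lr_sensitivity(losses_by_lr: dict[float, float]) -> float:
    """Mean excess eval loss over the best (optimal-LR) loss
    across a learning-rate sweep. Lower = less LR-sensitive."""
    best = min(losses_by_lr.values())
    excesses = [loss - best for loss in losses_by_lr.values()]
    return sum(excesses) / len(excesses)
```

For example, a sweep of {1e-3: 2.0, 3e-3: 1.8, 1e-2: 2.2} gives excesses of 0.2, 0.0, and 0.4, so a sensitivity of 0.2.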