Horace He (@chhillee) Bsky

Yep that's right! A very common use-case is for "document masking" (i.e. variable length sequences), and that requires recomputing the mask on every iteration (which isn't "free", but is on the order of microseconds to milliseconds and not seconds).
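
To make the document-masking idea concrete, here is a minimal pure-Python sketch, not FlexAttention itself. It assumes several variable-length documents are packed into one sequence and that the predicate follows the `(b, h, q_idx, kv_idx)` shape FlexAttention uses for a `mask_mod`. The helper names (`make_document_ids`, `make_document_causal_mask`) and the added causal constraint are illustrative assumptions. Because the document lengths change from batch to batch, the mask has to be rebuilt each iteration.

```python
# Sketch only: plain Python ints stand in for the index tensors a real
# FlexAttention mask_mod would receive. All helper names are hypothetical.

def make_document_ids(lengths):
    """Map each position in the packed sequence to the index of its document."""
    ids = []
    for doc, n in enumerate(lengths):
        ids.extend([doc] * n)
    return ids

def make_document_causal_mask(document_ids):
    """Build a predicate that allows attention only within one document.

    The causal constraint (q_idx >= kv_idx) is an assumption for illustration.
    """
    def mask_mod(b, h, q_idx, kv_idx):
        same_doc = document_ids[q_idx] == document_ids[kv_idx]
        return same_doc and q_idx >= kv_idx
    return mask_mod

# The lengths differ per batch, so this is rebuilt on every iteration.
lengths = [3, 2]
document_ids = make_document_ids(lengths)
mask_mod = make_document_causal_mask(document_ids)

# Materialize a dense mask just to inspect it.
n = len(document_ids)
dense = [[mask_mod(0, 0, q, kv) for kv in range(n)] for q in range(n)]
for row in dense:
    print("".join("x" if allowed else "." for allowed in row))
```

In actual use, a predicate like this would operate on tensors and be handed to `create_block_mask` from `torch.nn.attention.flex_attention`, which is the per-iteration recomputation cost mentioned above.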

1 year ago 3 0 0 0

What does "there" mean in this case :)

1 year ago 1 0 1 0

Building Machine Learning Systems for a Trillion Trillion Floating Point Operations YouTube video by Jane Street

@chhillee.bsky.social's talk at Jane Street is now up!

youtu.be/139UPjoq7Kw?...

1 year ago 31 7 0 1

I’ll count it!

1 year ago 2 0 0 0

GitHub - Smith42/astroPT: Transformer for galaxy images (and general astronomy)

Getting different attention masks working for AstroPT (a proto-foundation model for astronomy github.com/Smith42/astr...), so much nicer to do it with Flex Attention vs custom CUDA kernels -- thank you for releasing it to the world 🫡

1 year ago 4 1 0 0

Kinda interesting to me that the books I obsessively read as an elementary schooler are still some of the most popular series today.

1 year ago 3 0 1 0


I think torch-xla is definitely usable if you don’t want to train anything particularly weird or use unusual parallelism schemes. See this tweet from Saining Xie’s lab on evaluating torch-xla vs. JAX for their use case: x.com/tongpetersb/...

1 year ago 1 0 1 0

The other nice part about TPUs is that Google gives many more of them out for free compared to GPUs. Arguably this reflects how much people want to use them, but I think it's been a great boon for the academic labs willing to go through the effort.

1 year ago 0 0 2 0

I judge social networks by how many FlexAttention users I can find on each one, and by that metric, Bluesky is doing pretty good!

1 year ago 50 1 1 0

! What were you using it for?

1 year ago 1 0 1 0

A lot of PyTorch is about dealing with this stuff nowadays!

1 year ago 3 0 0 0

Out of curiosity, what kind of shapes are you typically looking at?

1 year ago 1 0 1 0

Are they actually using FlexAttention here? I didn't see it in the repo

1 year ago 0 0 1 0

Vote on new features! · pytorch torchtitan · Discussion #693 Hi torchtitanists, Thank you for your interests in torchtitan! Please upvote on what features you would like to see next, and add one if it's not already there. We'll try to prioritize on the most ...

If you'd like to influence what features the PyTorch distributed team works on in torchtitan (e.g. MoE, multimodal, context parallelism, etc.), go make your voices heard here!

1 year ago 11 1 0 0

First thought: Seems kinda "FlexAttention-y": bsky.app/profile/sungkim.bsky.soc...

Second thought: oh cool, they're already using FlexAttention!

it's a nice usage of the `or_masks` and `and_masks` API - I think they do (causal & sliding_window) | (register_mask)
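
A rough pure-Python sketch of that composition, (causal & sliding_window) | register_mask, follows. The combinators `py_and_masks` and `py_or_masks` are hand-written stand-ins for `and_masks` and `or_masks` from `torch.nn.attention.flex_attention`, not the real functions. The window size, the number of register tokens, and the assumption that register tokens sit at the start of the sequence are all hypothetical, since the post does not specify them.

```python
# Sketch only: plain ints instead of tensors, and hand-written combinators
# standing in for torch's and_masks / or_masks.

WINDOW = 2          # hypothetical sliding-window size
NUM_REGISTERS = 1   # hypothetical: register tokens occupy the first positions

def causal(b, h, q_idx, kv_idx):
    # A query may only attend to itself and earlier positions.
    return q_idx >= kv_idx

def sliding_window(b, h, q_idx, kv_idx):
    # A query may only attend to keys at most WINDOW positions behind it.
    return q_idx - kv_idx <= WINDOW

def register_mask(b, h, q_idx, kv_idx):
    # Every query may attend to the register tokens (assumed placement).
    return kv_idx < NUM_REGISTERS

def py_and_masks(*mods):
    """Allow a (q, kv) pair only if every mask allows it."""
    def combined(b, h, q_idx, kv_idx):
        return all(m(b, h, q_idx, kv_idx) for m in mods)
    return combined

def py_or_masks(*mods):
    """Allow a (q, kv) pair if any mask allows it."""
    def combined(b, h, q_idx, kv_idx):
        return any(m(b, h, q_idx, kv_idx) for m in mods)
    return combined

# (causal & sliding_window) | register_mask
mask_mod = py_or_masks(py_and_masks(causal, sliding_window), register_mask)

SEQ_LEN = 5
dense = [[mask_mod(0, 0, q, kv) for kv in range(SEQ_LEN)] for q in range(SEQ_LEN)]
for row in dense:
    print("".join("x" if allowed else "." for allowed in row))
```

The last row shows the effect of the union: position 4 sees its local causal window (positions 2 to 4) plus the register token at position 0, but not position 1.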

1 year ago 9 0 0 0
