Hmm wdym? Predicting the training-data generating process is basically all they're doing when actually learning (not the same thing as regurgitating the training data).
Posts by deen
So an LLM trained on many languages will be better than one trained or specialized for one, especially when there is such data disparity. The key then is to reduce hallucinations and ensure it can say "I don't know" about cultures that are mostly oral and largely unrecorded, so you can use it to safely zoom into subcultures of interest.
It might seem that way, but LLMs do better, and represent cultures better, with more data. There are also shared properties across all human languages and cultures that we don't so readily observe but that are captured in those abstract relational spaces.
But note, this isn't something introduced by LLMs. It is downstream of wealth disparities and of what the more privileged literati like to study and talk about, but also of simple things like population size. So LLMs know a lot more about West African mythology than Central Asian.
It's not as bad as you'd think. (The internet, youtube in particular, and really anything that reduces communication latency and enables the spread of ideas, has strong homogenization properties. But cultures are also robust. So hip-hop dances are globally dominant, but artists mash them with traditional styles creatively. In subsets of youtube you can find movies and tv shows from all over Africa and Southeast Asia. And in every LLM you can condition into a subspace with more concentrated knowledge of local languages, mythologies, customs and traditions than anything that came before, by far!)
No. Just split the task. Given enough data it's a simple thing to brute-force memorize a simple map from a token to its contents. There are: typos, joined words, educational spelling materials, translations to and from phonetic codes, acronyms, camelCase and more. Tons of signal to learn from over time.
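A toy sketch of mining one such signal (a hypothetical helper of mine, not anyone's actual pipeline): camelCase identifiers pair a fused word with its spelled-out parts for free, exactly the kind of incidental supervision that ties a token to the characters inside it.

```python
import re

def camelcase_pairs(text):
    # Find camelCase words and emit (fused word, spelled-out parts) pairs.
    pairs = []
    for word in re.findall(r"\b[a-z]+(?:[A-Z][a-z]+)+\b", text):
        parts = re.findall(r"[a-z]+|[A-Z][a-z]*", word)
        pairs.append((word, " ".join(p.lower() for p in parts)))
    return pairs

print(camelcase_pairs("use camelCase and getUserName here"))
# [('camelCase', 'camel case'), ('getUserName', 'get user name')]
```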
Gain of function for AI is stuff like continual learning, recurrence, and unobservable iterative refinement: stuff that makes them more expressive and opaque. Luckily the first is hard, the second faces major stability issues, and the last is hard to do flexibly. But folks keep chasing after them.
¯\_(ツ)_/¯
There's also, in the other direction, the fact that a transformer forward pass cannot compute P-complete functions (fundamentally sequential problems), and no amount of scaling helps with deep sequential dependencies. CoT extends LLMs to P, and so is most faithful for harder problems/computations.
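To make the sequential-dependency point concrete, a toy illustration of mine (not from the post): composing a long chain of shuffles, the S5 word problem, is the classic example believed to be beyond any fixed constant-depth forward pass for arbitrary chain lengths, while a step-by-step (CoT-like) loop handles it trivially.

```python
def compose(p, q):
    # (p after q): apply permutation q first, then p
    return [p[q[i]] for i in range(len(p))]

# A 5-cycle on {0..4}; composing it with itself 5 times returns the identity.
cycle = [1, 2, 3, 4, 0]
result = list(range(5))  # start from the identity permutation
for _ in range(5):
    result = compose(cycle, result)
print(result)  # [0, 1, 2, 3, 4]
```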
What of just joules?
Possibly. I don't know how to think about this properly yet. But what I'm vaguely pointing at is that I'm not as clear about what it means for a thing to exist as I used to be. Which also unclarifies what non-existence means too.
I disagree, I'm fairly certain that below the surface is an argument that's interesting to analyze. I'll revisit it again soon to be sure. I have a faint memory of there being a subtle aspect that most skipped over because the claim feels so obviously wrong.
To have sufficient static capacity to match this, we are in fact still several [(5+)?] orders of magnitude off. I'll revisit this again with carefully done calculations next time.
The brain not only changes its synaptic weights but also rewires its physical routing structure. This means the idea that we are close is only deceptively true. The brain, navigating a ~2^(10^15)-large state space, rewires more terabytes' worth of structure in a month than the LLM's total specification holds.
That's about 5 terabytes, which is only [small world network based estimation elided] about a factor of 50-60 off the human brain. A static snapshot in time of the human brain.
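The 5 terabytes is just unit arithmetic on the estimates from the posts below (~10^13 params at ~4 bits each); a quick back-of-envelope check:

```python
params = 1e13           # hypothesized total parameter count
bits_per_param = 4      # ~4 bits of information captured per parameter
total_bytes = params * bits_per_param / 8  # 8 bits per byte
print(total_bytes / 1e12)  # 5.0 (terabytes)
```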
The thing that makes this complicated is that while the LLM is fixed in its addressable space, the brain is plastic.
Something curious is that diffusion based pure image models can't count past 4-8, which is also about the subitizing limit in animals.
If it's true that a ~10^13 total param model exists, then: synapses hold about 4-6 bits each, and each parameter in an LLM captures about 4 bits of information.
Predictively relevant representations learned from text would likely already carry the most important bits.
I agree the world would be sampled more richly given more senses, but I'd be surprised if much more than what's already captured in text is required for modeling. Like how consciousness only processes some tens of bits per second at most, or how bad most of us are at drawing what we see/saw.
I do think that whole branch of philosophy/religion is surprisingly very relevant yes!
Tbh, I'm struggling with this too.
I suspect only a modest lift: slightly better at process explanations and a small boost to spatial reasoning, if at all (proprioceptive data is most important for that).
Multimodality is indeed a weak aspect of models. But it's not an afterthought--it's (oh dear this pattern is low class now isn't it?)--it's just difficult.
However, what independent/novel bits do you suppose exist in image and video data that is not already present in sufficient form in text?
Alas, only for easy/well-structured spaces. Which this problem was explicitly constructed to have.
(We can see buried evidence, i.e. the lack of well-defined progress signals, as the achilles heel. In essence, the LLMs have weaker priors but a highly structured relational model of what makes up the problem space, one aligned well enough with what they know. Novel combinations of those can then make progress.)
Hmm, depends on the structure of the solution space. For this problem, it seems there were lots of shallow solutions readily reachable (i.e. thinking of this as searching a decision tree). For problems with deeply buried evidence and/or sparse leaf nodes, the volume strat won't work anywhere near as well.
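The tree-search intuition in numbers (a toy model of mine, not from the thread): under uniform random descent, the chance of hitting a single solution node falls off as branching^-depth, so shallow solutions are cheap for a volume strategy while deeply buried ones are hopeless.

```python
def hit_probability(depth, branching):
    # chance that one uniformly random root-to-leaf walk
    # reaches the unique solution node at the given depth
    return branching ** -depth

print(hit_probability(2, 3))   # shallow solution: ~0.11
print(hit_probability(10, 3))  # deeply buried:    ~1.7e-05
```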
Can you explain?
As language models, the continuous ones do not work well, and the discrete ones don't change things (they work better, though still not as well as AR-LLMs; but they do make things wonderfully convoluted for the maybe-conscious-LLM insister).
There's also the side-possibility that knowing too much sometimes flips and makes some inference problems harder, and that a solver (person) worked one out not due to cleverness particularly, but because they just happened, by chance, to know the right things (and just those), which made the problem legible.
Where being super-intelligent and capable does not mean being able to select well across combinatorial decision spaces. In a sense there's a kind of arbitrariness to it, maybe linked to our roles as sophisticated observers embedded in this universe. Which is a deeper thing than we suspect?
I think it's more that good ideas are hard to find; then the ability to see one through is a further non-trivial challenge. In theory anyone can write a story, but a good story idea, and then polishing it into a novella, is much, much harder.
I also think Moravec's paradox will hit AIs hard here too.