📄 New paper: “A Minimum Description Length Approach to Regularization in Neural Networks”
with Orr Well, Emmanuel Chemla, @rkatzir.bsky.social and @nurikolan.bsky.social
We explore why neural networks often struggle with simple, structured tasks.
Spoiler: our regularizers might be the problem.
🧵
Posts by Nur Lan
Sorry, missed the G&H ref!
Re: precision, I meant that there's a chance the "compressed" network is in practice not a bottleneck if it can store the same (or approximately the same) solution as the larger net in high-precision weights.
Would be interesting to see how it does when quantized (either at test or training time)
Hi, interesting work. Did you try limiting the precision of the "compressed" network? Number of params is a very crude proxy for the actual information capacity.
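For illustration, here is a minimal sketch of the kind of post-hoc weight quantization this reply is asking about. It assumes simple uniform rounding of weights to 2^bits levels over the weight range; the papers linked above may use different encoding schemes, and `quantize` is a hypothetical helper, not from any of them:

```python
import numpy as np

def quantize(weights, bits):
    """Uniformly quantize weights to 2**bits levels spanning [min, max]."""
    w = np.asarray(weights, dtype=float)
    lo, hi = w.min(), w.max()
    if hi == lo:  # all weights identical: nothing to quantize
        return w
    step = (hi - lo) / (2 ** bits - 1)
    # Snap each weight to the nearest grid point.
    return lo + np.round((w - lo) / step) * step
```

Evaluating a "compressed" net before and after a pass like this (at test time, or with quantization-aware training) would show whether it is actually relying on high-precision weights to store the solution.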
See e.g. aclanthology.org/2024.acl-lon... and doi.org/10.1162/tacl...
and very similar work by Gaier & Ha 2019 arxiv.org/abs/1906.04358
⭐️🗞️ Accepted to ACL 2024 main conference! #ACL2024NLP
Neural nets can in theory learn formal languages such as aⁿbⁿ & Dyck. Yet no one ever finds such nets using standard techniques. Why?
We suggest that the culprit might have been the objective function all along 👇
arxiv.org/abs/2402.10013
Our findings are in line with works such as El-Naggar et al. (2023), who found similar shortcomings of common objectives for other architectures:
proceedings.mlr.press/v217/el-nagg...
As well as with our MDL RNNs who achieve perfect generalization on aⁿbⁿ, Dyck-1, etc:
direct.mit.edu/tacl/article...
3/3
Training an LSTM for aⁿbⁿ using the cross-entropy loss consistently leads to imperfect counting, while using Minimum Description Length (MDL) leads to a provably perfect counting net.
We build an optimal aⁿbⁿ LSTM based on @gail_w et al. (2018) and find that it is not an optimum of the standard cross-entropy loss, even with regularization terms that are expected to lead to good generalization (L1/L2).
Meta-heuristics (early stop, dropout) don't help either.
2/3
We build an optimal aⁿbⁿ LSTM based on Weiss et al. (2018), and find that it does not lie at optima of standard loss terms (cross-entropy with/without L1/L2). Moving to the Minimum Description Length objective (MDL) aligns the network with an optimum of the loss.
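As a toy illustration of the MDL objective (not the paper's actual encoding scheme): the score adds the bits needed to encode the model itself to the bits needed to encode the data under the model, i.e. the summed surprisal, so a simpler grammar can win even at a small cost in fit:

```python
import math

def mdl_score(model_bits, data_probs):
    """MDL objective: |G| (bits to encode the model) + |D:G| (bits to
    encode the data under the model, i.e. summed token surprisal)."""
    data_bits = -sum(math.log2(p) for p in data_probs)
    return model_bits + data_bits
```

Under cross-entropy alone only `data_bits` matters, so an over-parameterized approximate counter can look optimal; the `model_bits` term is what pushes the search toward the small, provably correct net.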
🧪🗞️ New paper with Emmanuel Chemla and @rkatzir.bsky.social:
Neural nets offer good approximation but consistently fail to generalize perfectly, even when perfect solutions are proved to exist.
We check whether the culprit might be their training objective.
arxiv.org/abs/2402.10013
There is in AI today a tendency toward flashy, splashy domains--that is, toward developing programs that can do such things as medical diagnosis, geological consultation (for oil prospecting), designing of experiments in molecular biology, molecular spectroscopy, configuring of large computer systems, designing of VLSI circuits, and on and on. Yet there is no program that has common sense; no program that learns things that it has not been explicitly taught how to learn; no program that can recover gracefully from its own errors. The "artificial expertise" programs that do exist are rigid, brittle, inflexible. Like chess programs, they may serve a useful intellectual or even practical purpose, but despite much fanfare, they are not shedding much light on human intelligence. Mostly, they are being developed simply because various agencies or industries fund them. This does not follow the traditional pattern of basic science. That pattern is to try to isolate a phenomenon, to reduce it to its si
Douglas Hofstadter on toy tasks, in Waking Up from the Boolean Dream, 1982
We take this to show that recent claims about LLMs undermining the argument from the poverty of the stimulus are premature.
Surprisal values for the sentence "I know who John met recently and is going to annoy soon", and its ungrammatical variant "I know who John met recently and is going to annoy you soon". GPT-2 and GPT-J wrongly assign higher probabilities to the ungrammatical continuation.
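The comparison behind this figure can be sketched as follows. In the actual experiments the per-token probabilities come from an LM such as GPT-2 or GPT-J; here they are placeholder inputs, and the function names are ours:

```python
import math

def surprisals(token_probs):
    """Per-token surprisal in bits: -log2 P(token | prefix)."""
    return [-math.log2(p) for p in token_probs]

def sentence_log_prob(token_probs):
    """Total log probability of a sentence = negative summed surprisal."""
    return -sum(surprisals(token_probs))
```

A model passes an item when `sentence_log_prob` for the grammatical continuation exceeds that of the ungrammatical one; the figure shows the models failing this comparison.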
We now test a much larger battery of models on important syntactic phenomena: across-the-board movement and parasitic gaps.
Using cases where humans have clear acceptability judgements, we find that all models systematically fail to assign higher probabilities to grammatical continuations.
Accuracy figure for large language models tested on across-the-board sentences
⚡ 🗞️ New up-to-date version of Large Language Models and the Argument from the Poverty of the Stimulus, work with Emmanuel Chemla and @rkatzir.bsky.social:
ling.auf.net/lingbuzz/006...
We find that minimizing the algorithmic complexity of the net (w/ MDL) results in better generalization, using significantly less data.
The second-best net, a Memory-Augmented RNN by Suzgun et al., shows that expressive power is important for GI but isn't enough when training data is scarce.
Why a new benchmark?
A long line of work tested GI in different ways.
Many showed nets generalizing to some extent beyond training, but usually did not explain why generalization stopped at arbitrary points – why would a net get a¹⁰¹⁷b¹⁰¹⁷ right but a¹⁰¹⁸b¹⁰¹⁸ wrong?
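To make the criterion concrete: membership in aⁿbⁿ is trivially checkable, so "perfect generalization" means getting it right for every n, not just those near the training range. A minimal sketch (our own helper, not benchmark code):

```python
def is_anbn(s):
    """Membership test for the formal language { a^n b^n : n >= 1 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n >= 1 and s == "a" * n + "b" * n
```

A net that accepts "a" * 10**17 + "b" * 10**17 but rejects the same string with n = 10**18 has not induced this rule; it has memorized an approximation up to some depth.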
We introduce BLISS - a Benchmark for Language Induction from Small Sets.
The benchmark assigns a generalization index to a model based on how much it generalizes from how little training data.
The initial release includes languages such as aⁿbⁿ, aⁿbᵐcⁿ⁺ᵐ, Dyck-1, and Dyck-2.
Grammar induction (GI) involves learning a formal grammar from a finite, often small, sample of a typically infinite language. To do this, a model must be able to generalize well.
Humans do this remarkably well based on very little data. What about neural nets?