
Posts by Nur Lan

📄 New paper: “A Minimum Description Length Approach to Regularization in Neural Networks”
with Orr Well, Emmanuel Chemla, @rkatzir.bsky.social and @nurikolan.bsky.social

We explore why neural networks often struggle with simple, structured tasks.
Spoiler: our regularizers might be the problem.

🧵

10 months ago

Sorry, missed the G&H ref!

Re: precision, I meant that there's a chance the "compressed" network is in practice not a bottleneck, if it can store the same (or an approximate) solution as the larger net in high-precision weights.
Would be interesting to see how it does when quantized (either at test or at training time)

1 year ago

Hi, interesting work. Did you try limiting the precision of the "compressed" network? Number of params is a very crude proxy for the actual information capacity.

See e.g. aclanthology.org/2024.acl-lon... and doi.org/10.1162/tacl...

and very similar work by Gaier & Ha 2019 arxiv.org/abs/1906.04358

1 year ago

⭐️🗞️ Accepted to ACL 2024 main conference! #ACL2024NLP

Neural nets can in theory learn formal languages such as aⁿbⁿ & Dyck. Yet no one ever finds such nets using standard techniques. Why?

We suggest that the culprit might have been the objective function all along 👇

arxiv.org/abs/2402.10013
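For context on why these targets are appealing: membership in aⁿbⁿ is trivial to specify exactly, so a net's failures are unambiguous. A minimal sketch of a checker (the function name is mine, not from the paper):

```python
def is_anbn(s):
    """Membership in {a^n b^n : n >= 1}: a block of a's followed by
    an equally long block of b's."""
    n = s.count("a")
    return n >= 1 and s == "a" * n + "b" * n
```

A net "learns" the language when it predicts the next symbol correctly for every such string, for unbounded n, which is exactly where standard training tends to stop short.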

1 year ago
Minimum Description Length Recurrent Neural Networks. Abstract: We train neural networks to optimize a Minimum Description Length score, that is, to balance between the complexity of the network and its accuracy at a task. We show that networks optimizin...

Our findings are in line with works such as El-Naggar et al. (2023), which found similar shortcomings of common objectives for other architectures:
proceedings.mlr.press/v217/el-nagg...

As well as with our MDL RNNs, which achieve perfect generalization on aⁿbⁿ, Dyck-1, etc.:
direct.mit.edu/tacl/article...

3/3

2 years ago
Training an LSTM for aⁿbⁿ using the cross-entropy loss consistently leads to imperfect counting, while using Minimum Description Length (MDL) leads to a provably perfect counting net.

We build an optimal aⁿbⁿ LSTM based on @gail_w et al. (2018) and find that it is not an optimum of the standard cross-entropy loss, even with regularization terms that are expected to lead to good generalization (L1/L2).

Meta-heuristics (early stopping, dropout) don't help either.

2/3
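What "perfect counting" means here can be written down directly. A hand specification of the behavior a counting LSTM in the style of Weiss et al. (2018) implements in its cell state (a sketch; the symbols and function name are mine):

```python
def counter_predictions(s):
    """Idealized next-symbol predictions for a^n b^n: increment a
    counter on 'a', decrement on 'b'; once b's start, only 'b' is
    valid until the counter hits zero, then end-of-sequence ('#')."""
    count, preds = 0, []
    for ch in s:
        count += 1 if ch == "a" else -1
        if count == 0:
            preds.append("#")    # balanced: the sequence may end here
        elif ch == "a":
            preds.append("a|b")  # may keep adding a's or switch to b's
        else:
            preds.append("b")    # must keep emitting b's
    return preds
```

A net that only approximates this counter drifts at some large n, which is what "imperfect counting" looks like in practice.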

2 years ago
We build an optimal aⁿbⁿ LSTM based on Weiss et al. (2018), and find that it does not lie at optima of standard loss terms (cross-entropy with/without L1/L2). 
Moving to the Minimum Description Length objective (MDL) aligns the network with an optimum of the loss.
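The MDL objective in one line: minimize the bits needed to encode the network plus the bits needed to encode the data given the network. A simplified sketch (this is the generic two-part MDL score, not the paper's exact encoding scheme):

```python
import math

def mdl_score(hypothesis_bits, predicted_probs):
    """Two-part MDL: |H| (bits to encode the network) plus |D:H|
    (bits to encode the data given the network, i.e. the summed
    surprisal -log2 p over the observed symbols)."""
    data_bits = -sum(math.log2(p) for p in predicted_probs)
    return hypothesis_bits + data_bits
```

Cross-entropy alone is only the |D:H| term; adding |H| is what lets a small, exactly correct network beat a large approximate one.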

🧪🗞️ New paper with Emmanuel Chemla and @rkatzir.bsky.social:

Neural nets offer good approximation but consistently fail to generalize perfectly, even when perfect solutions are proved to exist.

We check whether the culprit might be their training objective.

arxiv.org/abs/2402.10013

2 years ago
GitHub - 0xnurl/gpts-cant-count: Demo of even the most advanced LLMs' inability to handle basic arithmetic.

🎲 GPTs can't count – new simple demo of LLMs' very partial arithmetic.

github.com/0xnurl/gpts-...

2 years ago
There is in AI today a tendency toward flashy, splashy domains--that is, toward developing programs that can do such things as medical diagnosis, geological consultation (for oil prospecting), designing of experiments in molecular biology, molecular spectroscopy, configuring of large computer systems, designing of VLSI circuits, and on and on. Yet there is no program that has common sense; no program that learns things that it has not been explicitly taught how to learn; no program that can recover gracefully from its own errors. The "artificial expertise" programs that do exist are rigid, brittle, inflexible. Like chess programs, they may serve a useful intellectual or even practical purpose, but despite much fanfare, they are not shedding much light on human intelligence. Mostly, they are being developed simply because various agencies or industries fund them. This does not follow the traditional pattern of basic science. That pattern is to try to isolate a phenomenon, to reduce it to its si...

Douglas Hofstadter on toy tasks, in Waking Up from the Boolean Dream, 1982

2 years ago

We take this to show that recent claims about LLMs undermining the argument from the poverty of the stimulus are premature.

2 years ago
Surprisal values for the sentence "I know who John met recently and is going to annoy soon", and its ungrammatical variant "I know who John met recently and is going to annoy you soon". GPT-2 and GPT-j wrongly assign higher probabilities to the ungrammatical continuation.

We now test a much larger battery of models on important syntactic phenomena: across-the-board movement and parasitic gaps.

Using cases where humans have clear acceptability judgements, we find that all models systematically fail to assign higher probabilities to grammatical continuations.
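A reminder of the metric behind these plots: surprisal is just −log₂ p, so "wrongly assigns higher probability to the ungrammatical continuation" means the ungrammatical one gets lower surprisal. A toy illustration (the probabilities below are invented for illustration, not model outputs):

```python
import math

def surprisal_bits(p):
    """Surprisal of an event with probability p, in bits."""
    return -math.log2(p)

# Hypothetical probabilities for the two continuations; a model showing
# the reported failure ranks the ungrammatical one as LESS surprising.
p_grammatical, p_ungrammatical = 0.001, 0.004
failure = surprisal_bits(p_ungrammatical) < surprisal_bits(p_grammatical)
```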

2 years ago
Accuracy figure for large language models tested on across-the-board sentences

⚡ 🗞️ New up-to-date version of Large Language Models and the Argument from the Poverty of the Stimulus, work with Emmanuel Chemla and @rkatzir.bsky.social:

ling.auf.net/lingbuzz/006...

2 years ago

We find that minimizing the algorithmic complexity of the net (w/ MDL) results in better generalization, using significantly less data.

The second-best net, a Memory-Augmented RNN by Suzgun et al., shows that expressive power is important for GI but isn't enough with little data.

2 years ago

Why a new benchmark?

A long line of work tested GI in different ways.

Many showed nets generalizing to some extent beyond training, but usually did not explain why generalization stopped at arbitrary points – why would a net get a¹⁰¹⁷b¹⁰¹⁷ right but a¹⁰¹⁸b¹⁰¹⁸ wrong?

2 years ago

We introduce BLISS - a Benchmark for Language Induction from Small Sets.

The benchmark assigns a generalization index to a model based on how much it generalizes from how little training data.

The initial release includes languages such as aⁿbⁿ, aⁿbᵐcⁿ⁺ᵐ, and Dyck 1-2.
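All three language families have exact membership conditions, which is what makes graded generalization measurable. Sketches of checkers for two of them (function names are mine, not from the benchmark's API):

```python
def is_anbm_cnplusm(s):
    """Membership in {a^n b^m c^(n+m)}: the c-block's length must be
    the sum of the a- and b-block lengths."""
    n, m = s.count("a"), s.count("b")
    return s == "a" * n + "b" * m + "c" * (n + m)

def is_dyck1(s):
    """Dyck-1: balanced strings over a single bracket pair."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:  # closed a bracket that was never opened
            return False
    return depth == 0
```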

2 years ago

Grammar induction (GI) involves learning a formal grammar from a finite, often small, sample of a typically infinite language. To do this, a model must be able to generalize well.

Humans do this remarkably well based on very little data. What about neural nets?

2 years ago
GitHub - taucompling/bliss: 🧘 BLISS – a Benchmark for Language Induction from Small Sets

How well can neural networks generalize from how little data?

New work with Emmanuel Chemla and Roni Katzir:

Benchmark:
github.com/taucompling/...

Paper:
aclanthology.org/2023.clasp-1...

🧵

2 years ago