Getting Past Past-Tense
[ANNs] are not perfect: they are not really explainable, they are not
pliable, i.e., they cannot be easily modified to correct any errors
observed, and they are not efficient due to the overhead of decoding. In
contrast, rule-based methods are more transparent to subject matter
experts; they are amenable to having a human in the loop through
intervention, manipulation and incorporation of domain knowledge;
and further the resulting systems tend to be lightweight and fast.
(Chiticariu et al. 2023, p. iii)
In what is known in the literature as the past-tense debate (e.g.,
Elman et al., 1996; Pinker & Ullman, 2002), cognition and its
underpinning substrates were discussed in terms of whether hard-
wired capacities, such as grammatical rules for English past-tense
formation, are encoded in the genes or otherwise without learning.
Furthermore, claims were made about connectionist systems, for
example, that ANN “models cannot deal with languages such as Hebrew,
where regular and irregular nouns are intermingled in the same
phonological neighborhoods” (Pinker & Ullman, 2002, p. 459).
While it may have been true for models at the time that certain data
sets were unlearnable, or that specific nondeep ANNs had limited
learning abilities due to their architecture, training set, or regimen,
this no longer holds for certain data sets (discussed below), yet it
continues to hold in the sense that some data sets remain
inaccessible to modeling endeavors using ANNs (see proof in van
Rooij et al., 2024). Work such as Zhang et al.
(2016, 2017) can serve to neutralize the claim that ANNs might
struggle with certain unstructured data sets, for example, “where
regular and irregular nouns are intermingled” (Pinker & Ullman,
2002, p. 459), by demonstrating that ANNs can learn utterly random
mappings between inputs and outputs. Of course, such a finding
about ANNs is also problematic for C-connectionists, who propose
that in many cases similar input–output…
universal statistical approximation technique rather than a source of
empirical predictions” (Pinker & Ullman, 2002, p. 474). This is
perhaps prescient; compare this to the Goal row in Table 1. The
reality is complex: ANNs can learn an infinite set of impressive
input–output mappings, hence all the hype, but it is not the case,
and formally so, that they can learn every such mapping (van Rooij
et al., 2024). We unpack this below.
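As a toy illustration of the Zhang et al. (2016, 2017) style of finding (our sketch, not their experimental setup), an overparameterized one-hidden-layer network trained by plain gradient descent can drive training error down even when labels are assigned completely at random, that is, when the input–output mapping is utterly arbitrary:

```python
# Toy sketch (ours, not Zhang et al.'s setup): a small network with far
# more hidden units than data points can memorize random labels.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))               # 8 random inputs, 4 features
y = rng.integers(0, 2, size=8).astype(float)  # labels assigned at random

# Overparameterized: 128 hidden units for only 8 data points.
W1 = rng.standard_normal((4, 128)) * 0.5
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 1)) * 0.1
b2 = np.zeros(1)

lr = 0.5
for _ in range(5000):                          # full-batch gradient descent
    H = np.tanh(X @ W1 + b1)                   # hidden activations
    p = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))   # sigmoid output
    dz = (p - y[:, None]) / len(y)             # cross-entropy grad w.r.t. logit
    dW2, db2 = H.T @ dz, dz.sum(0)
    dH = dz @ W2.T
    dz1 = dH * (1.0 - H ** 2)                  # tanh derivative
    dW1, db1 = X.T @ dz1, dz1.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Training accuracy on the memorized (noise) labels; typically fits
# all 8 points, though the exact figure depends on the random seed.
train_acc = ((p[:, 0] > 0.5) == (y > 0.5)).mean()
```

That such memorization succeeds is precisely why fitting a data set, by itself, licenses no conclusion about the structure of the mapping learned.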
Rehashing the past-tense debate is not useful (for our purposes),
but learning from the mistakes and pitfalls of past rhetoric is useful
to the practitioners who wish to carry out connectionist modeling.
On the one hand, it may not come as a surprise to some that even
at the birth of M-connectionism (circa 2010; Table 1) and to this
day, the past-tense “veritable brouhaha” (Kirov & Cotterell, 2018)
was and is discussed by practitioners (e.g., Corkery et al., 2019;
Kohli et al., 2020; X. Ma & Gao, 2022; Oh et al., 2011; Seidenberg
& Plaut, 2014; Westermann & Ruh, 2012).
On the other hand, ANNs, on the cusp of M-connectionism, are
far from their days of being framed as flawed for being unable to
compute XOR. They are now seemingly impervious to critique and,
in fact, an old theoretical weakness has been co-opted and reframed
as a strength; these models are now upgraded to universal function
approximators:
Notably, these statements do not follow one way or another. If a
model is indeed a universal approximator for any function, why
would scientists need to “show that neural networks could reproduce
the gamut of psychological phenomena”? On the contrary, such
reproduction is a given if they are indeed so powerful (hence the
critique above by
Pinker & Ullman, 2002). This deserves careful analysis, as many
miscommunications abound with respect to this period (Olazaran, 1996;
Schmidhuber, 2015): what results such as Cybenko (1989), Hornik
(1991), and Hornik et al. (1989) prove is not that ANNs can find a
function approximation for any input–output mapping, but that, in
principle, a model that looks like an ANN, that is, one that could be
built up of ANN components, can stand in for any function from a
given class of functions.
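For concreteness, the representational claim these theorems license can be paraphrased as follows (our gloss on Cybenko, 1989, which is stated for a continuous sigmoidal function σ on the unit cube):

```latex
% Paraphrase of Cybenko (1989): finite sums of the form
\[
  G(x) \;=\; \sum_{j=1}^{N} \alpha_j \,\sigma\!\left(w_j^{\top} x + \theta_j\right)
\]
% are dense in $C([0,1]^n)$: for every $f \in C([0,1]^n)$ and every
% $\varepsilon > 0$, there exist $N, \alpha_j, w_j, \theta_j$ such that
\[
  \sup_{x \in [0,1]^n} \left| G(x) - f(x) \right| \;<\; \varepsilon .
\]
```

Note that this asserts only the existence of suitable parameters; it says nothing about whether any training procedure will find them, a distinction the surrounding text also draws.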
First, this has nothing to do with backpropagation, as the learning
algorithm is not implicated in the universal approximation proofs
cited (Cybenko, 1989; Hornik, 1991; Hornik et al., 1989); the only
relevant idea is that of multiple hidden-unit layers, which was known
at the time of the perceptrons controversy and which proponents repeated