“Statistical and inductive inference by minimum message length”, by Chris Wallace, is honestly one of my favorite books.
“An introduction to universal artificial intelligence”, by Marcus Hutter, also goes very in depth into Solomonoff induction & why it’s so powerful.
Posts by Petal Mokryn
I’m personally more familiar with the Bayesian approaches to this, less so the MDL & NML stuff
Whichever you use, both rely on the idea that inference and codes are deeply related, and that codes can be a powerful tool for inference.
This stuff is heavily related to Minimum Description Length principles, which in turn relate to Kolmogorov complexity.
The Bayesian counterparts are Minimum Message Length & Solomonoff induction.
There’s indeed a ton of really fascinating stuff there.
Thus far I couldn’t find anything, but yeah I’m sure someone must’ve at least thought about it before.
So I think we should just use the posterior joint log probability of the observed data as a summary statistic to test in posterior predictive checks.
Or rather, use “H + (log(P)/N)”, where P is the posterior probability (or density) of the observed data. (8/8)
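A minimal sketch of that statistic (the IID Bernoulli model and the `typicality_stat` helper are my own illustrative choices, not from the thread): for a typical sample, H + log(P)/N sits near 0, while atypical data pushes it away from 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_entropy(p):
    """Entropy rate (in nats) of an IID Bernoulli(p) process."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def typicality_stat(data, p):
    """H + log(P)/N: near 0 for typical samples of Bernoulli(p)."""
    log_joint = np.sum(np.where(data == 1, np.log(p), np.log(1 - p)))
    return bernoulli_entropy(p) + log_joint / len(data)

p = 0.9
typical = rng.binomial(1, p, size=10_000)   # an actual sample
all_ones = np.ones(10_000, dtype=int)       # the single most likely sequence

print(typicality_stat(typical, p))   # ~0, within a few standard errors
print(typicality_stat(all_ones, p))  # ~0.22: high probability, yet atypical
```

Note the all-ones sequence scores as atypical even though it is the most probable single sequence, which is exactly the point of the typical-set view.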
The Asymptotic Equipartition Property has been shown to hold for so many different classes of stochastic processes that I suspect it might be fundamental to the very idea of stochastic processes. At the very least, anything that violates the AEP must be very pathological. (7/?)
But there’s no unified formal way to do this; it’s done on a more case-by-case basis. And I’m just here thinking… why not use the posterior joint log probability of the observed data? (6/?)
, then the observed data should be similar to the pseudodata. A lot of people just plot them together and check whether the observed data lies within the pseudodata. Sometimes a summary statistic is tested, such as the mean of the observed data vs. the mean of the pseudodata. (5/?)
Sorry, it should be near -NH.
So, a lot of practitioners do a posterior predictive check simply by generating a lot of pseudodata from the posterior, and comparing it to the observed data that generated the posterior. If the posterior is accurate to the true data generating process, (4/?)
Asymptotically so, of course. For finite sample size you get Frequentist bounds on the joint log probability of the sampled sequence - it should be near N*H, where H is the entropy rate of the process.
Now, how does this relate to posterior predictive checks? (3/?)
This holds for some very wide classes of stochastic processes, which have an “Asymptotic Equipartition Property” - there’s a Typical set of sequences with nearly equal (and individually tiny) probabilities, and as N increases you’re sampling from that Typical set with probability approaching 1. (2/?)
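Stated formally (a standard information-theory fact, with base-2 logs): for processes with the AEP, the per-symbol log probability concentrates on the entropy rate H,

```latex
-\frac{1}{N}\log_2 p(X_1,\dots,X_N) \;\longrightarrow\; H
\quad \text{(in probability, as } N \to \infty\text{)}
```

so each typical sequence has probability roughly 2^{-NH}, and the Typical set, though a vanishing fraction of all sequences, carries total probability approaching 1.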
If you do N IID Bernoulli experiments with p=0.9, a sequence of N 1’s is the highest probability sequence, but it’s not a *typical* sample of the N Bernoullis - you don’t actually expect to draw N ones.
Turns out, there’s a set of sequences you *do* expect to draw from - the Typical set. (1/?)
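A quick numerical illustration of that Bernoulli example (my own toy demo; N, the number of draws, and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 0.9

# Probability of the single most likely sequence (all ones): vanishingly small.
p_all_ones = p ** N   # ~7e-10

# Empirically, we essentially never draw it...
draws = rng.binomial(1, p, size=(20_000, N))
frac_all_ones = np.mean(draws.sum(axis=1) == N)

# ...because typical sequences have about N*p = 180 ones, not 200.
mean_ones = draws.sum(axis=1).mean()
print(p_all_ones, frac_all_ones, mean_ones)
```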
One small thing I might eventually seek to push for is using typicality as a more unified summary statistic for posterior predictive checks, since all such checks essentially perform a Frequentist hypothesis test of whether the observed data lies in the typical set of the posterior model
Just a pet peeve
I specifically prefer the Bayesian approach due to how explicit all the assumptions are, and how they’re all essentially in one place (in the model itself) which I hope should make it easier to catch bad modeling/cheating.
And there are a ton of diagnostics to check posteriors, from MCMC or VI
What good reasons are there to forgo Bayesian inference?
I’ve yet to see circumstances in which Bayesian inference is wholly inappropriate, other than it perhaps being tricky/costly to fit or if the choice of priors is bad
I’m always happy to learn of new perspectives, if you’re willing to share!
The disappointment I felt when I realized you didn’t mean E. T. Jaynes…
(6/6) The Frequentist approach, I’d teach students only *after* they learn Bayesian stats, and intuitively understand stats as a form of applied epistemology.
Just my personal opinion, I think that’d really help students develop stats intuition instead of seeing it as a list of recipes for data.
(5/6) After that 1 introductory lesson, I’d teach students a full course on Bayesian statistics, which is far more intuitively approachable than the Frequentist paradigm.
(4/?) I’d finish the intro by talking about how you can compose Frequentist asymptotic analysis on top of estimators derived from Bayesian inference - a parameter estimator derived from a Bayesian method (e.g. MAP, posterior mean, MML) can be treated like any other estimator in the Frequentist paradigm
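A toy sketch of that composition (the Beta(2, 2) prior and the specific numbers are my own assumptions): derive a MAP estimator from a Bayesian model, then study its Frequentist sampling properties by simulating repeated experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p, n, reps = 0.3, 50, 100_000

# The Frequentist thought experiment: repeat the n-trial experiment many times.
k = rng.binomial(n, true_p, size=reps)

mle = k / n                   # plain maximum likelihood
map_est = (k + 1) / (n + 2)   # MAP (posterior mode) under a Beta(2, 2) prior

# Frequentist properties of the Bayesian-derived estimator:
bias_map = map_est.mean() - true_p                 # small nonzero bias (shrinkage)
mse_map = np.mean((map_est - true_p) ** 2)
mse_mle = np.mean((mle - true_p) ** 2)
print(bias_map, mse_map, mse_mle)
```

In this setup the MAP estimator trades a little bias for lower variance, and ends up with smaller mean squared error than the MLE.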
(3/?) - as possible about the data generating process, used together with asymptotic theory (what’d happen if you were to repeat the experiment N->inf times?) to derive properties of estimators.
I’d first explain what estimators are at the start of the Frequentist explanation, ofc.
(2/?) First, I’d give a brief overview of what statistics aims to do (forming beliefs about the world), and roughly how Bayesians & Frequentists both go about it.
Bayesians with fully specified prior distributions over all considered possible models, and Frequentists with as few assumptions -
(1/?) I’ve never (formally) taught a course before, so this is just my subjective pedagogy opinion
But if I were to teach statistics however I want to, I’d probably do roughly as follows:
In inf dim it’s more about the limitations of the math we have thus far. People are working very hard to push things further in functional analysis.
The Gaussian stuff *is* meaningful in certain contexts, e.g. in particle physics it’s kinetic energy
In other contexts it’s just easier to math
Something I personally think is cool is how the Gaussian integral is really the *only* integral we can analytically solve in arbitrarily high & infinite dimensions, and how so so so many problems across much of modern science are just about trying to extend stuff from the Gaussian case
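Concretely, the identity in question (a standard result, for symmetric positive-definite A):

```latex
\int_{\mathbb{R}^n} \exp\!\left(-\tfrac{1}{2}\, x^{\top} A x + b^{\top} x\right) dx
\;=\; \sqrt{\frac{(2\pi)^n}{\det A}}\;\exp\!\left(\tfrac{1}{2}\, b^{\top} A^{-1} b\right)
```

Every factor on the right still makes sense as n grows, which is part of why Gaussian measures survive the passage to infinite dimensions while Lebesgue measure itself does not.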
Maybe that penalized MLE estimators can be interpreted from a Bayesian perspective?
There’s a whole zoo of Bayesian point estimators e.g. maximum a-posteriori, posterior mean, minimum message length, etc. And in the end, most of them look like some form of penalized maximum likelihood estimator.
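For instance, MAP with a Gaussian prior on linear-regression weights is exactly ridge regression, i.e. an L2-penalized MLE. A small self-contained check (the data here is synthetic, and the variances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=100)

sigma2, tau2 = 0.25, 1.0    # noise variance, prior variance
lam = sigma2 / tau2         # the equivalent ridge penalty

# Penalized MLE (ridge): argmin ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP with prior w ~ N(0, tau2 * I) and noise N(0, sigma2):
# the posterior mode solves the identical normal equations.
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(3) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))  # True: the two estimators coincide
```

Multiplying the MAP normal equations through by sigma2 recovers the ridge equations exactly, with penalty lam = sigma2/tau2.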
I think these papers may be relevant here
www.nature.com/articles/s41...
www.pnas.org/doi/pdf/10.1...
Tl;dr, the idea is that a wide class of functions exhibits “simplicity bias” when the inputs are drawn uniformly at random, and that this effect biases biology toward simplicity.
- on an emotional level. Concrete predictions, put into brief but compelling story form.
Would that be any good in your opinion, or nah? 😅
If the goal is to get the message across to people not already informed on the matter, maybe specific forecasts on likely ways society will collapse if the issues aren’t resolved?
A few different possibilities (gotta represent forecasting uncertainty ofc), each a story hammering the point home -
There are still limitations of course. Especially in mathematical tractability - IFT is a statistical field theory, and things can get complicated fast if the spatial/spatiotemporal data you’re trying to infer has particularly complicated dynamics/statistics.
I also think there’s a lot of room both in making new & exciting variations on the method, and in applying it to new problems.
Oh it’s also very scalable. That’s a major bonus. Can’t forget about the scalability.