I recently held an accelerated course on "Statistical data analysis for fundamental science" for the Instats site. Within only 15 hours of online lectures (although these were full 60-minute blocks, unlike the leaky academic-style hours that last 75% of that) I had to cover not only parameter estimation, hypothesis testing, modeling, and goodness of fit, plus several ancillary concepts of high relevance such as ancillarity (yep), conditioning, the likelihood principle, coverage, and frequentist versus Bayesian inference, but also an introduction to machine learning! How did I do?
I think that overall the experiment went quite well. The lectures, by the way, were recorded, so you can decide for yourself (caveat emptor: there is a registration fee). The recording was at first a source of stress for me: I found it a bit hard to keep everything under control - audio and video quality, timed breaks, shared screens and the like - while I was lecturing, but I soon found my way of handling it.
Anyway, what I mean to report on today is that by bringing together statistics and machine learning in one lecture set I could exploit a few interesting connections between the two subjects. In particular, the connection between fitting (and overfitting) in statistics on one side, and training (and overtraining) in supervised regression on the other. If the connection is not self-evident to you, perhaps reading what follows may be of some use.
In statistics, the fitting problem lives within the huge body of literature that goes by the name of "parameter estimation", which is itself divided into "point estimation" and "interval estimation". Of the two, the latter is the bigger chunk, and a subject of thriving research. In my lectures, which focus on practical use rather than on theoretical issues, I try to convey to the students the crucial ability to make sense of the procedures, and to give them a few vaccination shots against common misconceptions. Since some of the students come to the lectures equipped with some background in machine learning, it is useful to leverage that knowledge by pointing out the parallelism I mentioned. The same works in the opposite direction, too.
When you fit a model to your data points, you aim to estimate the model parameters and their uncertainty. Whatever method you employ, the result will make sense only insofar as your model is sensible; and in general, we tend to assume (correctly, in most cases) that our data are produced through complex mechanisms that a functional form, however flexible, cannot reproduce perfectly.
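To make the fitting step concrete, here is a minimal sketch in Python of what I mean by estimating parameters and their uncertainties. The straight-line model, the data, and the noise level are all invented for illustration; the only point is that a least-squares fit returns both best-fit values and a covariance matrix from which uncertainties are read off.

```python
# A minimal sketch of parameter estimation: fit an invented straight-line
# model to synthetic data with known Gaussian errors and read off the
# parameter uncertainties from the covariance matrix of the fit.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(42)

def model(x, a, b):
    """Closed-form model whose parameters we want to estimate."""
    return a + b * x

# Synthetic data: a "true" line plus Gaussian noise of known sigma
x = np.linspace(0.0, 10.0, 20)
sigma = 0.5
y = model(x, 1.0, 2.0) + rng.normal(0.0, sigma, size=x.size)

# Least-squares fit; pcov is the covariance matrix of the estimates
popt, pcov = curve_fit(model, x, y, sigma=np.full_like(x, sigma),
                       absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print("a = %.3f +- %.3f, b = %.3f +- %.3f" % (popt[0], perr[0], popt[1], perr[1]))
```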
This assumption, correct or not, tends to push us toward more capacious models - ones with a larger number of parameters. In principle there is nothing very wrong with that, as the more capacious model may, for some fixed value of the added parameters, collapse into the simpler one. E.g., a fourth-order polynomial turns into a cubic whenever you fix the coefficient of x^4 to zero, and into a quadratic if you also fix the coefficient of x^3 to zero. This is referred to as the smaller model being "embedded" in the larger one. But a model with too many parameters will invariably give a poor representation of the true distribution you want to infer. The warning sign, of course, is the p-value of your chi-squared getting too high - equivalently, the chi-squared per degree of freedom dropping well below one: the model overfits the data, as it adapts to them too well.
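A quick way to see the warning sign is to fit the same data with nested polynomials of increasing order and watch the chi-squared p-value. The sketch below uses invented data whose true model is a quadratic; the numbers will vary with the random seed, but the pattern to look for is a p-value creeping toward 1 once the extra parameters start fitting the noise.

```python
# A sketch of the nested-polynomial example: fit invented data with
# polynomials of increasing order and compute the chi-squared p-value.
# A p-value creeping towards 1 (chi2/ndof well below 1) signals that the
# added parameters are adapting to the noise.
import numpy as np
from scipy.stats import chi2 as chi2_dist

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 15)
sigma = 0.3
y_true = 0.5 + 1.5 * x - 2.0 * x**2          # the "true" model is a quadratic
y = y_true + rng.normal(0.0, sigma, size=x.size)

for degree in (1, 2, 3, 4):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)
    chi2 = np.sum((residuals / sigma) ** 2)
    ndof = x.size - (degree + 1)
    pval = chi2_dist.sf(chi2, ndof)            # probability of a worse chi2
    print(f"degree {degree}: chi2/ndof = {chi2/ndof:.2f}, p-value = {pval:.3f}")
```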
In my course I always run the experiment of letting students pick a model among three that fit some data reasonably well. They typically end up preferring models that have one or two parameters too many. Only when I ask them to reason about the meaning of the uncertainty bars on the data, and to count what fraction of those vertical bars are intercepted by each model, do they realize the mistake. For one-sigma uncertainty bars you expect a sound model to miss about 32% of them!
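The counting exercise is easy to reproduce numerically. In the sketch below the data are invented and the "fitted" model is taken to be the true one, which is the best case for coverage; even then, roughly a third of the one-sigma bars should miss the curve.

```python
# A sketch of the counting exercise: with honest one-sigma error bars and a
# sound model, about 68% of the bars intercept the curve, i.e. about 32%
# miss it. Data and model are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 200)
sigma = 1.0
y = 3.0 + 0.5 * x + rng.normal(0.0, sigma, size=x.size)

# Pretend the fitted model coincides with the true one
y_model = 3.0 + 0.5 * x

hits = np.abs(y - y_model) <= sigma            # bar intercepts the curve
print(f"fraction of bars missed: {1.0 - hits.mean():.2f}  (expect about 0.32)")
```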
When training machine learning models you cannot count hits and misses in the same way, unless you endow yourself with validation data and/or cross-validation techniques. Without a validation of model performance you will watch the loss function go down, and you will not know where to stop. The validation set will usually show a rise of the loss much earlier than you would have liked - a phenomenon akin to the psychological bias, noted above, that makes us prefer overly complex models in data fitting. What is interesting, however, is that by enlarging your machine learning model (e.g. increasing the number of nodes and layers of a neural network, or the number of trees in a random forest) you may still manage to bring the validation loss down to lower values. And this is not overfitting or overtraining, but a push in a different direction! Where does the extra generalization power come from, though? And even more important: how do we get a parallel phenomenon in data fitting?
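For readers who have never wired this up, here is a sketch of the early-stopping pattern I have in mind: a deliberately over-flexible model trained by gradient descent, with a held-out validation set used to decide when to stop. Everything here (data, model, learning rate, patience) is invented for illustration, not a recipe.

```python
# A sketch of why a held-out validation set is needed: the training loss
# keeps decreasing, so we stop when the validation loss has stopped
# improving for a while ("early stopping"). All numbers are invented.
import numpy as np

rng = np.random.default_rng(3)

def design(x, degree=12):
    """Polynomial feature matrix: a deliberately over-flexible model."""
    return np.vander(x, degree + 1, increasing=True)

# Invented data: a sine curve plus noise, split into train and validation
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=x.size)
X = design(x)
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

w = np.zeros(X.shape[1])
lr = 0.05
best_val, best_w, patience, wait = np.inf, w.copy(), 200, 0

for step in range(20000):
    # Gradient of the mean squared error on the training set
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:        # keep the best model seen so far
        best_val, best_w, wait = val_loss, w.copy(), 0
    else:
        wait += 1
        if wait > patience:        # validation loss has stopped improving
            break
print(f"stopped at step {step}, best validation loss {best_val:.3f}")
```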
Overparametrization in ML is a crucial technique that increases the predictive power of models. By letting our model have multiple sets of optimal solutions we smooth the shape of the loss function in parameter space, which helps gradient methods converge toward a loss minimum. We do not care at all about the non-unique nature of our solution, as long as it has high generalization power. If we were just as relaxed about the non-reproducibility of our best-fit solution in statistical model fitting, we could be led to think that we might purchase some extra precision in our fits... But how?
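The non-uniqueness is easiest to see in a linear toy model rather than a neural network; the sketch below is only meant to illustrate the degeneracy. With more parameters than data points the loss has a whole subspace of exact minima, and whatever solver you use simply picks one representative (the pseudo-inverse below picks the minimum-norm one).

```python
# A sketch of the degeneracy that overparametrization brings: with more
# parameters than data points there are infinitely many exact minima of the
# loss, all fitting the training data equally well. All numbers are invented.
import numpy as np

rng = np.random.default_rng(5)
n_points, n_params = 20, 200                # far more parameters than data
X = rng.normal(size=(n_points, n_params))   # random features
w_true = rng.normal(size=n_params) / np.sqrt(n_params)
y = X @ w_true                              # noiseless targets for simplicity

# Minimum-norm interpolating solution: one member of an infinite family
w_hat = np.linalg.pinv(X) @ y
print("training residual:", np.linalg.norm(X @ w_hat - y))

# Any vector in the null space of X can be added without changing the fit
null_dir = rng.normal(size=n_params)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)   # project out the row space
w_other = w_hat + null_dir
print("same residual for another minimum:", np.linalg.norm(X @ w_other - y))
```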
The answer is that we should wildly enlarge the space of our functions, and conceive some method to pick the best one in a stochastic way. The way to pull that off is to realize that there is no real conceptual difference between a closed-form function we use to fit some data points and a bulky neural network with a thousand nodes in ten layers: the problem was one and the same in the first place, and only our entirely different approaches in statistics and in machine learning, and the resulting algorithms, make them look distinct.
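To make the identification tangible, the sketch below fits the same invented data once with a closed-form cubic (classic least squares) and once with a small neural network (assuming scikit-learn is available). Both are just function fitting; only the parametrization and the optimization algorithm differ.

```python
# A sketch of the identification above: the same invented data fit with a
# closed-form cubic and with a small multilayer perceptron. Both are
# regression problems; the difference is in the function space and algorithm.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(11)
x = np.linspace(-2.0, 2.0, 80)
y = np.tanh(2.0 * x) + rng.normal(0.0, 0.1, size=x.size)

# Closed-form model: cubic polynomial, fitted by linear least squares
coeffs = np.polyfit(x, y, 3)
mse_poly = np.mean((np.polyval(coeffs, x) - y) ** 2)

# "Bulky" model: a small neural network trained on the same data
net = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000, random_state=0)
net.fit(x.reshape(-1, 1), y)
mse_net = np.mean((net.predict(x.reshape(-1, 1)) - y) ** 2)

print(f"cubic fit MSE: {mse_poly:.4f}, neural network MSE: {mse_net:.4f}")
```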
The above identification may be useful, and it is also instructive IMO. However, there are good reasons to keep the statistical problem of data fitting well separated from the ML problem of finding highly generalizing models. Indeed, this becomes apparent when we assess the severity of the mistake of reading too much into our data. In fact, overfitting in statistics is a much worse sin than it is in machine learning! The underlying reason lies in the very objectives of the two disciplines: correctness of inference claims on one side, and performance of the results on the other.
Lateral thinking about the complex tortures we subject our data to is a powerful didactic method. What other examples of it can you think of?