It is conventional to begin any scientific document with an introduction that explains why the subject matter is important. Let us break with tradition and observe that in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary.
Hear that? Next time you fit a straight line to your data, consider that you're probably wasting your time. Stop pandering to style just for the sake of a "catchy punchline and compact, approximate representations".
The problem is this:
It is a miracle with which we hope everyone reading this is familiar that if you have a set of two-dimensional points (x,y) that depart from a perfect, narrow, straight line y=mx+b only by the addition of Gaussian-distributed noise of known amplitudes in the y-direction only, then the maximum likelihood or best-fit line for the points has a slope m and intercept b that can be obtained justifiably by a perfectly linear matrix-algebra operation known as “weighted linear least-square fitting”. This miracle deserves contemplation.
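That "perfectly linear matrix-algebra operation" can be written down in a few lines. As a rough sketch (the data and uncertainties below are made up for illustration): build a design matrix A with a column of ones and a column of x values, form the inverse covariance matrix from the known y-noise amplitudes, and solve for [b, m] in one matrix expression.

```python
import numpy as np

# Hypothetical data: points scattered about y = 2x + 1, with known
# Gaussian noise amplitudes in the y-direction only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sigma_y = np.array([0.2, 0.2, 0.3, 0.2, 0.3])  # known per-point uncertainties

# Design matrix A: a column of ones (intercept b) and a column of x (slope m)
A = np.column_stack([np.ones_like(x), x])

# Inverse covariance matrix of the y values (diagonal: independent noise)
Cinv = np.diag(1.0 / sigma_y**2)

# Maximum-likelihood parameters: [b, m] = (A^T C^-1 A)^-1 A^T C^-1 y
cov = np.linalg.inv(A.T @ Cinv @ A)  # also the covariance of the fitted parameters
b, m = cov @ (A.T @ Cinv @ y)
```

The same `cov` matrix gives you the parameter uncertainties for free, which is part of what makes the Gaussian, known-variance, y-errors-only case such a miracle.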
To cater for the situations where the standard least-squares fit is inappropriate, there appears to be a wide range of possible options with little but aesthetics to distinguish them. Perhaps because of this plethora of options, or perhaps because there is no agreed-upon bible or cookbook to turn to, or perhaps because most investigators would rather make some stuff up that works "good enough" under deadline, or perhaps because many realize, deep down, that much fitting is really unnecessary anyway, there are some egregious procedures and associated errors and absurdities in the literature.
With that out of the way, Hogg goes on to give a very clear and in depth overview of the implications of fitting a straight line to a dataset, commonly made errors, and appropriate ways of dealing with outliers and uncertainties.
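To see why outliers matter so much, here is a toy demonstration (the data are invented, and the sigma-clipping below is a crude stand-in for illustration, not the mixture-model treatment Hogg actually recommends): a single bad point drags the ordinary least-squares slope well away from the truth, while iteratively rejecting large residuals recovers it.

```python
import numpy as np

# Hypothetical data lying exactly on y = 2x + 1, with one gross outlier
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[3] = 30.0  # the outlier (the true value would be 7)

def ls_fit(x, y):
    """Ordinary least-squares slope and intercept."""
    A = np.column_stack([np.ones_like(x), x])
    b, m = np.linalg.lstsq(A, y, rcond=None)[0]
    return m, b

# Naive fit: the outlier pulls the slope away from 2
m_all, b_all = ls_fit(x, y)

# Crude sigma-clipping: fit, reject points with residuals beyond 2 sigma, refit
m, b = m_all, b_all
for _ in range(3):
    resid = y - (m * x + b)
    keep = np.abs(resid - resid.mean()) < 2.0 * resid.std()
    m, b = ls_fit(x[keep], y[keep])
```

Hogg, Bovy & Lang argue for something more principled than clipping: model the outliers explicitly as a second component in the likelihood, so that "bad" points are downweighted probabilistically rather than discarded by fiat.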
The introduction to this paper amused me greatly. But after reading it and recognising myself in the careless investigator, I'm ashamed. I want to go back and check every straight-line fit I've ever done; I'm pretty sure 90% of them were wrong, or at least unnecessary. How will I ever publish again?
My own feelings aside, this is a wonderfully informative paper on a topic most of us consider incredibly dull (cue: more shame). But data fitting matters, in astrophysics and every other kind of data-driven investigation. We should understand the limitations of using approximate measures to represent information.
David Hogg is the data scientist’s Cassandra – but let’s hope more people are paying attention. At AstroInformatics in June he gave an excellent talk (pdf) about his vision of data-driven astronomy. Semantic astronomy, he says, is doomed. We are in danger of becoming overly reliant on catalogs, forgetting that their content consists of meta-data rather than absolute truth. Great stuff.
So scientists, all of you, next time your points look like they might follow a trend, resist that itch. Stand up to the least squares fit! Just say no.
David W. Hogg, Jo Bovy, & Dustin Lang (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686v1