Think you’ve got your least squares down to a tee? Think again.

In a paper posted to the arXiv in late August, David Hogg of NYU and his collaborators take us to task for our sloppy data-fitting habits. And he’s not in the mood to mince his words.

It is conventional to begin any scientific document with an introduction that explains why the subject matter is important. Let us break with tradition and observe that in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously

wrong and unnecessary.

Hear that? Next time you fit a straight line to your data, consider that you’re probably wasting your time. Stop pandering to style to get a “catchy punchline and compact, approximate representations”.

The problem is this:

It is a miracle with which we hope everyone reading this is familiar that

if you have a set of two-dimensional points (x, y) that depart from a perfect, narrow, straight line y = mx + b only by the addition of Gaussian-distributed noise of known amplitudes in the y-direction only, then the maximum likelihood or best-fit line for the points has a slope m and intercept b that can be obtained justifiably by a perfectly linear matrix-algebra operation known as “weighted linear least-square fitting”. This miracle deserves contemplation.
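For the curious, that matrix-algebra operation fits in a few lines of NumPy. This is a minimal sketch on invented data (the points and uncertainties below are made up for illustration, not taken from the paper):

```python
import numpy as np

# Invented data: points near y = 2x, with known Gaussian
# uncertainties sigma_y in the y-direction only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma_y = np.array([0.3, 0.2, 0.4, 0.3, 0.2])

# Design matrix A with columns [1, x]; inverse-covariance matrix C^-1
# built from the known per-point variances.
A = np.vander(x, 2, increasing=True)
Cinv = np.diag(1.0 / sigma_y**2)

# Solve the normal equations (A^T C^-1 A) [b, m]^T = A^T C^-1 y.
b, m = np.linalg.solve(A.T @ Cinv @ A, A.T @ Cinv @ y)
print(f"m = {m:.3f}, b = {b:.3f}")
```

Note that everything here is linear algebra, no iteration required — which is exactly why the procedure is so seductive, and so easy to apply where its assumptions don’t hold.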

To cater for the situations where the standard least-squares fit is inappropriate, there appears to be

a wide range of possible options with little but aesthetics to distinguish them

and

perhaps because of this plethora of options, or perhaps because there is no agreed-upon bible or cookbook to turn to, or perhaps because most investigators would rather make some stuff up that works “good enough” under deadline, or perhaps because many realize, deep down, that much fitting is really unnecessary anyway, there are some egregious procedures and associated errors and absurdities in the literature.

With that out of the way, Hogg goes on to give a very clear and in-depth overview of the implications of fitting a straight line to a dataset, commonly made errors, and appropriate ways of dealing with outliers and uncertainties.

The introduction to this paper amused me greatly. But after reading it and recognising myself in the careless investigator, I’m ashamed. I want to go back and check every straight line fit I’ve ever done; I’m pretty sure 90% of them were wrong or at least unnecessary. How will I ever publish again?

My own feelings aside, this is a wonderfully informative paper on a topic most of us consider incredibly dull (cue: more shame). But data fitting matters, in astrophysics and every other kind of data-driven investigation. We should understand the limitations of using approximate measures to represent information.

David Hogg is the data scientist’s Cassandra – but let’s hope more people are paying attention. At AstroInformatics in June he gave an excellent talk (pdf) about his vision of data-driven astronomy. Semantic astronomy, he says, is doomed. We are in danger of becoming overly reliant on catalogs, forgetting that their content consists of meta-data rather than absolute truth. Great stuff.

So scientists, all of you, next time your points look like they might follow a trend, resist that itch. Stand up to the least squares fit! Just say no.

David W. Hogg, Jo Bovy & Dustin Lang (2010). Data analysis recipes: Fitting a model to data. arXiv:1008.4686v1

I found the tone of the paper off-putting. Hogg said that even professional statisticians get this wrong, so how do we know *he* knows what he’s talking about?

Can I be an absolute ass and point out two typos?

1) To cater for the situations where the standard least-*squares*…

2) …associated *errors* and absurdities in the literature.

To be fair, David Hogg did actually agree that not all semantic astronomy is doomed, just a rather naive interpretation of it.

Paul – pedants are welcome here. Typos corrected, thanks.

Matthew – he did, and he had positive suggestions. Anyone interested in semantic astronomy should look up his talk (linked in the text).

This is a cultural problem, really — we’ve all been at plenary talks where someone draws a straight line through four data points and claims a trend. C’mon, folks, we’re *scientists*, so instead of taking the lowest common denominator (least-squares), at least put in some effort and draw error bars, and try out alternative regression schemes. The universe isn’t handing us linear trends on a plate, but rather providing fascinating non-linearities that we have to search for.

Just reading your excerpts and the abstract, I get the impression that the author is one of those people who believes that nothing should be published until a proper model can be generated for the data.

This ignores the value of “getting it out there” so that others can think about it, and start examining other datasets to see if the ridiculously simple (i.e. linear) model is even generalizable.

It’s not even about the question of linear vs. non-linear.

When you fit a straight line to data with any of the usual textbook methods, you are assuming a model, which is that before measurement errors, the true quantities lie along a straight line with zero intrinsic scatter and no outliers. This is hardly ever the case. Because people take fitting a straight line for granted, they don’t think of it as a model assumption.

There are relatively simple ways of including terms in the model to account for intrinsic scatter, and slightly more sophisticated ways of including a term that represents a percentage of outliers. People should be using these more often. That is a lot of what Hogg’s paper is about.
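As a rough sketch of the kind of mixture model described here: each point is either “good” (drawn from the line, with measurement error plus intrinsic scatter) or an outlier from a broad Gaussian background, with a free mixing fraction. This is not Hogg’s exact parameterisation, and it uses SciPy’s L-BFGS-B optimiser on synthetic data purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

# Synthetic data: a true line with intrinsic scatter, contaminated
# by a handful of outliers.
m_true, b_true, scatter_true = 2.0, 1.0, 0.5
x = np.sort(rng.uniform(0.0, 10.0, 50))
sigma_y = np.full_like(x, 0.3)                  # known measurement errors
y = (m_true * x + b_true
     + rng.normal(0.0, scatter_true, x.size)
     + rng.normal(0.0, sigma_y))
y[:5] = rng.uniform(0.0, 25.0, 5)               # replace 5 points with outliers

def negloglike(theta):
    """Each point is 'good' (on the line, with intrinsic variance e^lnV)
    with probability 1 - f, or drawn from a broad Gaussian background
    with probability f."""
    m, b, lnV, f, mu_bg, lnV_bg = theta
    var_good = sigma_y**2 + np.exp(lnV)         # measurement + intrinsic
    var_bad = sigma_y**2 + np.exp(lnV_bg)
    p_good = (np.exp(-0.5 * (y - (m * x + b))**2 / var_good)
              / np.sqrt(2.0 * np.pi * var_good))
    p_bad = (np.exp(-0.5 * (y - mu_bg)**2 / var_bad)
             / np.sqrt(2.0 * np.pi * var_bad))
    return -np.sum(np.log((1.0 - f) * p_good + f * p_bad))

theta0 = [1.0, 0.0, 0.0, 0.1, 10.0, 3.0]
res = minimize(negloglike, theta0, method="L-BFGS-B",
               bounds=[(None, None), (None, None), (-10, 10),
                       (1e-3, 0.5), (None, None), (-10, 10)])
m_fit, b_fit = res.x[:2]
print(f"m = {m_fit:.2f}, b = {b_fit:.2f}, outlier fraction = {res.x[3]:.2f}")
```

The point of the exercise is that the outliers get absorbed by the background term rather than dragging the slope around, and the intrinsic scatter is estimated rather than silently assumed to be zero.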

For an example of how to do fitting with intrinsic scatter, that includes code, see B. Kelly, 2007, ApJ, 665, 1489.

One of the reasons people should worry about this is that there is a tendency in the community to just grab a fitting method out of Isobe et al. 1990, and some of those methods, especially bisector fits, have poor statistical justification. For example, buried in a paper I have written on Tully-Fisher, there is a discussion of how, when you have lots of intrinsic scatter and a selection limit, bisector fits can give you a very wrong answer. Kelly 2007 works a fit method out more rigorously than I did.