Big Data, and The Laws of Statistical Analyses

The work I described in yesterday’s post has got me thinking lots about statistics. I used to hate statistics in school and at university. During his talk at NAM, Mark Thompson, the astronomer at the University of Hertfordshire whose recent work was the basis for mine, proclaimed, “Hey! I discovered I like statistics”. I laughed when he said that, because this recent paper had exactly the same effect on me. I wear my Histogram Girl badge with much pride!

(I should have mentioned, by the way, that you really shouldn’t go looking for pretty pictures of bubbles in the paper. There’s only one.)

One man who says sensible things about statistics is neuroscientist Bradley Voytek. I really enjoyed the post he wrote today on O’Reilly Radar entitled “Automated science, deep data and the paradox of information”, on the potential of Big Data and its pitfalls. He states the following three laws of statistical analysis, based on Arthur C. Clarke’s well-known “Three Laws”:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

After spending the last few months knee-deep in histograms, correlation functions and statistical tests, I find these three points very relevant. Statistics, and the mathematical methods associated with them, are an immensely powerful tool for turning data into information – indeed, when datasets become so large or multi-dimensional that one mind can’t gain an overview of them (and the Milky Way Project bubbles are perhaps just at that limit), they’re the only tool.

But it’s incredibly easy to “over-statisticise”, to venture so far away from the data that you lose sight of what is being measured. You can still produce clever-looking plots and numbers that would convince all but the most pedantic of readers (which may or may not be your peer reviewer). It’s important to stay as close to the data as possible and find the right method to answer the question at hand – and that is the difficult bit with a statistical analysis.

 

In bubbles too, correlation ≠ causation


The last couple of weeks saw the culmination for me of several months’ worth of hard work on a paper following up on our exciting new Milky Way Project catalogue. With the help of a number of MWP science teamers, I performed a statistical study looking at the correlation of our sample of 5000-odd bubbles with a catalogue of known massive young stars detected by the infrared Midcourse Space Experiment (MSX) satellite. What we found was that these two types of sources are strongly associated with one another, which is not unexpected. But we also noticed that the largest of our bubbles appear to have a disproportionately large number of massive young stars around their edges, which is a more exciting find. It confirms results from a very recent study by UK/Australian/German colleagues.

From previous studies of smaller samples of infrared bubbles, we know that many of these beautiful sources form around massive young stars or clusters, which clear away their surrounding cloud material with powerful UV radiation and stellar winds. The resulting cavity filled with hot dust glows brightly in the infrared (the red stuff), and the complex carbon-based molecules in the rims are excited by the stars’ UV photons (the green stuff).

Studying the correlation between massive young stars on the one hand and bubbles on the other can tell us two things: (i) can we detect the massive young stars at the centre of the bubbles in the young stars catalogue, and (ii) is there any evidence of triggered star formation happening on the outskirts of the bubbles? The answer to (i) was pretty easy to test with our methods and resulted in a statistically resounding “yes”: the data tell us that we find loads of massive young sources in bubble interiors – far more than we’d expect from chance alignments.
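To make the “chance alignments” idea a bit more concrete, here’s a minimal sketch in Python of one way such a test could look. It is not the analysis from the paper: the positions are synthetic, the sky is treated as a flat 2D box, and the function names (count_matches, chance_alignment_test) are made up purely for illustration. The idea is simply to count how many sources fall inside bubbles, then repeat the count many times with the sources scattered at random.

```python
import numpy as np

def count_matches(bubbles_xy, bubble_radii, sources_xy):
    """Count sources lying inside at least one bubble (flat 2D geometry)."""
    # Pairwise source-to-bubble distances, shape (n_sources, n_bubbles)
    diff = sources_xy[:, None, :] - bubbles_xy[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return int(np.any(dist <= bubble_radii[None, :], axis=1).sum())

def chance_alignment_test(bubbles_xy, bubble_radii, sources_xy, extent,
                          n_trials=200, seed=0):
    """Compare the observed match count with counts for randomly placed sources."""
    rng = np.random.default_rng(seed)
    observed = count_matches(bubbles_xy, bubble_radii, sources_xy)
    random_counts = np.empty(n_trials, dtype=int)
    for i in range(n_trials):
        # Scatter the same number of sources uniformly over the box
        fake = rng.uniform(0.0, extent, size=sources_xy.shape)
        random_counts[i] = count_matches(bubbles_xy, bubble_radii, fake)
    # Fraction of random trials that match at least as often as the real data
    p_value = (np.sum(random_counts >= observed) + 1) / (n_trials + 1)
    return observed, random_counts.mean(), p_value

# Synthetic example (assumed numbers): 50 bubbles of radius 1 in a 100 x 100 box
rng = np.random.default_rng(42)
extent = np.array([100.0, 100.0])
bubbles_xy = rng.uniform(0.0, extent, size=(50, 2))
bubble_radii = np.ones(50)

# Half of the 200 sources sit near bubble centres, half are scattered at random
hosts = bubbles_xy[rng.integers(0, 50, size=100)]
clustered = hosts + rng.uniform(-0.5, 0.5, size=(100, 2))
scattered = rng.uniform(0.0, extent, size=(100, 2))
sources_xy = np.vstack([clustered, scattered])

observed, expected, p_value = chance_alignment_test(
    bubbles_xy, bubble_radii, sources_xy, extent)
print(f"observed: {observed}, expected by chance: {expected:.1f}, p = {p_value:.3f}")
```

A real catalogue would need proper angular separations on the sky and a random sample that follows the actual source distribution along the Galactic plane rather than a uniform box, so treat this as a toy only.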

Triggering is a special mode of star formation that we think might occur when energetic events, such as supernova explosions or bubble expansions, shock and compress molecular gas around them, causing dense pockets to collapse and form new stars in regions where this would otherwise not have happened.

Many papers have been published in recent years studying this phenomenon in theoretical calculations and simulations, and showing tentative evidence in observations. Triggered star formation is a potentially important phenomenon, as it might allow star formation to sort of “daisy-chain” through a galaxy, with each generation of young stars providing the energetic kick into the surrounding gas to set off the next.

So in this paper, I show how the correlation between our bubbles and the catalogue of MSX sources (called the RMS catalogue, curated by the Leeds astrophysics group) paints a picture that is possibly consistent with triggered star formation happening around the largest of the MWP bubbles.

The really important caveat to the work is that this association does not imply that triggering is really happening. With bubbles, like with everything, correlation does not equal causation. The analysis I performed looks at a simple 2D projection of these objects on the sky, ignoring the 3D structure of both the bubbles themselves and of the Milky Way Galaxy. And demonstrating this causal effect between one newly born cluster and new stars forming in the same area is a really tough challenge that I’d argue very few authors have convincingly overcome (though I haven’t read every single paper).

I presented this work at the joint UK/German National Astronomy Meeting in Manchester earlier this week, which Rob wrote about on the Milky Way Project blog (with photographic evidence). Rob & Chris’ Recycled Electrons podcast also throws some random thoughts on the work around in typical style, and Will Gater interviewed Rob and me for Sky at Night magazine.

Of course, there’s an awful lot more to this work than I can capture here or in a 15-minute talk. If you’re interested, check out the paper on astro-ph but note that it’s not actually been accepted for publication yet – so all findings should be considered preliminary. I also submitted my Python code to the journal so that should be made available once the paper gets published as well. All data I used for the analysis are publicly available from either the MWP webpages or the RMS database.

The Manchester NAM was excellent fun. I heard some great talks and met lots of interesting like-minded people. Since becoming a conference organiser myself I really appreciate a smoothly run event – so big thanks to the organisers for that.

Here are the details of the paper:

Sarah Kendrew, Robert J. Simpson, Eli Bressert, Matthew S. Povich, Reid Sherman, Chris Lintott, Thomas P. Robitaille, Kevin Schawinski, & Grace Wolf-Chase (2012). The Milky Way Project: A statistical study of massive star formation associated with infrared bubbles. ApJ, submitted. arXiv:1203.5486v1

Comments welcome!

xkcd: Conditional Risk

While I’m busy packing my life into boxes, here’s an xkcd to make you smile.

[Image: xkcd comic “Conditional Risk”]

Scientific hubris, or: Everything you thought you knew about straight line fits is wrong

Think you’ve got your least squares down to a tee? Think again.

In a paper posted to the arXiv in late August, David Hogg of NYU and his collaborators take us to task for our sloppy data fitting habits. And he’s not in the mood to mince his words.

“It is conventional to begin any scientific document with an introduction that explains why the subject matter is important. Let us break with tradition and observe that in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary.”

Hear that? Next time you fit a straight line to your data, consider that you’re probably wasting your time. Stop pandering to style to get a “catchy punchline and compact, approximate representations”.


Lucia cleared, Dutch justice shamed

Quick update from the frontlines of judicial excellence. As expected, nurse Lucia de Berk was cleared of all murder charges by the court of Arnhem last week, on 14 April. The case has been extensively covered in the Dutch media, with some frank editorials, most of which are sadly hiding behind a paywall. The Haga Hospital, which owns the Juliana Children’s Hospital where Lucia worked at the time of her arrest, will pay her 45,000 euro in compensation for wrongfully firing her. While that’s a decent amount of money, given that the hospital’s own apparently shabby internal investigation led to her arrest in the first place, I think it’s a pretty measly gesture. The hospital’s own statement is very brief and terse.

Everyone’s been falling over each other to apologise to Lucia for this awful miscarriage of justice – Justice Minister Ernst Hirsch Ballin, Harm Brouwer, Chairman of the Public Prosecution – and apparently negotiations on what compensation she will receive from the government are ongoing.

As usual the best coverage comes from GeenStijl, the Netherlands’ answer to The Onion, who report that Lucia has signed up to star in Kafka: The Musical. If you know Dutch, go read.

Here’s a short news report in Dutch from NOS: