Big Data, and The Laws of Statistical Analyses

The work I described in yesterday’s post has got me thinking lots about statistics. I used to hate statistics in school and at university. During his talk at NAM, Mark Thompson, the astronomer at the University of Hertfordshire whose recent work was the basis for mine, proclaimed, “Hey! I discovered I like statistics”. I laughed when he said that, because this recent paper had exactly the same effect on me. I wear my Histogram Girl badge with much pride!

(I should have mentioned, by the way, that you really shouldn’t go looking for pretty pictures of bubbles in the paper. There’s only one.)

One man who says sensible things about statistics is neuroscientist Bradley Voytek. I really enjoyed the post he wrote today on O’Reilly Radar entitled “Automated science, deep data and the paradox of information”, on the potential of Big Data and its pitfalls. He states the following three laws of statistical analysis based on Arthur C. Clarke’s well known “Three Laws”:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

After spending the last few months knee-deep in histograms, correlation functions and statistical tests, these three points feel very relevant. Statistics, and the mathematical methods associated with them, are an immensely powerful tool for turning data into information – indeed when datasets become so large or multi-dimensional that one mind can’t gain an overview of them, and the Milky Way Project bubbles are perhaps just at that limit, it’s the only tool.

But it’s incredibly easy to “over-statisticise”, to venture so far away from the data that you lose sight of what is being measured. You can still produce clever-looking plots and numbers that would convince all but the most pedantic of readers (which may or may not be your peer reviewer). It’s important to stay as close to the data as possible and find the right method to answer the question at hand – and that is the difficult bit with a statistical analysis.

 

In bubbles too, correlation ≠ causation


The last couple of weeks saw the culmination for me of several months’ worth of hard work on a paper following up on our exciting new Milky Way Project catalogue. With the help of a number of MWP science teamers, I performed a statistical study looking at the correlation of our sample of 5000-odd bubbles with a catalogue of known massive young stars detected by the infrared Midcourse Space Experiment (MSX) satellite. What we found was that these two types of sources are strongly associated with one another, which is not unexpected. But we also noticed that the largest of our bubbles appear to have a disproportionately large number of massive young stars around their edges, which is a more exciting find. It confirms results from a very recent study by UK/Australian/German colleagues.

From previous studies of smaller samples of infrared bubbles, we know that many of these beautiful sources form around massive young stars or clusters, which clear away their surrounding cloud material with powerful UV radiation and stellar winds. The resulting cavity filled with hot dust glows brightly in the infrared (the red stuff), and the complex carbon-based molecules in the rims are excited by the stars’ UV photons (the green stuff).

Studying the correlation between massive young stars on the one hand and bubbles on the other, can tell us two things: (i) can we detect the massive young stars at the centre of the bubbles in the young stars catalogue, and (ii) is there any evidence of triggered star formation happening on the outskirts of the bubbles? The answer to (i) was pretty easy to test with our methods and resulted in a statistically resounding “yes”: the data tell us that we find loads of massive young sources in bubble interiors – far more than we’d expect from chance alignments.
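For the curious, here is a rough sketch of what “far more than we’d expect from chance alignments” means in practice. This is not the code from the paper, and it is much cruder than the analysis described there: the function names, the toy positions and the field size are all made up for illustration, coordinates are treated as flat 2D positions, and the “random” sources are scattered uniformly over the field (real sources in the Galactic plane are far from uniform).

```python
import math
import random


def count_inside(bubbles, sources):
    """Count sources that fall inside at least one bubble.

    bubbles: list of (x, y, radius); sources: list of (x, y).
    Positions are flat 2D coordinates (a small-field assumption).
    """
    n = 0
    for sx, sy in sources:
        if any(math.hypot(sx - bx, sy - by) <= r for bx, by, r in bubbles):
            n += 1
    return n


def chance_excess(bubbles, sources, width, height, n_trials=200, seed=0):
    """Compare the observed count with counts from uniformly random sources."""
    rng = random.Random(seed)
    observed = count_inside(bubbles, sources)
    random_counts = []
    for _ in range(n_trials):
        # Scatter the same number of fake sources uniformly over the field
        fake = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in sources]
        random_counts.append(count_inside(bubbles, fake))
    mean_random = sum(random_counts) / n_trials
    # Fraction of random trials that match or exceed the observed count
    p_value = sum(c >= observed for c in random_counts) / n_trials
    return observed, mean_random, p_value


# Hypothetical toy data: two bubbles and eight sources in a 10 x 5 field
toy_bubbles = [(2.0, 2.0, 0.5), (7.0, 3.0, 0.8)]
toy_sources = [
    (2.1, 2.0), (1.8, 2.2), (2.0, 1.6),  # inside the first bubble
    (7.3, 3.2), (6.5, 3.0),              # inside the second bubble
    (4.0, 4.0), (9.0, 1.0), (0.5, 4.5),  # well outside both
]

observed, mean_random, p_value = chance_excess(toy_bubbles, toy_sources, 10.0, 5.0)
print(observed, mean_random, p_value)
```

If the observed count sits well above the typical random count, the association is unlikely to be a coincidence of projection. Note that this only says the two sets of objects line up on the sky; it says nothing about why.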

Triggering is a special mode of star formation that we think might occur when energetic events, such as supernova explosions or bubble expansions, shock and compress molecular gas around them, causing dense pockets to collapse and form new stars in regions where this would otherwise not have happened.

Many papers have been published in recent years studying this phenomenon in theoretical calculations and simulations, and showing tentative evidence in observations. Triggered star formation is a potentially important phenomenon, as it might allow star formation to sort of “daisy-chain” through a galaxy, with each generation of young stars providing the energetic kick into the surrounding gas to set off the next.

So in this paper, I show how the correlation between our bubbles and the catalogue of MSX sources (called the RMS catalogue, curated by the Leeds astrophysics group) paints a picture that is possibly consistent with triggered star formation happening around the largest of the MWP bubbles.

The really important caveat to the work is that this association does not imply that triggering is really happening. With bubbles, as with everything, correlation does not equal causation. The analysis I performed looks at a simple 2D projection of these objects on the sky, ignoring the 3D structure of both the bubbles themselves and of the Milky Way Galaxy. And demonstrating this causal effect between one newly born cluster and new stars forming in the same area is a really tough challenge that I’d argue very few authors have convincingly overcome (though I haven’t read every single paper).

I presented this work at the joint UK/German National Astronomy Meeting in Manchester earlier this week, which Rob wrote about on the Milky Way Project blog (with photographic evidence). Rob & Chris’ Recycled Electrons podcast also throws some random thoughts on the work around in typical style, and Will Gater interviewed Rob and me for Sky at Night magazine.

Of course, there’s an awful lot more to this work than I can capture here or in a 15-minute talk. If you’re interested, check out the paper on astro-ph but note that it’s not actually been accepted for publication yet – so all findings should be considered preliminary. I also submitted my Python code to the journal so that should be made available once the paper gets published as well. All data I used for the analysis are publicly available from either the MWP webpages or the RMS database.

The Manchester NAM was excellent fun. I heard some great talks and met lots of interesting like-minded people. Since becoming a conference organiser myself I really appreciate a smoothly run event – so big thanks to the organisers for that.

Here are the details of the paper:

Sarah Kendrew, Robert J. Simpson, Eli Bressert, Matthew S. Povich, Reid Sherman, Chris Lintott, Thomas P. Robitaille, Kevin Schawinski, & Grace Wolf-Chase (2012). The Milky Way Project: A statistical study of massive star formation associated with infrared bubbles. ApJ, submitted. arXiv:1203.5486v1

Comments welcome!

What’s our greatest weakness?

I’m curious: What do ya’ll think is the bit of professional astronomy that most needs to be changed? Regardless of government funding levels, is there one thing that’s holding us back from being the best astronomers we can be more than others? What’s our greatest weakness? Is it the disconnect between course work (theory) and practical astronomy (programming)? Disconnect between telescope time and funding? Not enough support for career tracks other than academia? Not enough open access to results? Competitive culture? Not competitive enough? If there was one thing you could change about our culture and traditions that would have the biggest impact on making astronomy more productive as a whole and an even better career choice than it is now, what would it be?

These questions by Kelle Cruz over on Astrobetter have sparked a pretty lively discussion, about careers, money, bad behaviour, and short-termism in science.

I was particularly piqued by one commenter, who seems to suggest that we shouldn’t make astronomy too attractive a career, as there are too many of us already. “We are all in it for the thrills of science.” Right. (In fairness, he does go on to mitigate the statement. But still.)

Got a bee in your bonnet? Go comment here.

Champagne and Chocolate

Many of my recent blog posts have been about the Milky Way Project, and there’s a good reason for that. The publication of our first paper, which is in press at the moment with Monthly Notices, was just a first big milestone, with more to come. I’m currently writing a follow-up paper using the initial data catalogues, and as I’m scheduled to give a talk about it at the end of the month at the joint UK/German National Astronomy Meeting in Manchester, I’d better get a move on with getting the results out.

The paper won’t be the photogenic blockbuster that Rob wrote for us, but just in case you don’t share my histogram-fetish (… you simple soul!), I’ve managed to find space for one rather sexy bubble picture to add a bit of spice. If and when the paper gets accepted I’ll instruct the editor to place it on Page 3.

My own data adventures aside, this week was another heap of fun for the project. NASA put out a press release to mark the first data release. It didn’t get picked up in too many places – there was Astronomy Magazine, Space.com, and also a short piece in the Mail Online with obligatory pretty pics of the Spitzer images and our MWP heat maps. The Mail upped Eli Bressert’s “champagne bubble” quote to liken the Milky Way to a nougat-y chocolate bar.

If I’m being a pedantic scientist, I should add that neither of those analogies is actually very accurate. Champagne bubbles are maybe somewhat similar in that they’re lighter than the liquid they’re in, but our interstellar bubbles aren’t thought to be floating or rising through the interstellar medium. But they do expand. As for chocolate bars… No, that doesn’t work either.

At Milky Way Project HQ, we launched a new phase of the project. While we continue to collect your ‘regular’ bubble drawings, we’ve now added close-up images of bubbles that are already in the catalogue, for which we’re trying to get more precise sizes and thicknesses. Rob explains all here. Our drawing tools were fairly coarse, as some users had remarked, particularly for drawing smaller bubbles. So with these new images we will try to gather more precise measurements.

I’m really looking forward to the NAM conference later this month. I haven’t been to one of these meetings since the first year of my PhD (Dublin!), and they’re great for catching up with old friends and colleagues. Having it joint with its German equivalent meeting (the AG) means that both old and new friends will be at the meeting. Another factlet is that I’m actually half-Mancie, and although my association with the city is pretty patchy (what, you haven’t noticed my striking Northern accent?), it’s fun to be there.

The Sky At Night this week

This week’s edition of the BBC’s The Sky at Night is about Citizen Astronomy:

Amateur astronomers are scanning the night skies looking for asteroids, comets and supernovae, and making vital discoveries in our quest for knowledge. Meanwhile space missions produce millions of images, but who is to say which ones are truly unusual and interesting? It is a job that computers struggle with, but one in which humans excel. This, more than ever, is the age of the amateur astronomer and Sir Patrick Moore explains how everybody can play a part whilst also enjoying the beautiful cosmos.

The programme will feature .Astronomy chief honcho and Milky Way Project PI Rob Simpson. The programme is repeated several times over the week; check out times here. If you need any additional reasons to watch, apparently it’s also Sir Patrick Moore’s birthday. Happy birthday Patrick!