Big Data, and The Laws of Statistical Analyses

The work I described in yesterday’s post has got me thinking a lot about statistics. I used to hate statistics in school and at university. During his talk at NAM, Mark Thompson, the astronomer at the University of Hertfordshire whose recent work was the basis for mine, proclaimed, “Hey! I discovered I like statistics”. I laughed when he said that, because this recent paper had exactly the same effect on me. I wear my Histogram Girl badge with much pride!

(I should have mentioned, by the way, that you really shouldn’t go looking for pretty pictures of bubbles in the paper. There’s only one.)

One man who says sensible things about statistics is neuroscientist Bradley Voytek. I really enjoyed the post he wrote today on O’Reilly Radar entitled “Automated science, deep data and the paradox of information”, on the potential of Big Data and its pitfalls. He states the following three laws of statistical analysis, based on Arthur C. Clarke’s well-known “Three Laws”:

  1. The more advanced the statistical methods used, the fewer critics are available to be properly skeptical.
  2. The more advanced the statistical methods used, the more likely the data analyst will be to use math as a shield.
  3. Any sufficiently advanced statistics can trick people into believing the results reflect truth.

After spending the last few months knee-deep in histograms, correlation functions and statistical tests, I find these three points very relevant. Statistics, and the mathematical methods associated with them, are an immensely powerful tool for turning data into information. Indeed, when datasets become so large or multi-dimensional that one mind can’t gain an overview of them (and the Milky Way Project bubbles are perhaps just at that limit), they are the only tool.

But it’s incredibly easy to “over-statisticise”, to venture so far away from the data that you lose sight of what is being measured. You can still produce clever-looking plots and numbers that would convince all but the most pedantic of readers (who may or may not include your peer reviewer). It’s important to stay as close to the data as possible and find the right method to answer the question at hand – and that is the difficult bit of a statistical analysis.

 

Comments

  1. Nice post – I’ve not seen these ‘three laws of statistical analysis’ before. I like the underlying philosophy of staying as close to the data as possible. Just enough statistics to do the job, and no more.

  2. It’s great to hear you have learned to like statistics. To paraphrase Jessica Rabbit – they aren’t bad, they are just taught that way. I really like the Arthur C. Clarke quote. I recently wrote a post on how to help people like statistics. You might be interested in it.
    http://www.learnandteachstatistics.wordpress.com
    I also make light-hearted videos about statistical concepts. You can see them on www.youtube.com/creativeheuristics.
    Best wishes on your postdoc.

  3. I agree! Well, it’s Bradley who formulated them in his O’Reilly Radar post so they’re likely to be unfamiliar. I think I may post them on my office wall :-)