Thursday, June 20, 2013

Simple Model + Small Data is the only way to go?

A question that bothers me:

1. simple model + small data: OK, if you have a good understanding of the generative process.
2. complex model + small data: most likely fails due to overfitting, unless the noise is so small that the system is almost deterministic (see the first sketch after this list).
3. simple model + big data: computing the parameter estimates is challenging, although sometimes tractable with techniques like SGD (see the second sketch after this list); but frequentist hypothesis testing almost always fails, because with this much data even a tiny misspecification of the oversimplified model becomes statistically detectable.
4. complex model + big data: computing the parameter estimates is practically impossible.
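
On point 2, here is a minimal sketch of the overfitting failure: a degree-9 polynomial fit to ten noisy points chases the noise and predicts fresh data badly. The data-generating process is my own illustrative choice (Python/NumPy):

    import numpy as np

    rng = np.random.default_rng(0)

    # Small data: ten noisy observations of a simple underlying trend.
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=10)

    # Complex model: a degree-9 polynomial has as many parameters as
    # data points, so it can interpolate the noise almost exactly.
    coef = np.polyfit(x, y, deg=9)
    train_mse = np.mean((np.polyval(coef, x) - y) ** 2)

    # Fresh data from the same process exposes the overfit.
    x_new = rng.uniform(0, 1, size=1000)
    y_new = np.sin(2 * np.pi * x_new) + 0.3 * rng.normal(size=1000)
    test_mse = np.mean((np.polyval(coef, x_new) - y_new) ** 2)

    print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")

The training error is essentially zero while the test error is large, which is exactly the failure mode of case 2.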
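
And on point 3, a minimal sketch of the tractable part: SGD fitting a simple linear model to a data stream one observation at a time, with constant memory no matter how long the stream runs. The synthetic stream and the step-size schedule are my own illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5                        # simple model: linear, few parameters
    w_true = rng.normal(size=d)  # ground-truth coefficients
    w = np.zeros(d)              # running SGD estimate

    # Stream observations one at a time, as if the full data set
    # were too big to hold in memory.
    for t in range(1, 200_001):
        x = rng.normal(size=d)
        y = x @ w_true + 0.5 * rng.normal()  # noisy linear response
        grad = (x @ w - y) * x               # gradient of the squared loss
        w -= (0.1 / np.sqrt(t)) * grad       # decaying step size

    print("estimation error:", np.linalg.norm(w - w_true))

Note that this only addresses the computational half of point 3; the statistical objection (a goodness-of-fit test rejecting the oversimplified model) remains.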

Conclusion: a good statistician should only work with case 1, simple model + small data!?

Thursday, June 13, 2013

The Signal and the Noise

At last, I have finished reading 'The Signal and the Noise: Why So Many Predictions Fail - but Some Don't'. It does a very good job of explaining 1) why it is important for us to understand uncertainty - the central theme of statistics, 2) why statistical analysis is so challenging, and 3) how we can (sometimes) improve a model, all in plain English (that is, without statistical jargon). The book shines most in its careful selection of problems: baseball, earthquakes, the stock market, chess, and terrorism are very good examples that show different aspects of the 'prediction problem'.

I would strongly recommend this book to anyone interested in understanding why people are making such a big fuss about Big Data/machine learning/statistics. As Larry Wasserman pointed out in his blog post, however, its treatment of frequentist statistics is very unfair... and I feel very uncomfortable every time a Bayesian claims that Bayesian statistics is a magic bullet for every problem frequentist statistics has. But this is a pop science book after all... it was probably a necessary sacrifice to deliver the ideas to non-academics.