\sum_{i,j,k} d_{ijk}

Tuesday, September 7, 2010

What does 'probability' mean in Statistics?

I majored in Industrial Engineering and Mathematics as an undergraduate, and switched to Statistics when I came to graduate school. So frankly I have no philosophy of Statistics yet: I'm in the process of "building" it.

I've mainly taken courses on mathematical statistics, probability, and computational statistics. Those courses only deal with theoretical, well-defined problem settings, so although they helped me refine my mathematical skills, they didn't really help me build a philosophy. Now I'm taking 'STAT525: Intermediate Statistical Methodology', and it makes me think more about the foundations of Statistics.

When I first heard about Bayesian Statistics as an undergraduate, it didn't appeal much to me, since the Bayesians' notion of "personal" probability didn't make much sense. In most cases, nobody knows which prior to use. Then what does a posterior probability mean?

But it seems that even frequentists are not that rigorous in interpreting probability when it comes to applying their theory (which should be the ultimate goal of all statistical research). When using linear regression models, we all know that no error is really Gaussian, so every goodness-of-fit test rejects normality when the number of data points is sufficiently large. However, we still use the usual regression models since they are "robust" to deviations from normality: our estimates and confidence intervals may not vary too much under mild conditions. How can we interpret probabilities, then? For example, how can we interpret what a 95% confidence interval means when it does not contain the true parameter with 95% frequency?
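To make the last question concrete, here is a minimal simulation sketch. It is a simplification of my own: instead of a regression coefficient it looks at the plain normal-theory 95% interval for a mean, and the function name `coverage`, the sample size of 50, and the exponential distribution for the "non-Gaussian" case are all assumptions chosen only for illustration. It counts how often the nominal 95% interval actually contains the true mean.

```python
import random
import statistics

# Two-sided 95% critical value of the standard normal (about 1.96).
Z_95 = statistics.NormalDist().inv_cdf(0.975)


def coverage(draw, true_mean, n=50, reps=2000, seed=0):
    """Fraction of nominal 95% intervals for the mean that contain true_mean.

    `draw` is a hypothetical callback: it takes a random.Random instance
    and returns one observation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        center = statistics.mean(sample)
        half_width = Z_95 * statistics.stdev(sample) / n ** 0.5
        if center - half_width <= true_mean <= center + half_width:
            hits += 1
    return hits / reps


# Gaussian data: the setting the interval is derived for.
gaussian = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
# Skewed data: exponential with rate 1, whose true mean is 1.
skewed = coverage(lambda rng: rng.expovariate(1.0), true_mean=1.0)
print(f"Gaussian: {gaussian:.3f}, exponential: {skewed:.3f}")
```

Neither number has to equal 0.95 exactly; the point is only that the "95%" is a statement about an idealized model, and how close the real frequency comes depends on the sample size and on how far the data are from that model.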

Are Statisticians really in a position to blame data mining/machine learning people because the outcomes of their methods do not have clear probabilistic interpretations? We know that an SVM is asymptotically a median classifier. Don't most statistical methods rely only on asymptotic results, especially when they're applied to practical problems?

How "objective" can a statistical analysis be? How much "objectivity" is required? Can I answer these questions before I get my Ph.D. degree?

Tuesday, July 13, 2010

Statistical Research and Ethics

In my opinion, Statistics is an attempt to draw a line between "speakable"s and "unspeakable"s. "Is there really a relationship between X and Y?" Usually you cannot say "yes" or "no" for sure, but there is always something left for you to "speak". Statistical Theory is about what is left.

In this respect, statistical analysis is an ethical deed, because speaking of the unspeakable is unethical. This is why I'm infuriated when some people blur this "line" between the two instead of making it clearer. I'm sick of people exaggerating (especially with those "social network" things) about the data they have in hand, which is why I became a Statistician.

Along this line, speaking of "unspeakable"s by merely introducing a prior distribution whose validity you're not sure of is an unethical deed. One should be very careful when interpreting the outcome of a Bayesian analysis. When we have no idea what the prior distribution is, we also have no idea how the posterior distribution can be interpreted. But it's really tempting to "intentionally" ignore the fact that the selection of the prior was ad hoc.

Hence it is natural to favor a framework of statistical inference in which this distinction between "speakable" and "unspeakable" is clear. This is why there are many more people using frequentist methods than Bayesian methods. However, the Neyman-Pearson framework, the dominant statistical procedure used in hypothesis testing, puts so much emphasis on the null hypothesis that in many cases there's little left to say about the alternative hypotheses. For example, a low p-value indicates that the null hypothesis is not a good model for your data, but it does not necessarily mean there is a good model among the models of the alternative hypothesis. This limitation sometimes leads people to make statements about the "unspeakable"s of the alternative hypothesis, which is also unethical.

Maybe no statistical procedure can be perfect. We conduct statistical analysis because there are uncertainties. If we can speak with 100% certainty, then it is not about statistics. Maybe you're talking about Mathematics (although, as you might know, you cannot always speak with 100% certainty even in Mathematics). But you should also agree that there are "speakable"s in data, because those accumulated "speakable"s have constructed the 'Science' which aids us every day. So we should investigate how we can develop a procedure with which we can easily avoid speaking the "unspeakable" while there's still much left to "speak".

In this respect, I like Liu's theory (and framework) of statistical inference, since it makes statistical statements clear about what is speakable and what is not. So it is less likely than other frameworks to lead us into committing an unethical deed.

http://www.stat.purdue.edu/~chuanhai/research.html

Shamefully, I'm a novice at Statistics, so I have yet to judge the usefulness or impact of what his group has done. However, I like that they are doing research in this direction, since this is what we 'should' do: helping people avoid committing "unethical" deeds. I'm not saying we have a solution now. I'm saying we have something to do, and we're doing it.

Friday, July 2, 2010

My position on Graph Visualization

Actually, I opened this blog again to write this article: while posting a sequence of tweets, I thought I would rather make my point clear.

Here, I would like to share some of my thoughts about Graph Visualization.

What is graph visualization? Take a look at the homepage of people who are quite good at it.
Just take a look at those images! No need to read the details!

http://www.graphviz.org/

See? When we have data about entities and interactions among them, it's useful to visualize it as a graph.

Why is it useful? We human beings are trained to discover the 'structure' of the object we are looking at. For example, when you're looking at a desk, you can easily figure out how one part is physically/functionally connected to other parts. By visualizing a graph, we can make use of the same inborn ability for data-analytic purposes. Take a look at the images from the URL I mentioned above, and try to think of a way to deliver the same information more effectively/economically. I can hardly believe there is one, except in some trivial circumstances.

No matter how attractive the fruit of the task is, it is a notoriously difficult problem. There are several ways of visualizing a graph, but the most common way is to try to keep the distance between two nodes/vertices in the drawing proportional to the geodesic path length on the graph (the original data). That is, we want two nodes to be placed closer together if they're closer on the original graph, and farther apart if they're farther apart. If we have n nodes, there are n(n-1)/2, i.e. on the order of n^2, pairwise distances on the original graph, so it is generally impossible to find a diagram that satisfies every constraint in 2- or 3-dimensional Euclidean space. Therefore what we do is, as in most problems of science/engineering, approximation: doing as well as we can, from a certain perspective.
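A minimal sketch of this idea, under my own assumptions: the graph is connected, unweighted, and given as an adjacency dictionary, and the quality of a drawing is measured by the sum of squared differences between drawn distances and geodesic distances (one common choice, often called "stress", but not the only one). The function names are hypothetical and this is not the algorithm of any particular tool.

```python
from collections import deque
from itertools import combinations
from math import dist


def geodesic_distances(adjacency):
    """All-pairs shortest path lengths (in hops), by BFS from each node.

    Assumes the graph is connected and unweighted.
    """
    result = {}
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen[neighbor] = seen[node] + 1
                    queue.append(neighbor)
        result[source] = seen
    return result


def stress(positions, distances):
    """Sum over node pairs of (drawn distance - geodesic distance)^2."""
    total = 0.0
    for u, v in combinations(positions, 2):
        total += (dist(positions[u], positions[v]) - distances[u][v]) ** 2
    return total


# A cycle on four nodes, drawn as a unit square.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
square = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
d = geodesic_distances(cycle)
print(stress(square, d))
```

Even for this tiny graph the stress is not zero: the four edges are drawn with the right length, but the two diagonals come out as sqrt(2) while the geodesic distance between opposite nodes is 2. A layout algorithm can only try to make this number small, and the number of terms grows quadratically with the number of nodes.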

The site manager of http://www.graphviz.org clearly understands what we look for when we visualize a graph. When the graph is small, there are fewer constraints to satisfy, and the structure (what we're aiming at) is much simpler than in large graphs. I don't think the implementation of GraphViz is less scalable than others. However, some people merely apply the same algorithm that was designed to visualize small graphs to large ones, and the result is like this: http://nodexl.codeplex.com/ . Again, just take a look at the images, the images of those complex graphs. What structure can you find in those graphs? In my opinion, there are only two cases: 1) the graph is so complex that you cannot find any structure, or 2) you can find a structure, but it is misleading because the loss of information is not uniform: certain parts of the information are disproportionately removed to "invent" the structure.

Well, I don't want to offend the developers of NodeXL. Maybe this is the main reason I'm writing a blog post instead of consecutive 140-character tweets. I haven't looked at the details of the software, but I believe the algorithms implemented in NodeXL are the conventional ones used in other applications of graph visualization, and a new implementation with a widely used and convenient interface like Microsoft Excel is something a lot of people have been waiting for. I do believe their contribution is really significant. The visualization of large graphs, however, is not what their algorithms are aimed at, and they're doing a really poor job on the graphs on their website. It may, for a while, attract some of those who are new to this area and have little experience making real "use" of those graphs, but nothing more than that.

I have never seen a meaningful visualization of a graph with >1000 nodes, unless it has a certain special structure. For example, think of a graph which has a circular structure. It can be easily drawn in 2 dimensions. Likewise, planar graphs and graphs with fractal structures are much easier to visualize than others. In other cases, however, I believe the task is hopeless.

To make it worse, people want more than the preservation of graph geodesic distances in their drawings. They want to discover clustered-ness, bipartite-ness (if present), and a lot of other things from the outputs of graph visualization algorithms, since that's quite doable when you're working with small graphs.

Then why do those NodeXL people show visualizations of large graphs, although they're not doing a good job on those graphs? Because in many data analysis projects, the graph you have to deal with is much bigger than those that can be effectively handled by graph visualization algorithms.

So there is a task to be done, which cannot be easily done. In such a case, good scientists make it clear what their contribution is (what they could do) and what the limitation of their work is (what they could not do). Bad scientists blur those two things, which I believe to be highly unethical. Sadly, it's not difficult to find such unethical works, and the images on the NodeXL website remind me a lot of such works. NodeXL is merely a tool (actually an excellent one), and it's not ethical or unethical by itself. But it's questionable whether it can be used in good ways when even its own website does not demonstrate how to make good use of it.

However, it's not easy to criticize people who make misleading visualizations of graphs, since there is no good alternative yet. One professor told me, "do not blame a method before you have an alternative," and it applies to this case. But at least we have to seek a way to make improvements.

From my point of view, merely applying an algorithm which works well on small graphs to large graphs is hopeless, as I keep repeating. One of the reasons is that there are too many things we look for in just one image. We have to be more specific to get some meaningful job done. That's why I'm studying Statistics instead of Computer Science / Numerical Optimization to make some contribution. Methods like these look promising in this respect:

Hoff, P.D., Raftery, A.E., and Handcock, M.S. (2002) "Latent Space Approaches to Social Network Analysis"
Journal of the American Statistical Association, vol. 97, no. 460, 1090-1098.

E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing, Mixed Membership Stochastic Blockmodel, Journal of Machine Learning Research, 9(Sep):1981--2014, 2008. A shorter version of this paper appears in Proceeding of the 22nd Neural Information Processing Systems (NIPS 2008).

I don't think these works are practical / general / comprehensive enough to be used universally, but they are quite promising. And now is the time for me to contribute something to the literature, instead of constantly picking at the defects of others... :)

Thursday, July 1, 2010

Switching the blog "again"

I'm an ignorant first-year Ph.D. student who knows virtually nothing, so there isn't much for me to share with others. That's why I have more blogs than posts: I was simply testing how good each service was. Last time, my impression of blogger.com was pretty bad. For a Korean user, the whole UI looked unintuitive. Now I find a lot of improvements, however, and there is another important reason: I hate the ID/nickname I used before! It was too childish! But wordpress.com did not allow me to change my domain, so I had to switch to a service where that was possible... :D