Friday, July 2, 2010

My position on Graph Visualization

Actually, I reopened this blog to write this article: while posting a sequence of tweets, I realized I would rather make my point clear in one place.

Here, I would like to share some of my thoughts about Graph Visualization.

What is graph visualization? Take a look at the homepage of people who are quite good at it.
Just look at those images! No need to read the details!

http://www.graphviz.org/

See? When we have data about entities and interactions among them, it's useful to visualize it as a graph.

Why is it useful? We human beings are trained to discover the 'structure' of the objects we look at. For example, when you look at a desk, you can easily figure out how one part is physically and functionally connected to the others. By visualizing a graph, we put that natural-born ability to work for data analysis. Take a look at the images from the URL above, and try to think of a way to deliver the same information more effectively or economically. I hardly believe there is one, except in some trivial circumstances.



No matter how attractive the fruit of the task is, it is a notoriously difficult problem. There are several ways of visualizing a graph, but the most common approach tries to make the distance between two nodes/vertices in the drawing proportional to the geodesic (shortest-path) distance between them on the graph (the original data). That is, we want two nodes to be placed closer together if they are closer on the original graph, and vice versa. If we have n nodes, there are on the order of n^2 pairwise distances (n(n-1)/2 unordered pairs) on the original graph, so it is generally impossible to find a drawing that satisfies every constraint in 2- or 3-dimensional Euclidean space. Therefore what we do is, as in most problems of science and engineering, approximation: doing as well as we can, from a certain perspective.
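Here is a minimal sketch of this idea in Python (assuming networkx, scikit-learn, and matplotlib are installed). This is not what Graphviz does internally; it simply illustrates the distance-preserving formulation above: compute all geodesic distances, then ask metric MDS for 2-D coordinates that approximate them.

    import networkx as nx
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import MDS

    # Any small connected example graph will do here.
    G = nx.petersen_graph()
    nodes = list(G.nodes())

    # All-pairs geodesic (shortest-path) distances on the graph.
    lengths = dict(nx.all_pairs_shortest_path_length(G))
    D = np.array([[lengths[u][v] for v in nodes] for u in nodes])

    # Metric MDS searches for 2-D coordinates whose Euclidean distances
    # approximate D. It is only an approximation: the ~n^2 constraints
    # cannot all be satisfied exactly in the plane.
    coords = MDS(n_components=2, dissimilarity='precomputed').fit_transform(D)

    pos = {u: coords[i] for i, u in enumerate(nodes)}
    nx.draw(G, pos=pos, with_labels=True)
    plt.show()

The residual error that MDS minimizes here, the "stress," is essentially the same quantity that stress-based layout algorithms such as Kamada-Kawai minimize.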

The site manager of http://www.graphviz.org clearly understands what we look for when we visualize a graph. When a graph is small, there are fewer constraints to satisfy, and the structure (what we are after) is much simpler than in large graphs. I don't think the implementation of Graphviz is any less scalable than others. However, some people simply apply the same algorithms, designed to visualize small graphs, to large ones, and the result looks like this: http://nodexl.codeplex.com/ . Again, just look at the images of those complex graphs. What structure can you find in them? In my opinion, there are only two cases: 1) the graph is so complex that you cannot find any structure, or 2) you can find a structure, but it is misleading because the loss of information is not uniform: certain parts of the information are removed disproportionately to "invent" the structure.

Well, I don't want to offend the developers of NodeXL. Perhaps this is the main reason I am writing a blog post instead of a string of 140-character tweets. I haven't looked at the software in great detail, but I believe the algorithms implemented in NodeXL are the conventional ones used in other graph visualization applications, and a fresh implementation behind a widely used, convenient interface like Microsoft Excel is something a lot of people have been waiting for. I do believe their contribution is significant. The visualization of large graphs, however, is not what their algorithms are designed for, and they do a really poor job on the graphs shown on their website. Those images may attract newcomers to the area who have little experience actually making "use" of such graphs, but nothing more than that.

I have never seen a meaningful visualization of a graph with more than 1,000 nodes, unless the graph has some special structure. For example, think of a graph with a circular structure: it can easily be drawn in two dimensions. Likewise, planar graphs and graphs with fractal structure are much easier to visualize than others. In other cases, however, I believe the task is hopeless.
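As a quick illustration of that contrast, here is a sketch in Python with networkx and matplotlib (the graphs and parameters are made up for the example): the same force-directed layout that recovers a cycle at a glance turns a dense random graph into a hairball.

    import networkx as nx
    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

    # A graph with circular structure: the layout recovers it faithfully.
    ring = nx.cycle_graph(30)
    nx.draw(ring, pos=nx.spring_layout(ring, seed=0), ax=ax1, node_size=30)
    ax1.set_title('cycle graph: structure visible')

    # A dense random graph: the same algorithm yields a hairball.
    blob = nx.gnp_random_graph(300, 0.1, seed=0)
    nx.draw(blob, pos=nx.spring_layout(blob, seed=0), ax=ax2, node_size=10)
    ax2.set_title('dense random graph: no visible structure')

    plt.show()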

To make matters worse, people want more than the preservation of geodesic distances in their drawings. They want to discover clusteredness, bipartiteness (if present), and much else from the output of a graph visualization algorithm, since all of that is quite doable when you are working with small graphs.
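Those properties do not have to be read off a picture, of course; they can be computed directly. A small sketch (Python/networkx, with toy graphs of my own choosing) of the kinds of quantities people hope to see in a drawing:

    import networkx as nx

    # A small bipartite graph: in a drawing, the two layers would be obvious.
    B = nx.complete_bipartite_graph(4, 6)
    print(nx.is_bipartite(B))        # True

    # Two dense clusters joined by a single bridge edge.
    G = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
    G.add_edge(4, 5)
    print(nx.average_clustering(G))  # close to 1: strongly clustered

For a few dozen nodes, a drawing conveys both facts instantly; for thousands of nodes, only the computed numbers remain reliable.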

Then why do the NodeXL people show visualizations of large graphs, even though their algorithms do a poor job on them? Because in many data analysis projects, the graph you have to deal with is much bigger than anything graph visualization algorithms can handle effectively.

So there is a task that needs to be done but cannot easily be done. In such a case, good scientists make it clear what their contribution is (what they could do) and what the limitations of their work are (what they could not do). Bad scientists blur the two, which I believe is highly unethical. Sadly, it is not difficult to find such works, and the images on the NodeXL website remind me of many of them. NodeXL is merely a tool (an excellent one, actually), and a tool is neither ethical nor unethical by itself. But it is questionable whether it will be used well when even its own website does not demonstrate how to make good use of it.



However, it is hard to criticize people who make misleading visualizations of graphs, since there is no good alternative yet. A professor once told me, "Do not blame a method before you have an alternative," and that applies here. But at the very least, we should be looking for ways to improve.

From my point of view, merely applying algorithms that work well on small graphs to large graphs is hopeless, as I keep repeating. One reason is that we ask a single image to answer too many questions at once. We have to be more specific about what we want in order to get meaningful work done. That is why I am studying statistics, rather than computer science or numerical optimization, to make my contribution. Methods like the following look promising in this respect:

Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). "Latent Space Approaches to Social Network Analysis." Journal of the American Statistical Association, 97(460), 1090-1098.


Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008). "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, 9(Sep), 1981-2014. A shorter version appears in Proceedings of the 22nd Neural Information Processing Systems conference (NIPS 2008).
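To give a flavor of the first paper's approach (my paraphrase of the model, not the authors' exact notation): each node i is assigned a latent position z_i in a low-dimensional Euclidean space, and, given those positions, ties are conditionally independent, with the log-odds of a tie decreasing in latent distance:

    \operatorname{logit} P(y_{ij} = 1 \mid z_i, z_j, \alpha)
        = \alpha - \lVert z_i - z_j \rVert

Estimating the z_i is then an ordinary statistical inference problem, and plotting the estimated positions gives a model-based analogue of graph layout, with the loss of information quantified rather than hidden.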

I don't think their methods are yet practical, general, or comprehensive enough to be used universally, but they are quite promising. And now it is time for me to contribute something to this literature, instead of constantly picking at the defects of others... :)
