tag:blogger.com,1999:blog-12817497779118824422024-03-13T21:10:30.421-07:00\sum_{i,j,k} d_{ijk}_straStupid Statistics student's mumbo jumboAnonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.comBlogger17125tag:blogger.com,1999:blog-1281749777911882442.post-39594047411560472672013-08-18T13:27:00.004-07:002013-08-18T13:27:45.298-07:00Analogy between Software Development and Stochastic Gradient Descent<div dir="ltr" style="text-align: left;" trbidi="on">
<br />When I am developing software, I feel as if I am executing a stochastic gradient descent algorithm myself. You start with a large step size: you define a lot of important classes, and everything is very flexible at that point. Then, as the number of lines in your code grows, your step size shrinks: you make more local changes than global ones (e.g., "let's change the signature of this function so that I can pass this variable..."). But just as in stochastic approximation, it is difficult and costly to get a global estimate of how good your current solution is, so you have to make decisions based on a local observation: the current feature request from your boss.<br /> <br /> Sometimes I feel I am stuck in a local optimum and rewrite everything from scratch to find a better solution, but usually, when the new implementation is finally done, I realize it is not much better than the previous one. Similar things happen in stochastic gradient descent as well: although it is a local method, I have rarely seen it reach a significantly better solution when run again. And you've spent twice the time by re-executing it!<br /> <br /> Also, the step-size schedule is very, very important. You need to decay it at the right rate. So in SGD you test the schedule on a sub-sample; in software engineering you develop prototypes.<br /> <br /> You should've inferred at this point that I am a crappy software engineer. Yes, I do suck.</div>
Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-41591206867623778702013-07-06T17:23:00.002-07:002013-07-06T17:23:50.938-07:00How to waste time<div dir="ltr" style="text-align: left;" trbidi="on">
The reason I wasted two sweet Saturday afternoon hours: 1) CMake 2.8.10 has a bug in which removing CMakeCache.txt does not completely remove cached values from the previous run. Attempting -DCMAKE_CXX_COMPILER=icpc even results in an infinite loop! (Seriously?) 2) The Intel compiler does not support the override keyword.<br />
<br />
At this point, I really want to abandon CMake. I have cumulatively spent at least (literally!) a week figuring out why my CMake file is not working on each new system I want to deploy my stuff on. But what's the alternative? bjam? Seriously?</div>
Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-42348272286299157102013-07-04T16:24:00.001-07:002013-07-04T16:29:14.298-07:00decltype extravaganza<div dir="ltr" style="text-align: left;" trbidi="on">
OK, I thought C++11 was cool, but I am ending up writing code like this:<br />
<br />
<div style="text-align: left;">
<br /></div>
<br /></div>
<script src="https://gist.github.com/bikestra/5930813.js"></script>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-67734383720575136472013-06-20T00:20:00.002-07:002013-06-20T00:20:45.530-07:00Simple Model + Small Data is the only way to go?<div dir="ltr" style="text-align: left;" trbidi="on">
A question that bothers me:<br />
<br />
1. simple model + small data: OK, if you have a good understanding of the generative process.<br />
2. complex model + small data: most likely fails due to overfitting, unless the noise is so small that the system is almost deterministic.<br />
3. simple model + big data: computing parameter estimates is challenging, though sometimes tractable using techniques like SGD, but frequentist hypothesis tests almost always reject, because with that much data even a slight oversimplification of the problem becomes detectable.<br />
4. complex model + big data: computing parameter estimates is practically impossible.<br />
<br />
Conclusion: a good statistician should work only with 1. simple model + small data!?</div>
Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-33474489443328510692013-06-13T22:52:00.001-07:002013-06-13T22:52:11.955-07:00The Signal and the Noise<div dir="ltr" style="text-align: left;" trbidi="on">
At last, I have finished reading the book 'The Signal and the Noise: Why So Many Predictions Fail - but Some Don't'. It does a very good job of explaining, in plain English (that is, without statistical jargon), 1) why it is important for us to understand uncertainty - the central theme of statistics -, 2) why statistical analysis is so challenging, and 3) how we can (sometimes) improve a model. The book shines most in its careful selection of the problems it discusses; baseball, earthquakes, the stock market, chess, and terrorism are very good examples which show different aspects of the 'prediction problem'.<br />
<br />
I would strongly recommend this book to those who are interested in understanding why people are making such a big fuss about Big Data/machine learning/statistics. As Larry Wasserman pointed out in his blog post, however, its treatment of frequentist statistics is very unfair... and I feel very uncomfortable every time a Bayesian claims that Bayesian statistics is a magic bullet for every problem frequentist statistics has. But this is a pop science book after all... probably that was a necessary sacrifice to deliver the idea to non-academics.</div>
Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-53143511344435262282012-01-13T08:19:00.001-08:002012-01-13T08:19:55.760-08:00<span class="Apple-style-span" style="background-color: white; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; line-height: 12px;"><a href="http://arxiv.org/pdf/1201.2590v1" rel="nofollow nofollow" style="color: #3b5998; cursor: pointer; text-decoration: none;" target="_blank">http://arxiv.org/pdf/<wbr></wbr><span class="word_break" style="display: inline-block;"></span>1201.2590v1</a> William M. Briggs - 'It is Time to Stop Teaching Frequentism to Non-statisticians'</span><br />
<span class="Apple-style-span" style="background-color: white; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; line-height: 12px;"><br /></span><br />
<span class="Apple-style-span" style="background-color: white; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; line-height: 12px;">Every time a Bayesian attacks the fallacies of frequentism, I agree with their points. But that does not necessarily imply that Bayesian methods are 'better' than frequentist ones and should replace them. Bayesians have problems of their own, and those cannot be solved by blaming frequentists. I really like Bayesian ideas and methods... but I hate it when some Bayesians oversell their results - "Frequentists are all wrong, while Bayesian methods are perfect" is clearly an overstatement.</span>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-23030594932812210262011-07-12T13:57:00.000-07:002011-07-12T13:57:07.896-07:00Defining Google+ circles<span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">I have not been able to start using G+ actively, since it took me quite a while to come up with a nice definition of circles, that is </span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">1) MECE (Mutually Exclusive, Collectively Exhaustive)</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">2) Fits into memory - no more than 5 groups!</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">3) Sizes of circles are well-balanced</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">Inspired by sociologist Mark Granovetter's idea of the 'weak tie', I think it is now done quite elegantly:</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">1) Koreans - who are not annoyed because I am posting in Korean</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">1-A) Korean Strong Tie - those with whom I can talk about international matters</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">1-B) Korean Weak Tie</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">2) International Friends - those who don't read Korean</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"># 2-A) International Strong Tie</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"># 2-B) International Weak Tie</span><br />
<span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">Currently 2) is kinda small relative to 1), thus 2) is not yet sub-divided into 2-A) and 2-B).<br /></span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">3) Family</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;"><br />
</span><span class="Apple-style-span" style="font-family: Arial, sans-serif; font-size: 13px; line-height: 18px;">But still, I need yet another rule to decide on which occasions I should use Facebook/Twitter/G+. Man, using SNS is quite a burden!</span>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-82866806343201435682011-07-12T05:54:00.000-07:002011-07-12T05:55:46.196-07:00How to send search requests to Journal WebsitesWhen I search for papers, I simply use Google search. Sometimes I use Google Scholar, but oddly enough, the former usually gives me better-quality results. But in some areas of study, people use journal websites directly, maybe because the terminology they use is quite general and they want to exclude non-academic websites. (Maybe they don't like Google? Actually, I don't really know why :D)<br />
<div><br />
</div><div>However, you do not want to visit every journal website for every single query. If there are ten journals in your area of study, visiting all ten journal websites would consume a lot of your precious time. So a friend of mine wants to make a web page which can send search requests to any journal website she wants to use.</div><div><br />
</div><div>If the request is done with GET, it is easy to figure out what parameters you need. For example, if you search "graphs" on Google, the address bar of your browser shows the following URL:</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">http://www.google.de/search?sourceid=chrome&ie=UTF-8&q=graphs</span></div><div><br />
</div><div>So you can simply substitute the "graphs" part with the word of your choice to send a search query to Google. But when you search for 'graphs' on the APS (American Physical Society) website, it simply shows</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">http://publish.aps.org/search</span></div><div><br />
</div><div>because the request is done with POST. There are, I think, two ways of finding out the parameters used in a POST request. The first is to look at the HTML source of the page that submits the search. But HTML source is usually very unreadable, so reading it is not much fun. The other way is usually more efficient: look at the request that is actually sent. You may use fancy network monitoring tools, but Google Chrome actually suffices.</div><div><br />
</div><div>To do this, click the 'Wrench' icon next to the address bar, go to the 'Tools' menu, and turn on the developer tools. A panel then appears at the bottom of the browser. Choose the 'Network' tab, and you can see something like the following:</div><div><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-Z0iKesPR6xc/ThxCfOp0FMI/AAAAAAAABY0/Om3SFl4jp1o/s1600/Screen+shot+2011-07-12+at+2.47.24+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-Z0iKesPR6xc/ThxCfOp0FMI/AAAAAAAABY0/Om3SFl4jp1o/s1600/Screen+shot+2011-07-12+at+2.47.24+PM.png" /></a></div><div> From 'Form Data', you can see which parameters were requested in 'POST' message. In this case, it seems that 'q%5Bclauses%5D%5B%5D%5Bfield%5D' denotes the type of the field, and 'q%5Bclauses%5D%5B%5D%5Bvalue%5D' is the value of the field. Thus, by using the following URL:</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">http://publish.aps.org/search/query?q%5Bclauses%5D%5B%5D%5Bfield%5D=abstitle&q%5Bclauses%5D%5B%5D%5Bvalue%5D=<span class="Apple-style-span" style="color: red;">graphs</span></span></div><div><br />
</div><div>you can search for 'graphs' on the APS website. Similarly,</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">http://jcp.aip.org/search?key=JCPSA6&societykey=AIP&coden=JCPSA6&q=<span class="Apple-style-span" style="color: red;">graphs</span>&displayid=AIP&sortby=newestdate&faceted=faceted&sortby=newestdate&CP_Style=false&alias=&searchzone=2</span></div><div><br />
</div><div>Oh, in this case the search was done with GET...!?</div><div><br />
</div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">http://pubs.acs.org/action/doSearch?action=search&searchText=<span class="Apple-style-span" style="color: red;">graphs</span>&qsSearchArea=searchText&type=within&publication=40001010</span></div><div><br />
</div><div>Oh... this one was also done with GET... So there was only one case done with POST... Why did I start writing this article in the first place... OTL Anyway, this is how to do it.</div>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-58017292205289545872011-07-05T07:35:00.000-07:002011-07-05T07:35:25.183-07:00Using tuple as a key of unordered_map in boostYes, this is a very trivial thing, but (surprisingly) none of the sources on the web gave me a direct solution. The solution is straightforward in hindsight, but it took me a tremendous amount of time to figure out, so I would like to leave a memo for future reference.<br />
<br />
<ol><li>When you define hash_value(), it is very important that the function is in the same namespace as the key class.</li>
<li>boost::tuple is in namespace boost::tuples. (Why!?!?)</li>
</ol><div>So you have to include the following code:</div><div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br />
</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">typedef tuple&lt;int, int, int&gt; param_tuple;</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;"><br />
</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">namespace boost {</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">  namespace tuples {</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">    std::size_t hash_value(param_tuple const& e) {</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">      std::size_t seed = 0;</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">      boost::hash_combine( seed, e.get<0>() );</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">      boost::hash_combine( seed, e.get<1>() );</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">      boost::hash_combine( seed, e.get<2>() );</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">      return seed;</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">    }</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">  }</span></div><div><span class="Apple-style-span" style="font-family: Verdana, sans-serif;">}</span></div></div>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-5386914092942205172011-02-10T14:17:00.000-08:002011-02-10T14:17:37.807-08:00Using ROC CurvesAs a homework assignment for my machine learning course, I'm implementing a spam filter with a Naive Bayes classifier. To evaluate its performance, I wanted to use an ROC curve, but I was unsure how to use it in a <i>proper</i> way. So I found the following tutorial extremely useful:<br />
<br />
<a href="http://www.cs.bris.ac.uk/~flach/ICML04tutorial/">http://www.cs.bris.ac.uk/~flach/ICML04tutorial/</a><br />
<br />
It has a lot of material, maybe a little too much since I'm not that committed to the theory of ROC curves for now, but I should definitely return to it later, as it is surely of practical importance.<br />
<br />
My labmate Nguyen Cao told me about an R package for ROC curves. I haven't taken a serious look at it yet, but it looks pretty nice. Here is the link to the website:<br />
<br />
<a href="http://rocr.bioinf.mpi-sb.mpg.de/">http://rocr.bioinf.mpi-sb.mpg.de/</a><br />
<br />
There are a lot of things to learn...! :DAnonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-46322425689820016182011-02-08T17:30:00.001-08:002011-02-08T17:52:47.980-08:00Installed doxygen + doxymacs<span class="Apple-style-span" style="font-family: inherit; line-height: 20px;"><span class="Apple-style-span">I'm trying to learn to use Doxygen.</span><br />
<br />
<a href="http://www.stack.nl/~dimitri/doxygen/" style="text-decoration: none;">http://www.stack.nl/~dimitri/doxygen/</a><br />
<br />
<span class="Apple-style-span">This way I hope I can learn to document my code better :)</span><br />
<br />
<span class="Apple-style-span">The following page helped me install doxymacs, an Emacs plug-in for Doxygen.</span><br />
<span class="Apple-style-span">(Actually, all I had to do was type </span><span class="Apple-style-span">apt-get install doxymacs</span><span class="Apple-style-span">)</span><br />
<a href="http://emacs-fu.blogspot.com/2009/01/commenting-your-functions.html" style="text-decoration: none;">http://emacs-fu.blogspot.com/2009/01/commenting-your-functions.html</a></span><br />
<span class="Apple-style-span" style="font-family: inherit;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: inherit;">Then I ran into a problem: my e-mail address was not being displayed correctly.</span><br />
<span class="Apple-style-span" style="font-family: inherit;">I think this is because I didn't configure my Ubuntu properly, but at least I found that</span><br />
<span class="Apple-style-span" style="font-family: inherit;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: inherit;"><span class="Apple-style-span" style="line-height: 19px;"></span></span><br />
<pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;"> (setq user-mail-address "my@email.com")</span></pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">sets the user-mail-address variable, and the value propagates to doxymacs-user-mail-address.</span></pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;">As a Statistics Ph.D. student, I felt obliged to know how to document R code as well (although R is not my favorite language for scientific computation). The solution is Roxygen, found at the following project page:</pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"></pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><a href="http://roxygen.org/">http://roxygen.org/</a></pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"></pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;">It's good that ESS (Emacs Speaks Statistics) supports Roxygen! Actually, I've never done anything ambitious in R (just homework problems), but I will definitely try Roxygen when the appropriate time comes!</pre><pre class="example" style="line-height: 13px; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 0px; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px;"><span class="Apple-style-span" style="font-family: inherit;">
</span></pre>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-7577528468508938722010-09-07T19:36:00.000-07:002010-09-07T19:36:20.858-07:00What does 'probability' mean in Statistics?<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;">I majored in Industrial Engineering and Mathematics when I was an undergraduate student: I changed my major to Statistics when I came to graduate school. So frankly I have no philosophy in Statistics: I'm in the process of "building" it.</span></span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;"><br />
</span></span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;">I've taken courses mainly on mathematical statistics/probability/computational statistics. Those courses only talk about theoretically well-defined problem settings, so although they helped me refine my mathematical skills, they didn't really help me build the philosophy. Now I'm </span><span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;">taking 'STAT525: Intermediate Statistical Methodology', and it makes me think more about the foundations of Statistics.</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;"><br />
</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;">When I first heard about Bayesian Statistics as an undergraduate, it didn't appeal much to me, since the Bayesian notion of "personal" probability didn't make much sense. In most cases, nobody knows which prior to use. Then what does the posterior probability mean?</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;"><br />
</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;">But it seems that even frequentists are not that rigorous in interpreting <i>probability</i> when it comes to the <i>application</i> of their theory (which should be the ultimate goal of all statistical research). </span><span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;">When using linear regression models, we all know that no error is really Gaussian, so every goodness-of-fit test rejects once the number of data points is sufficiently large. However, we still use the usual regression models, since they are "robust" to deviations from normality: our estimates and confidence intervals may not vary too much under mild conditions. How can we interpret <i>probabilities</i>, then? For example, what does a 95% confidence interval mean when it does not contain the true parameter with 95% frequency?</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: 13px;"><br />
</span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;">Are Statisticians really in a position to blame data mining/machine learning people because the outcomes of their methods do not have clear probabilistic interpretations? We know that the SVM is asymptotically a median classifier. Don't most statistical methods rely only on asymptotic results, especially when analyzing practical problems?</span></span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;"><br />
</span></span><br />
<span class="Apple-style-span" style="color: #333333; font-family: 'lucida grande', tahoma, verdana, arial, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;">How "objective" can a statistical analysis be? How much "objectivity" is required? Can I answer these questions before I get my Ph.D. degree?</span></span>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-57858040698665424952010-07-13T23:19:00.000-07:002010-07-13T23:19:26.472-07:00Statistical Research and Ethics<span class="Apple-style-span" style="font-family: 'Lucida Grande'; font-size: small;"><span class="Apple-style-span" style="font-size: 11px;"><br />
</span></span>In my opinion, Statistics is an attempt to draw a line between "speakable"s and "unspeakable"s. "Is there really a relationship between X and Y?" Usually you cannot say "yes" or "no" for sure, but there is always something left for you to "speak". Statistical theory is about what is left.<div><br />
<div>In this respect, statistical analysis is an ethical deed, because speaking of the unspeakable is unethical. This is why I'm infuriated when some people blur this "line" between the two instead of making it clearer. I'm sick of people making exaggerations (especially with those "social network" things) about the data they have in hand; that is why I became a Statistician.<div><div><br />
</div><div>Along the same line, speaking of "unspeakable"s by merely introducing a prior distribution whose validity you're not sure of is an unethical deed. One should be very careful when interpreting the outcome of a Bayesian analysis. When we have no idea what the prior distribution should be, we also have no idea how the posterior distribution can be interpreted. But it's really tempting to "intentionally" ignore the fact that the selection of the prior was ad hoc.</div><div><br />
</div><div>Hence it is natural to favor a framework of statistical inference in which this distinction between "speakable" and "unspeakable" is clear. This is why there are many more people using frequentist methods than Bayesian methods. However, the Neyman-Pearson framework, the dominant statistical procedure used in hypothesis testing, puts so much emphasis on the null hypothesis that in many cases there is little left to say about the alternative hypotheses. For example, a low p-value indicates that the null hypothesis is not a good model for your data, but it does not necessarily mean there is a good model among the models of the alternative hypothesis. This limitation sometimes makes people make statements about the "unspeakable"s of the alternative hypothesis, which is also unethical.</div><div><br />
</div><div>Maybe no statistical procedure can be perfect. We conduct statistical analysis because there are uncertainties. If you can speak with 100% certainty, then it is not statistics; maybe you're talking about Mathematics (although, as you might know, you cannot always speak with 100% certainty even in Mathematics). But you should also agree that there are "speakable"s in data, because those accumulated "speakable"s have built the science that aids us every day. So we should investigate how to develop procedures with which we can easily avoid speaking the "unspeakable" while there is still much left to "speak".</div><div><br />
</div><div>In this respect, I like Liu's theory (and framework) of statistical inference, since it makes statistical statements clear about what is speakable and what is not. So it is safer than other frameworks against committing an unethical deed.</div><div><br />
</div><div><a href="http://www.stat.purdue.edu/~chuanhai/research.html">http://www.stat.purdue.edu/~chuanhai/research.html</a></div><div><br />
</div><div>Shamefully, I'm a novice at Statistics, so I'm not yet able to judge the usefulness or impact of what his group has done. However, I like that they are doing research in this direction, since this is what we 'should' do: helping people avoid committing "unethical" deeds. I'm not saying we have a solution now. I'm saying we have something to do, and we're doing it.</div></div></div></div>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-68302690816473377462010-07-02T00:34:00.000-07:002010-07-02T00:34:53.192-07:00My position on Graph VisualizationActually, I opened this blog again to write this article: while posting a sequence of tweets, I thought I should rather make my point clear here.<br />
<br />
Here, I would like to share some of my thoughts about Graph Visualization.<br />
<br />
What is graph visualization? Take a look at the homepage of people who are quite good at it.
Just look at those images! No need to read the details!<br />
<br />
<a href="http://www.graphviz.org/">http://www.graphviz.org/</a><br />
<br />
See? When we have data about entities and interactions among them, it's useful to visualize it as a graph.<br />
<br />
Why is it useful? We human beings are trained to discover the 'structure' of whatever we are looking at. For example, when you look at a desk, you can easily figure out how one part is physically and functionally connected to the others. By visualizing a graph, we can apply that same natural-born ability to data analysis. Take a look at the images at the URL above, and try to think of a way to deliver the same information more effectively or economically. I hardly believe there is one, except in some trivial circumstances.<br />
<br />
<br />
<br />
No matter how attractive the fruit of the task is, it is a notoriously difficult problem. There are several ways of visualizing a graph, but the most common one tries to make the distance between two nodes in the drawing proportional to their geodesic (shortest-path) distance on the original graph. That is, we want two nodes to be placed closer together if they are closer on the original graph, and vice versa. If we have <i>n</i> nodes, there are on the order of <i>n</i>^2 pairwise distances (<i>n</i>(<i>n</i>-1)/2 distinct pairs), so in general no drawing in 2- or 3-dimensional Euclidean space can satisfy every constraint. Therefore what we do is, as in most problems of science and engineering, approximation: doing as well as we can under some criterion.<br />
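A minimal sketch of that idea, assuming classical multidimensional scaling on the matrix of graph geodesic distances (the 6-node cycle and all variable names here are my own illustration, not anything from this post):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Adjacency matrix of a 6-node cycle (unit-length edges).
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

# n x n matrix of geodesic (shortest-path) distances on the graph.
D = shortest_path(A, unweighted=True)

# Classical MDS: double-center the squared distances, keep the top-2
# eigenvectors; this is the best 2-D approximation in a least-squares sense.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, v = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:2]
coords = v[:, idx] * np.sqrt(np.maximum(w[idx], 0))  # one 2-D point per node

print(coords.shape)  # prints: (6, 2)
```

For this cycle the layout comes out as a hexagon, so adjacent nodes land closer together than antipodal ones; on a large irregular graph the same procedure has to distort many of the n^2 constraints, which is exactly the difficulty described above.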
<br />
The site manager of <a href="http://www.graphviz.org/">http://www.graphviz.org</a> clearly understands what we look for when we visualize a graph. When the graph is small, there are fewer constraints to satisfy, and the structure (what we are after) is much simpler than in large graphs. I don't think the GraphViz implementation is any less scalable than others. However, some people simply apply the same algorithms, designed for visualizing small graphs, to large ones, and the result looks like this: <a href="http://nodexl.codeplex.com/">http://nodexl.codeplex.com/</a>. Again, just take a look at the images of those complex graphs. What structure can you find in them? In my opinion, there are only two cases: 1) the graph is so complex that you cannot find any structure, or 2) you can find a structure, but it is misleading, because the loss of information is not uniform: certain parts of the information are overly removed to "invent" the structure.<br />
<br />
Well, I don't want to offend the developers of NodeXL. Maybe that is the main reason I am writing a blog post instead of a string of 140-character tweets. I haven't looked into the fine details of the software, but I believe the algorithms implemented in NodeXL are the conventional ones used in other graph-visualization applications, and a fresh implementation behind a widely used, convenient interface like Microsoft Excel is something a lot of people have been waiting for. I do believe their contribution is significant. The visualization of large graphs, however, is not what their algorithms were designed for, and they do a really poor job on the graphs shown on their website. It may attract, for a while, some of those who are new to this area and have little experience making real "use" of such graphs, but nothing more than that.<br />
<br />
I have never seen a meaningful visualization of a graph with &gt;1000 nodes, unless it has some special structure. For example, think of a graph with a circular structure: it can easily be drawn in 2 dimensions. Likewise, planar graphs and graphs with fractal structure are much easier to visualize than others. In other cases, however, I believe the task is hopeless.<br />
<br />
To make it worse, people want more than the preservation of geodesic distances in their drawings. They want to discover clustered-ness, bipartite-ness (if present), and much else from the output of a graph-visualization algorithm, since all of that is quite doable when you are working with small graphs.<br />
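One hedged example of being more specific about what we ask for (the function name and the two example graphs below are my own sketch; the underlying fact is standard spectral graph theory): some of those structural properties can be computed directly rather than read off a drawing. For a connected graph, the adjacency spectrum is symmetric about zero exactly when the graph is bipartite.

```python
import numpy as np

def looks_bipartite(A, tol=1e-9):
    # A connected graph is bipartite iff its adjacency eigenvalues are
    # symmetric about 0, i.e. lambda_min == -lambda_max.
    w = np.linalg.eigvalsh(A)          # eigenvalues in ascending order
    return abs(w[0] + w[-1]) < tol

# 4-cycle: bipartite. Triangle: not bipartite.
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
K3 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)

print(looks_bipartite(C4), looks_bipartite(K3))  # prints: True False
```

A check like this answers one precise question reliably, where a drawing of a large graph answers many questions vaguely.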
<br />
Then why do the NodeXL people show visualizations of large graphs at all, when they do such a poor job on them? Because in many data-analysis projects, the graph you have to deal with is much bigger than anything graph-visualization algorithms can handle effectively.<br />
<br />
So there is a task to be done which cannot easily be done. In such a case, good scientists make clear what their contribution is (what they could do) and what the limitations of their work are (what they could not do). Bad scientists blur the two, which I believe is highly unethical. Sadly, it is not difficult to find such unethical work, and the images on the NodeXL website remind me of a lot of it. NodeXL is merely a tool (actually an excellent one), and a tool is not ethical or unethical by itself. But it is questionable whether it will be used in good ways, when even its own website does not demonstrate how to make good use of it.<br />
<br />
<br />
<br />
However, it is not easy to criticize people who make misleading visualizations of graphs, since there is no good alternative yet. A professor once told me, "do not blame a method before you have an alternative," and that applies here. But at the least, we have to seek ways to improve.<br />
<br />
In my view, merely applying algorithms that work well on small graphs to large graphs is hopeless, as I keep repeating. One reason is that we look for too many things in a single image; we have to be more specific about our goal to get meaningful work done. That is why I am studying Statistics, rather than Computer Science or Numerical Optimization, to make my contribution. Methods like the following look promising in this respect:<br />
<br />
<span class="Apple-style-span" style="font-size: small;">Hoff, P.D., Raftery, A.E., and Handcock, M.S. (2002) "Latent Space Approaches to Social Network Analysis"<br />
<em>Journal of the American Statistical Association </em>, vol. 97, no. 460, 1090-1098.<br />
<a href="http://www.blogger.com/Abstracts/tr399.txt">abstract </a>, <a href="http://www.stat.washington.edu/www/research/reports/2001/tr399.ps">postscript </a>, <a href="http://www.stat.washington.edu/www/research/reports/2001/tr399.ps">pdf </a>.</span><br />
<br />
<span class="Apple-style-span" style="font-family: 'Times New Roman', Times;">E. Airoldi, D. Blei, S. Fienberg, and E. P. Xing, <span style="font-weight: bold;"><a href="http://www.sailing.cs.cmu.edu/pdf/2008/airoldijmlr08.pdf">Mixed Membership Stochastic Blockmodel</a></span><span style="font-family: 'Times New Roman', Times;"><b>,</b><i> Journal of Machine Learning Research, 9(Sep):1981--2014, 2008.</i></span><span style="font-family: 'Times New Roman', Times;"><span style="font-weight: bold;"><a href="http://www.cs.cmu.edu/~epxing/papers/2008/paper90_final.pdf">A shorter version</a></span></span> of this paper appears in <i>Proceeding of the 22nd Neural Information Processing Systems, </i><b>(NIPS 2008)</b>.</span><br />
<br />
I don't think their works are practical / general / comprehensive enough to be used universally, but they are quite promising. And now is the time for me to contribute something to the literature, instead of constantly picking defects of others... :)Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-61639449680653926282010-07-01T23:08:00.000-07:002010-07-01T23:08:44.191-07:00Switching the blog "again"I'm an ignorant first-year Ph.D student who knows virtually nothing, so there isn't much for me to share with others. That's why I have more blogs than my posts: I simply tested how good a service was. At the last time, my impression on blogger.com was pretty bad. For a Korean user, every UI looked unintuitive. Now I discover a lot of improvements, however, and another important reason: I hate my former-ID/nickname I used before! It was too childish! But wordpress.com did not allow me to change my domain, so I had to switch to the service where it was possible... :DAnonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0tag:blogger.com,1999:blog-1281749777911882442.post-49469914497140747692007-07-13T04:59:00.000-07:002007-07-13T05:29:28.824-07:00Birthday gift to my brother<a href="http://3.bp.blogspot.com/_j3ENaA8G5V4/Rpds1fhcW5I/AAAAAAAAAAM/Y-KPaJe7NYk/s1600-h/DSC02619.JPG"><img id="BLOGGER_PHOTO_ID_5086653970388900754" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_j3ENaA8G5V4/Rpds1fhcW5I/AAAAAAAAAAM/Y-KPaJe7NYk/s320/DSC02619.JPG" border="0" /></a><br /><p>I have bought my brother an USB memory stick for his birthday gift that passed two weeks ago :-(</p><p>I had been having no interest of this sort, so I was surprised to see that this 4GB, SLC one costs only about $60(maybe cheaper outside Korea)!</p><p>Now it is pretty possible that some kind of 'system' can be installed on it. 
A MIDI working environment, for example, with his favorite sound sources. Or a quite mobile DB server with some kind of information like his/her own pictures? It may be quite searchable then(you know, most modern DBs support images to be inserted as a field). And while we're at it, it may contain some web server(it can be as light as calc.exe! :D) for user-friendly interface to manage data in it.</p><p>Brave new world!</p>Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com1tag:blogger.com,1999:blog-1281749777911882442.post-53237175742294820842007-07-11T17:56:00.000-07:002007-07-11T22:41:00.808-07:00While Dual/Quad core CPUs dominate...I'm reading 'An Algorithm for Subgraph Isomorphism(Ullmann, 1974)'<br />which was referenced by 'Network Motif Discovery Using Subgraph<br />Enumeration and Symmetry-Breaking(Grochow and Kellis)'.<br /><br /><br /><br />It's a bit interesting to read this line:<br /><br />'A parallel asynchronous logic-in-memory implementation of a<br />vital part of this algorithm is also described, although this hardware has not<br />actually been built. The hardware implementation would allow<br />very rapid determination of isomorphism'<br /><br /><br />Now 30 years have passed,<br />and Dual-Core and even Quad-Core CPUs are largely in use.<br />Multi-thread programming is universal nowadays.<br /><br /><br />I don't know J. R. Ullmann personally,<br />how old he is... what he is doing now...<br /><br />but I wonder he is living a better life now.Anonymoushttp://www.blogger.com/profile/15068873717319367751noreply@blogger.com0