Friday, January 13, 2012
http://arxiv.org/pdf/1201.2590v1 William M. Briggs - 'It is Time to Stop Teaching Frequentism to Non-statisticians'
Every time a Bayesian attacks the fallacies of frequentism, I agree with their points. But that does not necessarily imply that the Bayesian approach is 'better' than the frequentist method and should replace it. Bayesians have their own problems, and those cannot be solved by blaming frequentists. I really like Bayesian ideas and methods... but I hate it when some Bayesians oversell their results - "Frequentists are all wrong, while Bayesian methods are perfect" is clearly an overstatement.
Tuesday, July 12, 2011
Defining Google+ circles
I have not been able to start using G+ actively, since it took me quite a while to come up with a nice definition for my circles, one that is:
1) MECE (Mutually Exclusive, Collectively Exhaustive)
2) Fits into memory - no more than 5 groups!
3) Sizes of circles are well-balanced
Inspired by sociologist Mark Granovetter's idea of 'weak ties', I think it is now done quite elegantly:
1) Koreans - those who are not annoyed by my posting in Korean
1-A) Korean Strong Tie - those with whom I can discuss international matters
1-B) Korean Weak Tie
2) International Friends - those who don't read Korean
# 2-A) International Strong Tie
# 2-B) International Weak Tie
3) Family
Currently 2) is kinda small relative to 1), thus 2) is not yet sub-divided into 2-A) and 2-B).
But still, I need yet another rule to decide on which occasions I should use Facebook/Twitter/G+. Man, using SNS is quite a burden!
How to send search requests to Journal Websites
When I search for papers, I simply use Google search. Sometimes I use Google Scholar, but oddly enough, the former usually gives me better-quality results. But in some areas of study, people use journal websites directly, maybe because the terminology they use is quite general, so they want to exclude non-academic websites. (Maybe they don't like Google? Actually I don't really know why :D)
However, you do not want to visit every journal website for every single query. If there are ten journals in your area of study, then visiting all ten journal websites would consume a lot of your precious time. So a friend of mine wants to make a web page which can send a search request to any journal website she wants to use.
If the request is done with GET, it is easy to figure out which parameters you need. For example, if you search for "graphs" in Google, the address bar of your browser shows the following URL:
http://www.google.de/search?sourceid=chrome&ie=UTF-8&q=graphs
So you can simply substitute the "graphs" part with the word of your choice to send a search request to Google. But when you search for 'graphs' on the APS (American Physical Society) website, it simply shows
http://publish.aps.org/search
because the request is done with POST. There are two ways of finding out the parameters used in a POST request, I think. The first is to look at the HTML source code of the web page which sends the search request. But HTML source code is usually very unreadable, so reading it is not much fun. The other way is usually more efficient: look at the request that was actually sent. You may use fancy network monitoring tools, but it actually suffices to use Google Chrome.
To do this, click on the 'Wrench' icon next to the address bar, go to the 'Tools' menu, and turn on the developer tools. Then a fancy panel appears at the bottom of the browser. Choose the 'Network' tab, and you can see something like the following:
From 'Form Data', you can see which parameters were sent in the 'POST' message. In this case, it seems that 'q%5Bclauses%5D%5B%5D%5Bfield%5D' denotes the type of the field, and 'q%5Bclauses%5D%5B%5D%5Bvalue%5D' is the value of the field. Thus, by using the following URL:
http://publish.aps.org/search/query?q%5Bclauses%5D%5B%5D%5Bfield%5D=abstitle&q%5Bclauses%5D%5B%5D%5Bvalue%5D=graphs
you can search for 'graphs' on the APS website. Similarly,
http://jcp.aip.org/search?key=JCPSA6&societykey=AIP&coden=JCPSA6&q=graphs&displayid=AIP&sortby=newestdate&faceted=faceted&sortby=newestdate&CP_Style=false&alias=&searchzone=2
Oh, in this case the search was done with GET...!?
http://pubs.acs.org/action/doSearch?action=search&searchText=graphs&qsSearchArea=searchText&type=within&publication=40001010
Oh... it was also done with GET... So there was only one case that used POST... Why did I start writing this article in the first place... OTL Anyways, this is how to do it.
Tuesday, July 5, 2011
Using tuple as a key of unordered_map in boost
Yes, this is a very trivial thing, but (surprisingly) none of the sources on the web gave me the direct solution. Although the solution is straightforward after all, it took me a tremendous amount of time to figure it out, so I would like to leave a memo for future reference.
- When you define hash_value(), it is very important that the function is in the same namespace as the key class.
- boost::tuple actually lives in the namespace boost::tuples. (Why!?!?)
So you have to include code along the following lines (for concreteness, I assume here a tuple of three ints):

#include <boost/tuple/tuple.hpp>
#include <boost/functional/hash.hpp>

typedef boost::tuples::tuple<int, int, int> param_tuple;

namespace boost {
namespace tuples {

std::size_t hash_value(param_tuple const& e) {
  std::size_t seed = 0;
  boost::hash_combine(seed, e.get<0>());
  boost::hash_combine(seed, e.get<1>());
  boost::hash_combine(seed, e.get<2>());
  return seed;
}

}  // namespace tuples
}  // namespace boost
Thursday, February 10, 2011
Using ROC Curves
As a homework assignment for a machine learning course, I'm implementing a spam filter with a Naive Bayes classifier. To evaluate its performance, I wanted to use an ROC curve, but I was unsure of how to use it properly. I found the following tutorial extremely useful:
http://www.cs.bris.ac.uk/~flach/ICML04tutorial/
It has a lot of material, maybe a little too much since I'm not that committed to the theory of ROC curves for now, but I should definitely return to it since it is surely of practical importance.
My labmate Nguyen Cao told me about an R package for ROC curves. I haven't taken a serious look at it yet, but it looks pretty nice. Here is the link to its website:
http://rocr.bioinf.mpi-sb.mpg.de/
There are a lot of things to learn...! :D
Tuesday, February 8, 2011
Installed doxygen + doxymacs
I'm trying to learn to use Doxygen.
http://www.stack.nl/~dimitri/doxygen/
This way I hope I can learn to document my code better :)
The following page helped me install doxymacs, an Emacs plug-in for Doxygen.
(Actually, all I had to do was type apt-get install doxymacs.)
http://emacs-fu.blogspot.com/2009/01/commenting-your-functions.html
Then I ran into the problem that my e-mail address was not displayed correctly.
I think this is because I didn't configure my Ubuntu properly, but at least I found that
(setq user-mail-address "my@email.com")
will set the user-mail-address variable, which then propagates to doxymacs-user-mail-address.
As a statistics Ph.D. student, I felt obliged to know how to document R code (although R is not my favorite language for scientific computation). The solution is to try Roxygen; see the following project page:
http://roxygen.org/
It's good that ESS (Emacs Speaks Statistics) supports Roxygen! Actually I've never done anything ambitious in R (just homework problems), but I will definitely try Roxygen when the appropriate time comes!
Tuesday, September 7, 2010
What does 'probability' mean in Statistics?
I majored in Industrial Engineering and Mathematics as an undergraduate, and I changed my major to Statistics when I came to graduate school. So frankly, I have no philosophy of Statistics yet: I'm in the process of "building" one.
I've taken courses mainly on mathematical statistics, probability, and computational statistics. Those courses only deal with theoretically well-defined problem settings, so although they helped me refine my mathematical skills, they didn't really help me build that philosophy. Now I'm taking 'STAT525: Intermediate Statistical Methodology', and it makes me think more about the foundations of Statistics.
When I first heard about Bayesian Statistics as an undergraduate, it didn't appeal much to me, since the Bayesians' notion of "personal" probability didn't make much sense. In most cases, nobody knows which prior to use. Then what does a posterior probability mean?
But it seems that even frequentists are not that rigorous in interpreting probability when it comes to applying their theory (which should be the ultimate goal of every statistical research program). When using linear regression models, we all know that no error is really Gaussian, so every goodness-of-fit test fails once the number of data points is sufficiently large. However, we still use the usual regression models since they are "robust" to deviations from normality: our estimates and confidence intervals may not vary too much under mild conditions. How can we interpret probabilities, then? For example, what does a 95% confidence interval mean when it does not contain the true parameter 95% of the time?
Are statisticians really in a position to blame data mining/machine learning people because the outcomes of their methods do not have clear probabilistic interpretations? We know that the SVM is asymptotically a median classifier. Don't most statistical methods rely only on asymptotic results, especially when analyzing practical problems?
How "objective" can a statistical analysis be? How much "objectivity" is required? Can I answer these questions before I get my Ph.D. degree?