Data-Intensive Text Processing with MapReduce

Lately I have been learning MapReduce programming on clusters such as Hadoop. While there are many books that teach how to get a cluster up and running, resources that teach advanced MapReduce algorithms are much harder to find.

One excellent resource for learning these algorithms is Data-Intensive Text Processing with MapReduce (free download) by Lin and Dyer (2010). The book presents algorithms for computing word co-occurrence frequencies, building an inverted index, and computing PageRank, to name just a few. All of the algorithms are presented as pseudo-code.

In this series of posts I will share the code I have written to implement these algorithms on Hadoop. Each post describes the implementation of one algorithm. I assume you can follow basic Java without extensive explanation.
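
To give a sense of the shape these implementations will take, here is a minimal sketch of a complete Hadoop job, the classic word count, written against the org.apache.hadoop.mapreduce API. This is only a sketch: the class names are illustrative and the driver assumes the Hadoop 2.x Job.getInstance call. It is not code from the repository, just the basic mapper, reducer, and driver structure that any such implementation shares.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Emits (word, 1) for every token in the input line.
      public static class TokenMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Sums the counts emitted for each word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: wires the mapper and reducer together and submits the job.
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        // Summing is associative and commutative, so the reducer
        // can double as a combiner.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The algorithms in the book mostly vary the mapper and reducer logic (and sometimes the key types and partitioning) around this same frame.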

The posts also assume you have a Hadoop installation that you can compile against and run jobs on. If you are running Mac OS X, you can install Hadoop in about five minutes.
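
For example, one quick route, assuming you use Homebrew, is a single command:

    $ brew install hadoop

This gives you a local single-node installation; running in pseudo-distributed mode takes a little additional configuration.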

While this is not intended to be a beginner's introduction to programming for Hadoop (there are many great online resources for that), I do explain my reasoning whenever my code deviates from the defaults. The algorithms themselves come directly from the book described above; please use it as a reference to understand the algorithmic choices.

Code can be downloaded from the repository on GitHub. Once you have a local copy, simply run mvn package in the top-level directory. The repository is also an Eclipse project and can be imported the usual way.
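
For example, assuming you have Git and Maven installed (the placeholders below stand for the address and directory shown on the repository's GitHub page):

    $ git clone <repository-url>
    $ cd <repository-directory>
    $ mvn package

For a typical Maven layout, mvn package compiles the code and builds a jar under target/ that you can submit with hadoop jar.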

Let's get into the code!