Using the Hadoop book without an Amazon account

I'm working through some of the examples in the O'Reilly book "Hadoop: The Definitive Guide". One of the first things I noticed is that getting the data set the author uses requires a cluster running on Amazon AWS. I don't have one (I've got a small cluster at home), so I had no way to get the data. I've therefore downloaded all of the data, transformed it, and uploaded it to my site so that you don't have to track it down yourself.

Note that the files total approximately 32 GB, so the download may take a while. Note also that this data has been decompressed, reorganized, and re-compressed to match the final format used by the Hadoop book. As a result, the scripts shown in the Hadoop book will not work with this data.

The easiest way to download these files is with the following command:

$ wget --mirror
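With a download this large, it may be worth checking that each archive arrived intact before loading anything into Hadoop. The following is only a minimal sketch. It assumes the mirrored files are gzip archives with a .gz extension, which the re-compression mentioned above suggests but which you should confirm for yourself. The name check_archives is a hypothetical helper, not something from the book.

```shell
#!/bin/sh
# Hypothetical helper: test every .gz file in a directory with `gzip -t`
# and report the ones that fail. Assumes the data files end in .gz.
check_archives() {
    dir="${1:-.}"
    bad=0
    for file in "$dir"/*.gz
    do
        # Skip the unexpanded pattern when the directory has no .gz files
        [ -e "$file" ] || continue
        if gzip -t "$file" 2>/dev/null
        then
            echo "ok      $file"
        else
            echo "CORRUPT $file"
            bad=$((bad + 1))
        fi
    done
    # The function's exit status is non-zero if any archive failed the test
    [ "$bad" -eq 0 ]
}
```

Running `check_archives .` from the download directory prints one line per archive. Any file reported as CORRUPT can be downloaded again before you go any further.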

Because these files are already in their final format, loading them into Hadoop requires a different upload process from the one explained in the Hadoop book. To upload them, run the following commands from the same directory you were in when you ran the wget command above:

$ for file in *
> do
>     filename=$(basename "$file")
>     "$HADOOP_INSTALL"/bin/hadoop fs -put "$file" "gz/$filename"
> done
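Before committing to a 32 GB upload, it can help to see exactly which commands the loop above would run. The following is a rough sketch of a dry run. The name plan_upload is a hypothetical helper, and unlike the loop above it skips directories, which is my own assumption about what you would want.

```shell
#!/bin/sh
# Hypothetical helper: print the upload command for each regular file in a
# directory instead of running it, so the plan can be reviewed first.
# The gz/ target matches the loop above; HADOOP_INSTALL must already be set.
plan_upload() {
    dir="${1:-.}"
    for file in "$dir"/*
    do
        # Assumption: only regular files are wanted; directories are skipped
        [ -f "$file" ] || continue
        filename=$(basename "$file")
        echo "$HADOOP_INSTALL/bin/hadoop fs -put $file gz/$filename"
    done
}
```

Running `plan_upload .` only prints text and never contacts the cluster. Once the list looks right, run the real loop to perform the upload.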

In case you're interested, the script I wrote to download everything is here.

