Wednesday, January 5, 2011

Setting up CDH3 Hadoop on my new Macbook Pro

A New Machine 
I'm fortunate enough to have recently received a Macbook Pro, 2.8 GHz Intel dual core, with 8GB RAM.  This is the third time I've turned a vanilla mac into a ninja coding machine, and following my design principle of "first time = coincidence, second time = annoying, third time = pattern", I've decided to write down the details for the next time.

Baseline
This section details the pre-hadoop installs I did.

Java
Previously I was running Leopard (10.5) and had to install SoyLatte to get a current version of Java. In Snow Leopard, the Java 6 JDK (1.6.0_22) is installed by default. That's good enough for me, for now.
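
To double-check what the box is actually running (the exact update level will vary):

java -version
/usr/libexec/java_home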

Gcc, etc.
In order to get these on the box, I had to install Xcode, making sure to check the 'UNIX Development' option.

MacPorts
I installed MacPorts in case I needed to upgrade any native libs or tools.
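
Nothing to configure beyond the install itself; syncing the ports tree before grabbing anything is the only step worth remembering:

sudo port selfupdate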

Eclipse
I downloaded the 64 bit Java EE version of Helios.

Tomcat
Tomcat is part of my daily fun, and these instructions for installing tomcat6 were helpful. One thing to note: in order to access the Tomcat manager panel, you also need to specify

<role rolename="manager"/>

prior to defining

<user username="admin" password="password" roles="standard,manager,admin"/>
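
Putting those together, a minimal conf/tomcat-users.xml looks something like this (the password is obviously a placeholder):

<tomcat-users>
  <role rolename="standard"/>
  <role rolename="manager"/>
  <role rolename="admin"/>
  <user username="admin" password="password" roles="standard,manager,admin"/>
</tomcat-users>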

Also, I run Tomcat standalone (no httpd), so the mod_jk install part didn't apply. Finally, I chose not to daemonize Tomcat because this is a dev box, not a server, and the instructions for compiling and using jsvc for 64-bit sounded iffy at best.
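
Without the daemon, it's just the stock scripts to start and stop, run from the Tomcat install dir:

./bin/startup.sh
./bin/shutdown.sh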

Hadoop
I use the CDH distro. The install was amazingly easy, and their support rocks. Unfortunately, they don't have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo mac node. This is what I want my mac to have (for starters):
  1. distinct processes for namenode, job tracker node, and datanode/task tracker nodes.
  2. formatted HDFS
  3. Pig 0.8.0
I'm not going to try to auto-start Hadoop because (again) this is a dev box, and start-all.sh should handle bringing up the JVMs for the namenode, job tracker, and datanode/tasktracker.

I am installing CDH3 because I've been running it in pseudo-distributed mode on my Ubuntu dev box for the last month and have had no issues with it. Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.

All of the CDH3 tarballs can be found at http://archive.cloudera.com/cdh/3/, and damn, that's a lot of tarballs.

I downloaded hadoop-0.20.2+737, (currently) the latest version out there. Because this is my new dev box, I decided to forgo the usual security-motivated setup of a dedicated hadoop user. When this decision comes back to bite me, I'll be sure to update this post. In fact, for ease of permissions/etc., I decided to install under my home dir, under a CDH3 dir, so I could group all CDH3-related installs together. I symlinked the hadoop-0.20.2+737 dir to hadoop, and I'll update it if CDH3 updates their version of Hadoop.
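
The resulting layout, for the record (dir names are just my convention, and the tarball name should match whatever version you actually grabbed):

cd ~/CDH3
tar xzf hadoop-0.20.2+737.tar.gz
ln -s hadoop-0.20.2+737 hadoop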

After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.
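
In .profile that's a couple of exports; HADOOP_HOME gets used again when Pig comes along:

export HADOOP_HOME=~/CDH3/hadoop
export PATH=$HADOOP_HOME/bin:$PATH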

Pseudo Mode Config
I'm going to set up Hadoop in pseudo-distributed mode, just like I have on my Ubuntu box. Unlike the Debian/Red Hat CDH distros, where this is an apt-get or yum command away, I need to set up the conf files on my own.

Fortunately, the example-confs subdir of the Hadoop install has a conf.pseudo subdir. I needed to modify the following in core-site.xml:

 <property>
     <name>hadoop.tmp.dir</name>
     <value>changed_to_a_valid_dir_I_own</value>
 </property>

and the following in hdfs-site.xml:

 <property>
     <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
     <name>dfs.name.dir</name>
     <value>changed_to_a_different_dir_I_own</value>
 </property>
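
For reference, conf.pseudo also points everything at localhost. The key properties (from memory, so double-check against the shipped files) are fs.default.name in core-site.xml, mapred.job.tracker in mapred-site.xml, and dfs.replication in hdfs-site.xml:

 <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:8020</value>
 </property>

 <property>
     <name>mapred.job.tracker</name>
     <value>localhost:8021</value>
 </property>

 <property>
     <name>dfs.replication</name>
     <value>1</value>
 </property>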

I also had to create masters and slaves files in the example-confs/conf.pseudo directory:

echo localhost > masters
echo localhost > slaves

Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf:

mv ./conf ./install-conf
ln -sf ./example-confs/conf.pseudo conf
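
With the conf linked in, the first format/start/sanity-check cycle is the standard one; jps should show the five daemons (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker):

hadoop namenode -format
start-all.sh
jps
hadoop fs -ls /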

Pig
Installing Pig is as simple as downloading the tar, setting up the path, and going... sort of. The first time I ran Pig, it tried to connect to the default CDH install location of Hadoop, /usr/lib/hadoop-0.20/. I made sure HADOOP_HOME pointed to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).
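
With HADOOP_HOME already exported in .profile, Pig setup was just more of the same (I symlinked the untarred dir to ~/CDH3/pig to match the hadoop layout):

export PATH=~/CDH3/pig/bin:$PATH

A quick smoke test from the grunt shell, listing the HDFS root:

pig
grunt> ls /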

More To Come
This pseudo node install was relatively painless. I'm going to continue installing Hadoop/HDFS-based tools that may need more (HBase) or less (Hive) configuration, and will cover them in successive posts.