Tuesday, January 29, 2013

Configuring a Distributed Hadoop Cluster on Ubuntu 10.04

Configuring Hadoop on Ubuntu

Recommended number of hosts

1 for the NameNode
1 for the JobTracker & Secondary NameNode
3 for DataNodes & TaskTrackers

For this walkthrough I will use the following host names:
  • namenode.example.com
  • jobtracker.example.com
  • slave1.example.com, slave2.example.com, slave3.example.com

On all the hosts in the cluster, do the following.

Create the file /etc/apt/sources.list.d/cdh3.list and add the repo info:

deb http://archive.cloudera.com/debian lucid-cdh3u3 contrib

Then run:

sudo apt-get update
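
The repository step above could be scripted roughly as follows. This is only a sketch: the function name write_cdh3_repo and its optional path argument are illustrative and not part of CDH, and the default path assumes the standard apt location /etc/apt/sources.list.d/.

```shell
# write_cdh3_repo is a hypothetical helper: it writes the CDH3 apt source
# line from this post to a list file. The optional path argument is an
# assumption added so the function can be tried on a scratch file first.
write_cdh3_repo() {
    target="${1:-/etc/apt/sources.list.d/cdh3.list}"
    printf '%s\n' 'deb http://archive.cloudera.com/debian lucid-cdh3u3 contrib' > "$target"
}
```

Run as root, this would be used as: write_cdh3_repo && apt-get update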

You will also need to install sun-java6-jdk, sun-java6-jre & sun-java6-jvm.

On namenode.example.com run:

sudo apt-get install hadoop-namenode

On jobtracker.example.com:

sudo apt-get install hadoop-jobtracker hadoop-secondarynamenode

On slave{1,2,3}.example.com:

sudo apt-get install hadoop-datanode hadoop-tasktracker
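
If the same script is pushed to every host, the role-to-package mapping above can be expressed as a small lookup. This is a minimal sketch; packages_for_host is a hypothetical helper, and it assumes the host names and package names used in this post.

```shell
# packages_for_host is a hypothetical helper: it maps a host name to the
# packages this post installs on that host.
packages_for_host() {
    case "$1" in
        namenode.example.com)   echo "hadoop-namenode" ;;
        jobtracker.example.com) echo "hadoop-jobtracker hadoop-secondarynamenode" ;;
        slave*.example.com)     echo "hadoop-datanode hadoop-tasktracker" ;;
        *)
            # Unknown hosts are reported on stderr and signalled by exit status 1.
            echo "unknown host: $1" >&2
            return 1
            ;;
    esac
}
```

On each host it could then be used as: sudo apt-get install $(packages_for_host "$(hostname -f)")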


Configurations for namenode & jobtracker

$ cat /usr/lib/hadoop/conf/core-site.xml

Property Name : fs.default.name
Property Value : hdfs://namenode.example.com/


$ cat /usr/lib/hadoop/conf/hdfs-site.xml

Property Name : dfs.data.dir
Property Value : /grid/g1/hadoop-data/hadoop-${user.name}

Property Name : dfs.name.dir
Property Value : /grid/g1/grid-image1,/grid/g1/grid-image2



$ cat /usr/lib/hadoop/conf/mapred-site.xml


Property Name : mapred.job.tracker
Property Value : jobtracker.example.com:8021
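
The listings above give only names and values; in the actual site files each pair is a property element (with name and value children) inside a configuration element. The sketch below renders that layout from name=value arguments. render_site_xml is a hypothetical helper, and it does no XML escaping, so it assumes values without &, < or >.

```shell
# render_site_xml is a hypothetical helper: it prints a Hadoop site file
# for the name=value pairs passed as arguments. No XML escaping is done.
render_site_xml() {
    printf '%s\n' '<?xml version="1.0"?>' '<configuration>'
    for pair in "$@"; do
        name="${pair%%=*}"    # text before the first '='
        value="${pair#*=}"    # text after the first '='
        printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$name" "$value"
    done
    printf '%s\n' '</configuration>'
}
```

For example: render_site_xml 'fs.default.name=hdfs://namenode.example.com/' > core-site.xml. Single quotes keep the shell from expanding values such as ${user.name}.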

Configurations for datanode and tasktracker

$ cat /usr/lib/hadoop/conf/hdfs-site.xml



Property Name : dfs.data.dir
Property Value : /grid/g1/hadoop-data/hadoop-${user.name} (add extra disk locations as a comma-separated list)

$ cat /usr/lib/hadoop/conf/mapred-site.xml

Property Name : mapred.job.tracker
Property Value : jobtracker.example.com:8021

Property Name : mapred.tasktracker.map.tasks.maximum
Property Value : $number_of_maps (maximum concurrent map tasks, depending on your host config)

Property Name : mapred.tasktracker.reduce.tasks.maximum
Property Value : $number_of_reduces (maximum concurrent reduce tasks, depending on your host config)
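
How many slots to configure depends on the hardware. As one possible starting point, the sketch below derives both values from the core count. The sizing rule is purely an assumption of this sketch, not something prescribed by Hadoop or CDH, and suggest_slots is a hypothetical helper; tune the numbers for your memory and workload.

```shell
# suggest_slots is a hypothetical helper. Sizing rule (an assumption):
# leave one core for the DataNode and TaskTracker daemons, give the rest
# to map slots, and use half as many reduce slots (at least 1 of each).
suggest_slots() {
    cores="$1"
    maps=$((cores - 1))
    if [ "$maps" -lt 1 ]; then
        maps=1
    fi
    reduces=$((maps / 2))
    if [ "$reduces" -lt 1 ]; then
        reduces=1
    fi
    echo "mapred.tasktracker.map.tasks.maximum=$maps"
    echo "mapred.tasktracker.reduce.tasks.maximum=$reduces"
}
```

On a slave node the core count can be read from /proc/cpuinfo: suggest_slots "$(grep -c ^processor /proc/cpuinfo)"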

Configuration for the Secondary NameNode, to be placed in hdfs-site.xml


Property Name : dfs.secondary.http.address
Property Value :  jobtracker.example.com:50090

Property Name : dfs.http.address
Property Value : namenode.example.com:50070

Property Name : fs.checkpoint.dir
Property Value : /grid/g1/checkpoint,/grid/g2/checkpoint

Property Name :  fs.checkpoint.edits.dir
Property Value : /grid/g1/checkpoint/edits,/grid/g2/checkpoint/edits

General Commands & Steps


On namenode

sudo mkdir -p /grid/g1/hadoop-data/; sudo chown hdfs /grid/g1/hadoop-data
sudo mkdir -p /grid/g1/grid-image1 /grid/g1/grid-image2
sudo chown -R hdfs /grid/g1/grid-image1  /grid/g1/grid-image2
sudo -u hdfs hadoop namenode -format
sudo /etc/init.d/hadoop-namenode start

sudo -u hdfs hadoop fs -mkdir /user/mapred
sudo -u hdfs hadoop fs -mkdir /user/hdfs

sudo -u hdfs hadoop fs -chown hdfs /user/hdfs
sudo -u hdfs hadoop fs -chown mapred /user/mapred

On jobtracker

sudo mkdir -p /grid/g1/checkpoint/edits /grid/g2/checkpoint/edits
sudo chown -R hdfs /grid/g1/checkpoint /grid/g2/checkpoint
sudo /etc/init.d/hadoop-*jobtracker start
sudo /etc/init.d/hadoop*secondarynamenode start

On slave nodes

sudo mkdir -p /grid/g1/hadoop-data/
sudo chown -R hdfs /grid/g1/hadoop-data
sudo /etc/init.d/hadoop*datanode start
sudo /etc/init.d/hadoop-tasktracker start


* Check the logs on each host to confirm that its daemons have come up.
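
A quick way to do that log check is to count the lines that report problems. This is a minimal sketch: count_log_problems is a hypothetical helper, and the log location in the usage line (/var/log/hadoop) is an assumption that may differ on your install.

```shell
# count_log_problems is a hypothetical helper: it prints how many lines of
# a log file contain ERROR or FATAL. grep -c exits non-zero when the count
# is 0, so "|| true" keeps the function usable under "set -e".
count_log_problems() {
    grep -c -E 'ERROR|FATAL' "$1" || true
}
```

For example: for f in /var/log/hadoop/*.log; do echo "$f: $(count_log_problems "$f")"; done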

You should now be able to reach the web UIs:

http://namenode.example.com:50070
http://jobtracker.example.com:50030
http://jobtracker.example.com:50090 for secondary namenode
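
The three UIs can also be probed from a shell instead of a browser. The sketch below assumes curl is installed; check_ui and cluster_uis are hypothetical helper names, and the URLs are the ones listed in this post.

```shell
# check_ui is a hypothetical helper: it reports whether a web UI answers
# an HTTP request within 5 seconds. It assumes curl is available.
check_ui() {
    if curl -s -f -o /dev/null --max-time 5 "$1"; then
        echo "UP   $1"
    else
        echo "DOWN $1"
    fi
}

# cluster_uis prints the UI addresses used in this post, one per line.
cluster_uis() {
    echo "http://namenode.example.com:50070"
    echo "http://jobtracker.example.com:50030"
    echo "http://jobtracker.example.com:50090"
}
```

Usage: cluster_uis | while read -r url; do check_ui "$url"; done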
