In this tutorial I will describe the steps for setting up a multi-node (five node) Hadoop cluster running on Ubuntu. I have created this setup inside a VMware ESXi box. There are five virtual machines, all with the same configuration: 20GB HDD, 512MB RAM, etc. I will be using Ubuntu 14.04 LTS as the operating system and Hadoop 1.2.1. The machines are distinguished by their names, which correspond to the namenode, secondary namenode, datanode1, datanode2 and datanode3, so it is a five node hadoop cluster.
MY LAB SETUP
Node Name          | Hostname | IP Address
Namenode           | nn       | 10.2.0.195
Secondary Namenode | nn2      | 10.2.0.196
Datanode1          | dn1      | 10.2.0.197
Datanode2          | dn2      | 10.2.0.198
Datanode3          | dn3      | 10.2.0.199
Install Ubuntu 14.04 LTS on all the nodes and update/upgrade them.
To pull the newest package lists, execute the command below.
user@nn:~$ sudo apt-get update
To pull upgrades for the OS and packages, execute the command below.
user@nn:~$ sudo apt-get upgrade
Reboot all the machines.
NOTE: A few configurations need to be done on all nodes. I will show them on a single node, as the commands and steps are the same on every other node.
HADOOP USER & GROUP CONFIGURATION
Create a hadoop group.
user@nn:~$ sudo groupadd hadoop
Create a hadoop user “huser” and place the user in the hadoop group.
user@nn:~$ sudo useradd -m -d /home/huser -g hadoop huser
Set the password for the user “huser”.
user@nn:~$ sudo passwd huser
By default, Ubuntu does not create users with sudo permissions, so we need to grant sudo permissions to the “huser” user by adding the line below to /etc/sudoers.
user@nn:~$ sudo chmod 740 /etc/sudoers
user@nn:~$ sudo vi /etc/sudoers
huser ALL=(ALL:ALL) ALL
NOTE: User and group configuration has to be done on all nodes.
OPENSSH SERVER INSTALLATION
Hadoop requires openssh-server to create a passwordless ssh environment for managing hadoop services from the namenode to the datanodes. The namenode seldom initiates communication with the datanodes; instead, the datanodes communicate with the namenode, sending heartbeats and block reports from which the namenode builds its block locations. Datanode-to-namenode communication is based on simple socket communication. We will run the ssh sessions as the hadoop user “huser” that we created earlier.
user@nn:~$ su - huser
NOTE: Log in as the “huser” user on all nodes, as we will be performing all the tasks as huser from now on, on all nodes.
huser@nn:~$ sudo apt-get install openssh-server
PASSWORDLESS SSH CONFIGURATION
huser@nn:~$ ssh-keygen -t rsa
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@nn2
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn1
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn2
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn3
NOTE: Verify the passwordless ssh environment from the namenode to all datanodes as the “huser” user.
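For example, the following should log you into dn1 without a password prompt (type exit to return):
huser@nn:~$ ssh dn1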
OPENJDK 7 INSTALLATION
I will be using the openjdk-7-jdk package, which installs easily.
huser@nn:~$ sudo apt-get install openjdk-7-jdk
NOTE: OpenJDK installation has to be done on all nodes. After the installation, verify it by running the command below.
huser@nn:~$ java -version
HADOOP INSTALLATION
The hadoop package can be downloaded from the Apache mirror locations. I copied the hadoop-1.2.1.tar.gz file into the “huser” home folder on all nodes.
Unzip the downloaded file in your “huser” home directory itself.
huser@nn:~$ tar -xzvf hadoop-1.2.1.tar.gz
Now move it to the /usr/local location.
huser@nn:~$ sudo mv hadoop-1.2.1 /usr/local/hadoop
Assign the permissions in favor of user “huser”.
huser@nn:~$ sudo chown -R huser:hadoop /usr/local/hadoop/
NOTE: Installation of hadoop has to be done on all nodes.
USER ENVIRONMENT CONFIGURATION
Now we will configure the environment for the user “huser”. As we are already in the huser home directory, we append the variables below to the .bashrc file to set the user environment.
huser@nn:~$ vi .bashrc
###HADOOP ENVIRONMENT###
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
Reload the shell so that .bashrc takes effect.
huser@nn:~$ exec bash
To verify the defined hadoop variables, execute the command below to check the installed hadoop version.
huser@nn:~$ hadoop version
NOTE: The user environment has to be configured on all nodes.
It is very important to define the hosts in the /etc/hosts file on all nodes. Hence, I have copied the contents of my namenode's /etc/hosts file to all the datanodes and the secondary namenode.
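For reference, a sketch of the host entries based on my lab setup:
10.2.0.195    nn
10.2.0.196    nn2
10.2.0.197    dn1
10.2.0.198    dn2
10.2.0.199    dn3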
HADOOP CONFIGURATION
Finally we have arrived at the point where we begin the hadoop configuration. All the configuration files are present in the /usr/local/hadoop/conf directory.
Hadoop Environment - hadoop-env.sh
First of all we will edit the hadoop-env.sh file, which defines the overall runtime parameters for hadoop. Mainly we need to set the JAVA_HOME parameter to the location of the java installation. If IPv6 is enabled on your machines, set the IPv4 stack preference to true, as hadoop does not work over IPv6. I have also defined the hadoop log directory inside /var/log/ for my own convenience.
huser@nn:~$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_LOG_DIR=/var/log/hadoop-log
After defining the variables, create the hadoop-log directory under /var/log on all nodes with the required “huser” ownership and permissions.
huser@nn:~$ sudo mkdir /var/log/hadoop-log
huser@nn:~$ sudo chown -R huser:hadoop /var/log/hadoop-log
NOTE: hadoop-env.sh must be modified with the same parameters on all nodes in the cluster, and the /var/log/hadoop-log directory has to be created on all nodes with the required “huser” ownership and permissions.
Default Filesystem - core-site.xml
Next we move to the core-site.xml file, which points the datanodes to the namenode and tells them which port to connect to. It contains configuration properties for Apache Hadoop Core (such as I/O settings) that hadoop reads when starting up. These settings are common to HDFS and MapReduce.
Here we will configure a parameter called fs.default.name. This parameter is used by clients to access the HDFS filesystem, or more precisely the metadata for a requested file. fs.default.name is set to an HDFS URI of the form hdfs://<authority>:<port>; the URI identifies the filesystem, and the path specifies the location of a file or directory within it. As the name suggests, it is the default URI for all filesystem requests in hadoop: clients use it to find the default filesystem, and the namenode uses it to read the block addresses. It basically specifies the hostname and the bound port number of the HDFS filesystem, so it has to be defined on all nodes.
The next parameter that we will configure here is hadoop.tmp.dir. I have used this parameter to specify the base for temporary directories, both locally and in HDFS.
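A minimal core-site.xml along these lines should do; treat it as a sketch of my setup, where the hostname nn and port 10001 follow my lab configuration and hadoop.tmp.dir matches the directory created below.
huser@nn:~$ vi /usr/local/hadoop/conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn:10001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>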
Create the tmp directory as specified in the core-site.xml file.
huser@nn:~$ mkdir /usr/local/hadoop/tmp
NOTE: Copy this file to all nodes. We can use either the IP address or the hostname of the namenode, and any free port number; the commonly used port number is 9000, but I have chosen 10001. The hadoop.tmp.dir directory (/usr/local/hadoop/tmp) has to be created on all nodes.
MapReduce Framework - mapred-site.xml
This file is used to point all the task-trackers to the job-tracker. The parameter mapred.job.tracker sets the hostname or IP and the port of the job-tracker.
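A minimal mapred-site.xml sketch, assuming the job-tracker runs on the namenode host nn with my chosen port 10002:
huser@nn:~$ vi /usr/local/hadoop/conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>nn:10002</value>
  </property>
</configuration>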
NOTE: Copy this file to all nodes. Port number 10002 is my choice; it is not mandatory to use the same port number.
HDFS Configuration - hdfs-site.xml
The hdfs-site.xml file contains various parameters that configure HDFS. The parameters dfs.name.dir and dfs.data.dir define the configuration settings for the HDFS daemons: the namenode and the datanodes. dfs.name.dir defines the directory where the namenode stores its metadata, and dfs.data.dir defines the directory where the datanodes store their HDFS data blocks. The dfs.replication parameter defines the number of replicas to be made for each block of data that is created.
Since our cluster is a multi-node cluster, it is important to note that the hdfs-site.xml file differs between the namenode and the datanodes. The namenode will only carry the dfs.name.dir parameter, and the separate datanode machines will only carry the dfs.data.dir parameter. The dfs.replication parameter is common and the same for all nodes.
hdfs-site.xml configuration for namenode
huser@nn:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
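A sketch of the namenode's hdfs-site.xml; the dfs.name.dir path matches the directory created below, and the replication factor of 3 is my assumption (one replica per datanode), so adjust it to your needs.
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>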
As specified in the hdfs-site.xml file, we need to create the hdfs and name directories on the namenode with the appropriate “huser” ownership and permissions.
huser@nn:~$ sudo mkdir -p /hdfs/name
huser@nn:~$ sudo chown -R huser:hadoop /hdfs
hdfs-site.xml configuration for all datanodes
huser@dn1:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
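And a sketch of the datanodes' hdfs-site.xml, with dfs.data.dir matching the directory created below and the same assumed replication factor:
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>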
Similarly, on all datanodes we need to create the hdfs and data directories.
huser@dn1:~$ sudo mkdir -p /hdfs/data
huser@dn1:~$ sudo chown -R huser:hadoop /hdfs
NOTE: The hdfs-site.xml file is different on the namenode and the datanodes because I am not running a datanode daemon on the namenode; we are building a fully distributed cluster.
conf/masters
The masters file lists the host where the secondary namenode daemon will start.
huser@nn:~$ vi /usr/local/hadoop/conf/masters
nn2
NOTE: We have configured a separate machine for the secondary namenode, hence we define it explicitly here. There is no need to copy this file to all nodes.
conf/slaves
The slaves file contains the list of datanodes where the datanode and task-tracker daemons will run.
huser@nn:~$ vi /usr/local/hadoop/conf/slaves
dn1
dn2
dn3
NOTE: The file should contain one entry per line. We can also add the namenode and secondary namenode hostnames to this file if we want to run the datanode and task-tracker daemons on those machines too.
STARTING HADOOP CLUSTER
Formatting HDFS via the namenode
Before we start our cluster we will format HDFS via the namenode. Formatting the namenode initializes the directory specified by the dfs.name.dir parameter in hdfs-site.xml; after formatting, the “current” and “image” directories are created there. The datanode data directories (dfs.data.dir) are initialized when the datanodes first start and register with the namenode.
huser@nn:~$ hadoop namenode -format
NOTE: We need to format the namenode only the first time we set up the hadoop cluster. Formatting a running cluster will destroy all the existing data. After executing the format command, it will prompt for confirmation, where we need to type “Y” as it is case-sensitive.
Starting the Multi-Node Hadoop Cluster
A hadoop cluster can be started with the single command below.
huser@nn:~$ start-all.sh
This command first starts the HDFS daemons: the namenode daemon is started on the namenode, the datanode daemon is started on all datanodes, and the secondary namenode daemon is started on the secondary namenode. In the second phase it starts the MapReduce daemons: the job-tracker on the namenode and the task-trackers on the datanodes.
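If you prefer to start the layers separately, the HDFS and MapReduce daemons can also be started with their individual scripts:
huser@nn:~$ start-dfs.sh
huser@nn:~$ start-mapred.sh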
JPS (the Java Virtual Machine Process Status Tool) is used to verify that all the daemons are running on their respective nodes; we execute the “jps” command to see the running java processes.
Namenode java processes
Datanode java processes
Secondary namenode java processes
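For illustration, the jps output on the namenode should look roughly like the sketch below (the process IDs are just example values); the datanodes should list DataNode and TaskTracker, and the secondary namenode should list SecondaryNameNode.
huser@nn:~$ jps
2786 NameNode
2963 JobTracker
3117 Jps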
NOTE: Check for errors in the /var/log/hadoop-log/ directory in case any java process is missing.
Checking Ports Status
To check whether hadoop is listening on the configured ports, execute the command below.
huser@nn:~$ sudo netstat -ntlp | grep java
In the output we can see that the namenode is serving its web UI on port 50070 and the job-tracker its web UI on port 50030.
Checking HDFS Status
huser@nn:~$ hadoop dfsadmin -report
Configured Capacity: 14713540608 (13.7 GB)
Present Capacity: 13865287680 (12.91 GB)
DFS Remaining: 13865164800 (12.91 GB)
DFS Used: 122880 (120 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 10.2.0.199:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600 (4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:49 IST 2014

Name: 10.2.0.198:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600 (4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:49 IST 2014

Name: 10.2.0.197:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600 (4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:48 IST 2014
Checking HDFS Content
Hadoop provides Linux-like shell commands to check and manage the HDFS content.
Listing files in /
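For example, the root of the filesystem can be listed like this:
huser@nn:~$ hadoop fs -ls /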
Creating a directory
huser@nn:~$ hadoop fs -mkdir /user
Copying Files to HDFS
Before we run a mapreduce job, we will copy some text files from our local filesystem into HDFS.
huser@nn:~$ hadoop dfs -copyFromLocal ~/Books/ /user/huser/Books/
MONITORING
Both HDFS and MapReduce provide web UIs to browse HDFS, monitor the logs and track jobs.
HDFS: http://nn:50070
The DFSHealth page allows you to browse HDFS, monitor the namenode logs, view the nodes, check the space used, and so on.
MapReduce Job-Tracker: http://nn:50030
The job-tracker page provides information about running map and reduce tasks, running and completed jobs, and much more.
DECOMMISSIONING NODES
If we want to bring a datanode down for whatever reason, we have to decommission that datanode. To decommission a datanode we just have to create a file named excludes in the hadoop conf directory (/usr/local/hadoop/conf/) and mention the hostname of the datanode we want to decommission.
huser@nn:~$ vi /usr/local/hadoop/conf/excludes
dn3
Next we need to edit the core-site.xml file and point a property at the location of the excludes file.
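A sketch of that property setting; I am assuming the dfs.hosts.exclude property here, which tells HDFS where the excludes file lives:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/usr/local/hadoop/conf/excludes</value>
</property>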
Similarly we can have an includes file too. Now we can restart the namenode, or just refresh the configuration using the command below.
huser@nn:~$ hadoop dfsadmin -refreshNodes
We can see the decommissioned nodes under the “Decommissioning Nodes” section in the namenode web UI (http://nn:50070).
In this post we discussed creating a fully distributed multi-node Hadoop cluster on Ubuntu 14.04. I hope it was easy to understand and implement. In the coming posts we will discuss running a mapreduce job on our Hadoop cluster. Happy Hadooping!!!
Related Links:
Single-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Oracle Solaris 11 using Zones
Fully Distributed Hadoop Cluster - Automatic Failover HA Cluster with ZooKeeper & QJM
Fully Distributed Hadoop Federation Cluster
Fully Distributed Hadoop Cluster - Manual Failover HA with NFS
Fully Distributed Hadoop Cluster - Automatic Failover HA Cluster with ZooKeeper & NFS