This tutorial demonstrates how to set up an Apache Hadoop 1.2.1 cluster using Oracle Solaris 11.1 virtualization technology (Zones). I am running this setup inside Oracle VM VirtualBox 4.3.10 on Ubuntu 12.04, with the guest machine running Oracle Solaris 11. The namenode will run inside the global zone, and we will configure three guest zones with nearly identical configurations: one for the secondary namenode and two for datanodes.
LAB SETUP

| Nodename           | Zone        | Hostname | IP-Address |
|--------------------|-------------|----------|------------|
| Namenode           | Global Zone | nn       | 10.0.2.15  |
| Secondary Namenode | Guest Zone  | nn2      | 10.0.2.16  |
| Datanode1          | Guest Zone  | dn1      | 10.0.2.17  |
| Datanode2          | Guest Zone  | dn2      | 10.0.2.18  |
NOTE: A few configuration steps need to be performed on all nodes. I will show them for a single node; the commands and steps are the same on the other nodes.
HADOOP USER & GROUP CONFIGURATION
A Solaris installation is itself the “global zone”. Say you created a user named “user” during installation. We could use that same user to set up the Hadoop cluster, but for clarity we will use a dedicated user named “huser” for all Hadoop tasks.
Creating the hadoop group
user$ sudo groupadd hadoop
Creating a hadoop user “huser” and placing it in the hadoop group
user$ sudo useradd -m -d /export/home/huser -g hadoop huser
Setting the password for “huser”
user$ sudo passwd huser
By default, Oracle Solaris 11.1 does not create users with sudo permissions, so we need to grant them to “huser”.
user@nn:~$ sudo vi /etc/sudoers
huser ALL=(ALL:ALL) ALL
After adding the user, we continue our configuration logged in as “huser”.
user@nn:~$ su - huser
NOTE: User and group creation and sudo configuration are done here on the global zone, which will be the namenode. The same steps will be repeated on the other zones after the secondary namenode and datanode zones are configured.
HADOOP INSTALLATION ON GLOBAL ZONE (NAMENODE)
huser@nn:~$ sudo mkdir -p /usr/local/
huser@nn:~$ sudo chown -R huser:hadoop /usr/local/
huser@nn:~$ sudo tar -xzvf /tmp/hadoop-1.2.1.tar.gz -C /tmp
huser@nn:~$ sudo mv /tmp/hadoop-1.2.1 /usr/local/hadoop
NOTE: Hadoop is installed only on the namenode because Oracle Solaris 11 Zones let us mount and share the same directory across zones. That is why Hadoop must be installed before the secondary namenode and datanode zones are configured.
ZONES CONFIGURATION
Creating virtual network interfaces for each zone
huser@nn:~$ sudo dladm create-vnic -l net0 nn2
huser@nn:~$ sudo dladm create-vnic -l net0 dn1
huser@nn:~$ sudo dladm create-vnic -l net0 dn2
huser@nn:~$ dladm show-link
Next we will create a ZFS dataset for all zones.
huser@nn:~$ sudo zfs create -o mountpoint=/zonefs rpool/zonefs
NOTE: The ZFS dataset for zones must not be under the rpool/ROOT dataset or directly under the global zone's root ("/") dataset.
Creating the secondary namenode zone
huser@nn:~$ sudo zonecfg -z nn2
Use 'create' to begin configuring a new zone.
zonecfg:nn2> create
create: Using system default template 'SYSdefault'
zonecfg:nn2> set autoboot=true
zonecfg:nn2> set zonepath=/zonefs/nn2
zonecfg:nn2> add fs
zonecfg:nn2:fs> set dir=/usr/local/hadoop
zonecfg:nn2:fs> set special=/usr/local/hadoop
zonecfg:nn2:fs> set type=lofs
zonecfg:nn2:fs> set options=[ro,nodevices]
zonecfg:nn2:fs> end
zonecfg:nn2> add net
zonecfg:nn2:net> set physical=nn2
zonecfg:nn2:net> end
zonecfg:nn2> verify
zonecfg:nn2> commit
zonecfg:nn2> exit
Creating the datanode zones
huser@nn:~$ sudo zonecfg -z dn1
Use 'create' to begin configuring a new zone.
zonecfg:dn1> create
create: Using system default template 'SYSdefault'
zonecfg:dn1> set autoboot=true
zonecfg:dn1> set zonepath=/zonefs/dn1
zonecfg:dn1> add fs
zonecfg:dn1:fs> set dir=/usr/local/hadoop
zonecfg:dn1:fs> set special=/usr/local/hadoop
zonecfg:dn1:fs> set type=lofs
zonecfg:dn1:fs> set options=[ro,nodevices]
zonecfg:dn1:fs> end
zonecfg:dn1> add net
zonecfg:dn1:net> set physical=dn1
zonecfg:dn1:net> end
zonecfg:dn1> verify
zonecfg:dn1> commit
zonecfg:dn1> exit
huser@nn:~$ sudo zonecfg -z dn2
Use 'create' to begin configuring a new zone.
zonecfg:dn2> create
create: Using system default template 'SYSdefault'
zonecfg:dn2> set autoboot=true
zonecfg:dn2> set zonepath=/zonefs/dn2
zonecfg:dn2> add fs
zonecfg:dn2:fs> set dir=/usr/local/hadoop
zonecfg:dn2:fs> set special=/usr/local/hadoop
zonecfg:dn2:fs> set type=lofs
zonecfg:dn2:fs> set options=[ro,nodevices]
zonecfg:dn2:fs> end
zonecfg:dn2> add net
zonecfg:dn2:net> set physical=dn2
zonecfg:dn2:net> end
zonecfg:dn2> verify
zonecfg:dn2> commit
zonecfg:dn2> exit
All the zones that have been created can be listed with the command below.
huser@nn:~$ zoneadm list -cv
Installing zones
huser@nn:~$ sudo zoneadm -z nn2 install
huser@nn:~$ sudo zoneadm -z dn1 install
huser@nn:~$ sudo zoneadm -z dn2 install
Installing the zones takes time, depending on your hardware. Once installation completes, running zoneadm list -cv again shows all zones in the "installed" state.
Booting and configuring zones
huser@nn:~$ sudo zoneadm -z nn2 boot
huser@nn:~$ sudo zlogin -C nn2
huser@nn:~$ sudo zoneadm -z dn1 boot
huser@nn:~$ sudo zlogin -C dn1
huser@nn:~$ sudo zoneadm -z dn2 boot
huser@nn:~$ sudo zlogin -C dn2
NOTE: After the boot command, we can attach to each zone's console with the zlogin command. Here we complete the system configuration as shown in the “Lab Setup” table. Additionally, we specify the “user” user on the user configuration screen, without DNS and authentication, just as we configured the global zone during installation. Once the system configuration is done, we log in as “user” on all nodes.
Create the “huser” user and “hadoop” group on all nodes as done earlier in the global zone. If the installation and configuration were done differently from the global zone, make sure the “huser” user ID and hadoop group ID on the datanodes are identical to those on the namenode.
HOSTS FILE CONFIGURATION
It is necessary to populate the /etc/hosts file on every node with the IP addresses and hostnames of all the other nodes.
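Based on the Lab Setup table, the /etc/hosts entries would look like this on each node (a sketch; adjust if your addresses differ):

```shell
# /etc/hosts — identical entries on every node, taken from the Lab Setup table
10.0.2.15   nn
10.0.2.16   nn2
10.0.2.17   dn1
10.0.2.18   dn2
```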
PASSWORDLESS SSH CONFIGURATION
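The individual commands for this step are not shown in the text. A common approach on Hadoop 1.x is to generate an RSA key pair for “huser” on the namenode and copy the public key to every node; the hostnames below come from the Lab Setup table, and the availability of ssh-copy-id on your Solaris build is an assumption (appending the key to ~/.ssh/authorized_keys by hand works too):

```shell
# On the namenode, as huser: generate a key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the key locally, since the start scripts also SSH to the namenode itself
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Copy the public key to the other nodes (prompts for huser's password once per node)
ssh-copy-id huser@nn2
ssh-copy-id huser@dn1
ssh-copy-id huser@dn2
# Verify: this should log in without prompting for a password
ssh huser@nn2 exit
```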
OPENJDK 7 INSTALLATION
I will be using the developer/java/jdk-7 package, which installs easily.
huser@nn:~$ sudo pkg install --accept developer/java/jdk-7
NOTE: The JDK installation has to be done on all nodes.
USER ENVIRONMENT CONFIGURATION
Since we are already logged in as “huser”, we edit the .bashrc file in huser's home directory to set the user environment and paths.
huser@nn:~$ vi .bashrc
Reload the shell so the new environment takes effect
huser@nn:~$ exec bash
NOTE:
User environment has to be configured on all nodes to declare the essential
variables for both java and hadoop mainly.
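The .bashrc contents are not reproduced in the text; a minimal version consistent with the paths used in this setup might look like the following (the exact JAVA_HOME path is an assumption — check where the JDK landed after pkg install):

```shell
# ~/.bashrc for huser — Java and Hadoop environment (paths assumed from this setup)
export JAVA_HOME=/usr/java
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
```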
HADOOP CONFIGURATION
Here we begin the Hadoop configuration. All the configuration files are in the /usr/local/hadoop/conf directory.
NOTE: All configuration has to be done only on the namenode, because the global zone's "/usr/local/hadoop" directory is already shared and mounted in all the guest zones.
Hadoop Environment - hadoop-env.sh
huser@nn:~$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/java
export HADOOP_LOG_DIR=/var/log/hadoop-log
Create the "hadoop-log" directory specified in the HADOOP_LOG_DIR variable in hadoop-env.sh
huser@nn:~$ sudo mkdir /var/log/hadoop-log
huser@nn:~$ sudo chown -R huser:hadoop /var/log/hadoop-log
NOTE: The /var/log/hadoop-log directory has to be created on all nodes, with the required “huser” user and hadoop group ownership and permissions.
Default Filesystem - core-site.xml
The core-site.xml file points the datanodes at the namenode and tells them which port to connect to.
huser@nn:~$ vi /usr/local/hadoop/conf/core-site.xml
NOTE: We can use either the IP address or the hostname of the namenode, and any free port number. 9000 is commonly used, but I have chosen 10001. The “/tmp” directory has to be created only on the namenode global zone.
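The file contents are not reproduced in the text; given the hostname nn and port 10001 from the note above, a minimal core-site.xml for Hadoop 1.x would look roughly like this (the hadoop.tmp.dir value is an assumption based on the "/tmp" remark):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- URI of the namenode; datanodes and clients connect here -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn:10001</value>
  </property>
  <!-- Base for Hadoop's temporary files (assumed from the note above) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp</value>
  </property>
</configuration>
```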
MapReduce Framework – mapred-site.xml
This file points all the task-trackers at the job-tracker. The mapred.job.tracker parameter sets the hostname or IP address and port of the job-tracker.
huser@nn:~$ vi /usr/local/hadoop/conf/mapred-site.xml
NOTE:
Port number 10002 is my choice, it is not mandatory to use the same
port number.
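With the job-tracker on nn and port 10002 as chosen above, a minimal mapred-site.xml would look roughly like this:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Host and port of the job-tracker; task-trackers connect here -->
  <property>
    <name>mapred.job.tracker</name>
    <value>nn:10002</value>
  </property>
</configuration>
```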
HDFS Configuration – hdfs-site.xml
The hdfs-site.xml file contains various parameters that configure HDFS.
huser@nn:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
As specified in hdfs-site.xml, we need to create the "hdfs" and "name" directories on the namenode with the appropriate “huser” user and group ownership.
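The parameter values are not shown in the text; given the directories created below and the two datanodes in this setup, a minimal hdfs-site.xml might look like this (the replication factor of 2 is an assumption, one replica per datanode):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the namenode stores the filesystem image and edit log -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <!-- Where each datanode stores HDFS blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <!-- Number of block replicas (assumed: one per datanode) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```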
Namenode Configuration
huser@nn:~$ sudo mkdir -p /hdfs/name
huser@nn:~$ sudo chown -R huser:hadoop /hdfs
All Datanodes Configuration
huser@dn1:~$ sudo mkdir -p /hdfs/data
huser@dn1:~$ sudo chown -R huser:hadoop /hdfs
NOTE:
Repeat the datanodes configuration steps in all datanode zones.
conf/masters
The masters file tells Hadoop where the secondary namenode daemon should start.
huser@nn:~$ vi /usr/local/hadoop/conf/masters
nn2
NOTE: We have configured a separate machine for the secondary namenode, hence we define it explicitly here.
conf/slaves
The slaves file contains the list of datanodes, where the datanode and task-tracker daemons will run.
huser@nn:~$ vi /usr/local/hadoop/conf/slaves
dn1
dn2
NOTE: The file should contain one entry per line. We can also add the namenode and secondary namenode hostnames here if we want the datanode and task-tracker daemons to run on those machines too.
STARTING HADOOP CLUSTER
Formatting HDFS via the namenode
Before starting the cluster, we format HDFS via the namenode. Formatting initializes the directory specified by the dfs.name.dir parameter in hdfs-site.xml; after formatting, the "current", "image" and "previous.checkpoint" directories are created under it on the namenode.
huser@nn:~$ hadoop namenode -format
NOTE: We need to format the namenode only the first time we set up the cluster; formatting a running cluster will destroy all existing data. The format command prompts for confirmation, where we must type an uppercase "Y" as the prompt is case-sensitive.
Starting the Multi-Node Hadoop Cluster
The whole cluster can be started with the single command below.
huser@nn:~$ start-all.sh
This command first starts the HDFS daemons: the namenode daemon on the namenode, the datanode daemon on all datanodes, and the secondary namenode daemon on the secondary namenode. In the second phase it starts the MapReduce daemons: the job-tracker on the namenode and the task-trackers on the datanodes.
Execute the jps command on every node to see the running Java processes:
"NameNode" and "JobTracker" on the namenode,
"SecondaryNameNode" in the secondary namenode zone, and
"DataNode" and "TaskTracker" in the datanode zones.
MONITORING HADOOP
Both HDFS and MapReduce provide management web UIs for browsing HDFS and monitoring logs and jobs.
HDFS: http://nn:50070
The DFSHealth page lets you browse HDFS, read the namenode logs, view the nodes, check the space used, and more.
MapReduce Job-Tracker: http://nn:50030
The job-tracker page shows running map and reduce tasks, running and completed jobs, and much more.
This completes the Multi-Node Hadoop Cluster on Oracle Solaris 11 Zones.
Enjoy!!!
Follow the below links to create:
Single-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Ubuntu 14.04