Thursday, June 05, 2014

Multi-Node Hadoop Cluster On Ubuntu 14.04

In this tutorial I will describe the steps for setting up a multi-node (five-node) Hadoop cluster running on Ubuntu. I have created this setup inside a VMware ESXi box. There are five virtual machines, all with the same basic configuration: 20GB HDD, 512MB RAM, etc. I will be using Ubuntu 14.04 LTS as the operating system and Hadoop 1.2.1. The machines are distinguished by their names, which correspond to the namenode, secondary namenode, datanode1, datanode2 and datanode3.

MY LAB SETUP
Node Name            Hostname   IP Address
Namenode             nn         10.2.0.195
Secondary Namenode   nn2        10.2.0.196
Datanode1            dn1        10.2.0.197
Datanode2            dn2        10.2.0.198
Datanode3            dn3        10.2.0.199

Install Ubuntu 14.04 LTS in all the nodes and update/upgrade it.
To refresh the package lists, execute the command below.
user@nn:~$ sudo apt-get update

To upgrade the OS and installed packages, execute the command below.
user@nn:~$ sudo apt-get upgrade

Reboot all the machines.

NOTE: A few configurations need to be done on all nodes. I will show them for a single node, as the commands and steps are the same for all the others.

HADOOP USER & GROUP CONFIGURATION
Create a hadoop group
user@nn:~$ sudo groupadd hadoop

Create a hadoop user “huser” and place it in the hadoop group
user@nn:~$ sudo useradd -m -d /home/huser -g hadoop huser

Set the password for the user “huser”
user@nn:~$ sudo passwd huser

By default, Ubuntu does not create users with sudo permissions, so we need to grant “huser” sudo rights. Edit /etc/sudoers with visudo, which validates the syntax before saving, and add the line below.
user@nn:~$ sudo visudo
huser ALL=(ALL:ALL) ALL


NOTE: User and group configuration has to be done in all nodes.

OPENSSH SERVER INSTALLATION
Hadoop requires openssh-server to create a passwordless SSH environment so that Hadoop services on the datanodes can be managed from the namenode. The namenode itself seldom initiates communication with the datanodes; instead, the datanodes contact the namenode to send heartbeats and block reports, from which the namenode builds its block-location map. This datanode-to-namenode communication uses simple socket connections. The SSH sessions will run as the Hadoop user, “huser”, that we created earlier.

user@nn:~$ su - huser

NOTE: Login as “huser” user on all nodes as we will be performing all the tasks as huser from now onwards on all nodes.

huser@nn:~$ sudo apt-get install openssh-server

PASSWORDLESS SSH CONFIGURATION
huser@nn:~$ ssh-keygen -t rsa
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@nn2
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn1
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn2
huser@nn:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@dn3

NOTE: Verify the passwordless ssh environment from namenode to all datanodes as “huser” user.

OPENJDK 7 INSTALLATION
I will be using openjdk-7-jdk package which installs easily.
huser@nn:~$ sudo apt-get install openjdk-7-jdk

NOTE: Openjdk installation has to be done on all nodes. After the installation verify the java installation by running the below command.
huser@nn:~$ java -version


HADOOP INSTALLATION
The Hadoop package can be downloaded from the Apache download mirrors. I copied the hadoop-1.2.1.tar.gz file into the “huser” home folder on every node.
Extract the downloaded file in your “huser” home directory itself.
huser@nn:~$ tar -xzvf hadoop-1.2.1.tar.gz

Now move it to /usr/local location.
huser@nn:~$ sudo mv hadoop-1.2.1 /usr/local/hadoop

Assign the permissions in favor of user “huser”.
huser@nn:~$ sudo chown -R huser:hadoop /usr/local/hadoop/

NOTE: Installation of hadoop has to be done on all nodes.

USER ENVIRONMENT CONFIGURATION
Now we will configure the environment for user “huser”. As we are already logged in as “huser”, we append the variables below to the .bashrc file in the home directory to set the user environment.

huser@nn:~$ vi .bashrc
###HADOOP ENVIRONMENT###
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin


Reload the shell so the new variables take effect.
huser@nn:~$ exec bash

To verify the defined hadoop variables, execute the command below to check the installed hadoop version.

huser@nn:~$ hadoop version


NOTE: User environment has to be configured on all nodes.

It is very important to define all the hosts in the /etc/hosts file on every node. I copied the /etc/hosts entries from the namenode to all the datanodes and the secondary namenode.
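Based on the lab setup table above, the /etc/hosts file on every node would look roughly like the sketch below (if Ubuntu added a 127.0.1.1 line mapping the machine's own hostname, remove it so the hostname resolves to the LAN address):

```text
127.0.0.1   localhost

# Hadoop cluster nodes (from the lab setup table)
10.2.0.195  nn
10.2.0.196  nn2
10.2.0.197  dn1
10.2.0.198  dn2
10.2.0.199  dn3
```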


HADOOP CONFIGURATION
Finally we have arrived at this point where we are going to begin with hadoop configuration. All the configuration files are present in /usr/local/hadoop/conf directory.

Hadoop Environment - hadoop-env.sh
First of all we will edit the hadoop-env.sh file, which defines the overall runtime parameters for hadoop. Mainly we need to set the JAVA_HOME parameter to point to the java installation. If IPv6 is enabled on your machines, set the preferIPv4Stack option to true, as hadoop does not work over IPv6. I have also placed the hadoop log directory under /var/log/ for my own convenience.

huser@nn:~$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_LOG_DIR=/var/log/hadoop-log


After defining the variables we need to create the hadoop-log directory in /var/log location on all nodes with the required “huser” user ownership and permissions.

huser@nn:~$ sudo mkdir /var/log/hadoop-log
huser@nn:~$ sudo chown -R huser:hadoop /var/log/hadoop-log

NOTE: It is necessary to modify hadoop-env.sh with same parameters across all nodes in the cluster and /var/log/hadoop-log directory has to be created on all nodes with required “huser” user ownership and permissions

Default Filesystem - core-site.xml
Next we move to the core-site.xml file, which points the datanodes to the namenode and tells them which port to listen on. It contains core configuration properties (such as I/O settings) that hadoop reads at startup; these settings are common to HDFS and MapReduce.

Here we will configure a parameter called fs.default.name. Clients use this parameter to access the HDFS filesystem, or more precisely the metadata for a requested file. fs.default.name is set to an HDFS URI of the form hdfs://&lt;authority&gt;:&lt;port&gt;; the URI identifies the filesystem, and a path specifies the location of a file or directory within it. As the name suggests, it is the default URI for all filesystem requests in hadoop: clients use it to find the default filesystem, and the namenode uses it to read block addresses. It basically specifies the hostname and the bound port number of the HDFS filesystem; hence it has to be defined on all nodes.

The next parameter we will configure here is hadoop.tmp.dir, which I have used to specify the base for temporary directories, both locally and in HDFS.
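Putting the two parameters together, the core-site.xml on every node would look something like the sketch below (the hostname nn and port 10001 are my lab choices; adjust them to your setup):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default filesystem URI: namenode hostname and port -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn:10001</value>
  </property>
  <!-- Base for hadoop's temporary directories -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
```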


Create the tmp dir as specified in the core-site.xml file.

huser@nn:~$ mkdir /usr/local/hadoop/tmp

NOTE: Copy this file to all nodes. We can use either the IP address or the hostname of the namenode, and any free port number. The commonly used port is 9000, but I have chosen 10001. The hadoop.tmp.dir directory (/usr/local/hadoop/tmp) has to be created on all nodes.

MapReduce Framework – mapred-site.xml
This file is used to point all the task-trackers to the job-tracker. The parameter mapred.job.tracker sets the hostname (or IP) and port of the job-tracker.
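A minimal mapred-site.xml along these lines should work (nn and port 10002 are my lab choices):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hostname and port of the job-tracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>nn:10002</value>
  </property>
</configuration>
```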


NOTE: Copy this file to all nodes. Port number 10002 is my choice, it is not mandatory to use the same port number.

HDFS Configuration – hdfs-site.xml
The hdfs-site.xml file contains various parameters that configure HDFS.
The parameters “dfs.name.dir” and “dfs.data.dir” define the storage locations for the HDFS daemons: “dfs.name.dir” is the directory where the namenode stores its metadata, and “dfs.data.dir” is the directory where the datanodes store their HDFS data blocks. The “dfs.replication” parameter defines the number of replicas kept for each block of data.
Since our cluster is a multi-node cluster, it is important to note that the hdfs-site.xml file differs between the namenode and the datanodes. The namenode carries only the “dfs.name.dir” parameter; the separate datanode machines carry only the “dfs.data.dir” parameter. The “dfs.replication” parameter is common and the same on all nodes.

hdfs-site.xml configuration for namenode
huser@nn:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
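On the namenode the file carries only the name directory and the replication factor; a sketch follows (the replication factor of 3, one copy per datanode, is my assumption):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the namenode stores its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <!-- Number of replicas kept per block (assumed: 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```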


As specified in the hdfs-site.xml file, we need to create the hdfs and name directories on the namenode with the appropriate “huser” user ownership and permissions.
huser@nn:~$ sudo mkdir -p /hdfs/name
huser@nn:~$ sudo chown -R huser:hadoop /hdfs

hdfs-site.xml configuration for all datanodes
huser@dn1:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
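On the datanodes the file carries the data directory instead; a sketch (again assuming a replication factor of 3):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where this datanode stores its HDFS data blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <!-- Number of replicas kept per block (assumed: 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```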


Similarly, on all datanodes we need to create the hdfs and data directories.
huser@dn1:~$ sudo mkdir -p /hdfs/data
huser@dn1:~$ sudo chown -R huser:hadoop /hdfs

NOTE: The hdfs-site.xml file is different in both namenode and datanode because I am not running datanode daemon on the namenode and we are trying to create a fully distributed mode cluster.

conf/masters
The masters file contains the location where the secondary namenode daemon would start.
huser@nn:~$ vi /usr/local/hadoop/conf/masters
nn2

NOTE
We have configured a separate machine for the secondary namenode, hence we define it explicitly here. There is no need to copy this file to the other nodes.

conf/slaves
The slaves file contains the list of datanodes where the datanode and task-tracker daemons will run.
huser@nn:~$ vi /usr/local/hadoop/conf/slaves
dn1
dn2
dn3

NOTE: The file should contain one entry per line. We can also mention “namenode” and “secondary namenode” hostname in this file if we want to run datanode and task-tracker daemons on “namenode” and “secondary namenode” too.

STARTING HADOOP CLUSTER
Formatting HDFS via namenode
Before we start our cluster we will format HDFS via the namenode. Formatting the namenode initializes the directory specified in the “dfs.name.dir” parameter of the hdfs-site.xml file; after formatting, directories such as “current” and “image” are created under it. The data directories on the datanodes are initialized when the daemons first start.

huser@nn:~$ hadoop namenode -format

NOTE: We need to format the namenode only the first time we setup hadoop cluster. Formatting a running cluster will destroy all the existing data. After executing the format command, it will prompt for confirmation where we need to type “Y” as it is case-sensitive.

Starting the Multi-Node Hadoop Cluster
Starting a hadoop cluster can be done by a single command mentioned below.
huser@nn:~$ start-all.sh

This command first starts the HDFS daemons: the namenode daemon on the namenode, the datanode daemon on all datanodes, and the secondary namenode daemon on the secondary namenode. In the second phase it starts the MapReduce daemons: the job-tracker on the namenode and task-trackers on the datanodes.

JPS (Java Virtual Machine Process Status Tool) can be used to verify that all the daemons are running on their respective nodes; execute the “jps” command on each node to see the running java processes.

Namenode java processes


Datanode java processes


Secondary namenode java processes


NOTE: Check for errors in /var/log/hadoop-log/ directory in case there is any missing java process.

Checking Ports Status
To check whether hadoop is listening to the configured ports execute the below command.

huser@nn:~$ sudo netstat -ntlp | grep java


We can see in the above output that the namenode web UI is listening on port 50070 and the job-tracker web UI on port 50030.

Checking HDFS Status
huser@nn:~$ hadoop dfsadmin -report
Configured Capacity: 14713540608 (13.7 GB)
Present Capacity: 13865287680 (12.91 GB)
DFS Remaining: 13865164800 (12.91 GB)
DFS Used: 122880 (120 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------

Datanodes available: 3 (3 total, 0 dead)

Name: 10.2.0.199:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600(4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:49 IST 2014

Name: 10.2.0.198:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600(4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:49 IST 2014

Name: 10.2.0.197:50010
Decommission Status : Normal
Configured Capacity: 4904513536 (4.57 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 282750976 (269.65 MB)
DFS Remaining: 4621721600(4.3 GB)
DFS Used%: 0%
DFS Remaining%: 94.23%
Last contact: Sun May 18 12:11:48 IST 2014
huser@nn:~$

Checking HDFS content
Hadoop provides Linux-like filesystem commands to check and manage HDFS content.
Listing files in /
huser@nn:~$ hadoop fs -ls /



Creating directory
huser@nn:~$ hadoop fs -mkdir /user

Copying Files to HDFS

Before we run a mapreduce job we will copy some text files from our local filesystem into the HDFS.
huser@nn:~$ hadoop dfs -copyFromLocal ~/Books/ /user/huser/Books/


MONITORING
Both HDFS and MapReduce provide Web-UI Management websites to browse the HDFS, monitor the logs and jobs.
HDFS: http://nn:50070
The DFSHealth site allows you to browse the HDFS, monitor namenode logs, view nodes, check the space used, etc.

MapReduce
Job-Tracker: http://nn:50030
The job-tracker site provides the information regarding running map and reduce tasks, running and completed jobs and much more.

DECOMMISSIONING NODES
If we want to bring a datanode down for whatever reason, we have to decommission it. To decommission a datanode we create a file named excludes in the hadoop conf directory (/usr/local/hadoop/conf/) and list the hostnames of the datanodes we want to decommission.
huser@nn:~$ vi /usr/local/hadoop/conf/excludes
dn3

Next we need to edit the core-site.xml file to provide the excludes file location in a property setting.
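In Hadoop 1.x the relevant property is dfs.hosts.exclude; added to the configuration it would look like this:

```xml
<!-- Path to the file listing datanodes to decommission -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/usr/local/hadoop/conf/excludes</value>
</property>
```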


Similarly we can have an includes file too. Now we can restart the namenode, or just refresh the configuration using the command below.
huser@nn:~$ hadoop dfsadmin -refreshNodes

We can see the decommissioned nodes under “Decommissioning Nodes” section on namenode webUI (http://nn:50070).

So here we discussed creating a fully distributed multi-node Hadoop cluster on Ubuntu 14.04. I hope it was easy to understand and implement. In coming posts we will discuss running a mapreduce job on our Hadoop cluster. Happy Hadooping!!!


Single-Node Hadoop Cluster on Ubuntu 14.04

In this tutorial I will demonstrate how to install and run Single-Node Hadoop Cluster in Ubuntu 14.04.

JAVA INSTALLATION
As a requirement java needs to be installed.
user@hadoop-lab:~$ sudo apt-get install openjdk-7-jdk

HADOOP USER & GROUP CREATION
Create a dedicated user account and group for hadoop.
user@hadoop-lab:~$ sudo groupadd hadoop
user@hadoop-lab:~$ sudo useradd -m -d /home/huser/ -g hadoop huser
user@hadoop-lab:~$ sudo passwd huser

Login as “huser” user to do further configurations.
user@hadoop-lab:~$ su - huser

SSH INSTALLATION & PASSPHRASELESS SSH CONFIGURATION
The master node manages slave nodes (starting and stopping services) using SSH.
huser@hadoop-lab:~$ sudo apt-get install openssh-server
huser@hadoop-lab:~$ ssh-keygen -t rsa
huser@hadoop-lab:~$ cat /home/huser/.ssh/id_rsa.pub >> /home/huser/.ssh/authorized_keys

Testing SSH Setup
huser@hadoop-lab:~$ ssh localhost

HADOOP INSTALLATION
Download Apache Hadoop Release 1.2.1 from Apache Download Mirrors site into huser home directory, extract it and assign hadoop user & group ownership.
huser@hadoop-lab:~$ tar -xzvf hadoop-1.2.1.tar.gz
huser@hadoop-lab:~$ sudo mv /home/huser/hadoop-1.2.1 /usr/local/hadoop/
huser@hadoop-lab:~$ sudo chown -R huser:hadoop /usr/local/hadoop/

HADOOP USER ENVIRONMENT CONFIGURATION
Set user environment variables for java and hadoop home directories.
huser@hadoop-lab:~$ vi .bashrc
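The same environment variables as in the multi-node tutorial apply here:

```shell
###HADOOP ENVIRONMENT###
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
```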


HADOOP CONFIGURATION
Hadoop Environment Setup
Setup Java Home for Hadoop and disable IPv6.
huser@hadoop-lab$ vi /usr/local/hadoop/conf/hadoop-env.sh
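At minimum, set JAVA_HOME and prefer the IPv4 stack (hadoop does not work over IPv6):

```shell
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
```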


The main hadoop configurations are stored in 3 files listed below.
core-site.xml
mapred-site.xml
hdfs-site.xml

core-site.xml
Contains default values for core Hadoop properties.
huser@hadoop-lab$ vi /usr/local/hadoop/conf/core-site.xml
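For a single-node cluster everything points at localhost; a sketch follows (the port 10001 is an assumption, mirroring the multi-node tutorial above):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default filesystem URI (port assumed) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:10001</value>
  </property>
  <!-- Base for hadoop's temporary directories -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
```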


huser@hadoop-lab:~$ mkdir /usr/local/hadoop/tmp

mapred-site.xml
Contains configuration information for MapReduce properties.
huser@hadoop-lab$ vi /usr/local/hadoop/conf/mapred-site.xml
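A minimal sketch, again assuming port 10002 as in the multi-node tutorial:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Job-tracker address (port assumed) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10002</value>
  </property>
</configuration>
```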


hdfs-site.xml
Contains server side configuration for Hadoop Distributed File System
huser@hadoop-lab$ vi /usr/local/hadoop/conf/hdfs-site.xml
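On a single node the file carries both storage directories; a replication factor of 1 is the usual choice for a single-node cluster (my assumption here):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Namenode metadata directory -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <!-- Datanode block storage directory -->
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <!-- Single node, so one replica per block (assumed) -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```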


huser@hadoop-lab:~$ sudo mkdir -p /hdfs/name
huser@hadoop-lab:~$ sudo mkdir /hdfs/data
huser@hadoop-lab:~$ sudo chown -R huser:hadoop /hdfs

FORMATTING NAMENODE
Before we start adding files to HDFS we need to format it. Type 'Y' for the prompt. Y/N prompt is case-sensitive.
huser@hadoop-lab:~$ hadoop namenode -format

STARTING SERVICES
After the namenode has been formatted, it is time to launch hadoop.
huser@hadoop-lab:~$ start-all.sh

As hadoop is written in java, we can use the JPS (Java Process Status) tool to check the processes running in the JVM.
huser@hadoop-lab:~$ jps
The output should look something like shown below in screenshot.


We can also have a look at the web interfaces of HDFS and MapReduce.
HDFS : http://localhost:50070



MapReduce : http://localhost:50030


That's it. I hope you enjoyed building a Single-Node Hadoop Cluster on Ubuntu 14.04. If you have any suggestions, questions or comments, please leave a comment.

Follow the below links to create:
Multi-Node Hadoop Cluster on Ubuntu 14.04

Wednesday, May 28, 2014

Multi-Node Hadoop Cluster on Oracle Solaris 11 using Zones

This tutorial demonstrates how to setup an Apache Hadoop 1.2.1 Cluster using Oracle Solaris 11.1 Virtualization Technology or Zones.
I am running this setup inside Oracle VM VirtualBox 4.3.10 on Ubuntu 12.04, and the guest machine is running Oracle Solaris 11. The namenode will run inside the global zone, and we will configure three guest zones with almost the same configuration: a separate secondary namenode and two datanodes.

LAB SETUP
Nodename             Zone          Hostname   IP-Address
Namenode             Global Zone   nn         10.0.2.15
Secondary Namenode   Guest Zone    nn2        10.0.2.16
Datanode1            Guest Zone    dn1        10.0.2.17
Datanode2            Guest Zone    dn2        10.0.2.18





NOTE: A few configurations need to be done on all nodes. I will show them for a single node, as the commands and steps are the same for all the others.

HADOOP USER & GROUP CONFIGURATION
A Solaris installation is itself the “global zone”. So we have a global zone, in which, say, you created a “user” account. Though we could use that user to set up our Hadoop cluster, we will use a dedicated user named “huser” to keep the hadoop tasks clearly separated.

Creating hadoop group
user$ sudo groupadd hadoop

Creating a hadoop user “huser” and place the user in hadoop group
user$ sudo useradd -m -d /export/home/huser -g hadoop huser

Setting the password for the user “huser”
user$ sudo passwd huser

By default, Oracle Solaris 11.1 does not create users with sudo permissions. Hence we need to assign “huser” user sudo permissions.
user@nn:~$ sudo vi /etc/sudoers
huser ALL=(ALL:ALL) ALL

After adding the user we continue our configurations using “huser” logins.
user@nn:~$ su - huser

NOTE: User and group creation, and configuring sudo permissions is done on the Global Zone which will be configured as namenode. The same will be done again on other zones also after we configure secondary namenode and datanodes zones separately.

HADOOP INSTALLATION ON GLOBAL ZONE (NAMENODE)
huser@nn:~$ sudo mkdir -p /usr/local/
huser@nn:~$ sudo chown -R huser:hadoop /usr/local/
huser@nn:~$ cd /tmp
huser@nn:~$ sudo tar -xzvf hadoop-1.2.1.tar.gz
huser@nn:~$ sudo mv /tmp/hadoop-1.2.1 /usr/local/hadoop

NOTE: Hadoop installation is done only on namenode because in Oracle Solaris 11 Zones we have a facility to mount and share the data that is same across different zones. Hence it is important to install hadoop before we go for secondary namenode and datanode zones configuration.

CREATING A LOCAL REPOSITORY

ZONES CONFIGURATION
Creating virtual network interfaces for each zone
huser@nn:~$ sudo dladm create-vnic -l net0 nn2
huser@nn:~$ sudo dladm create-vnic -l net0 dn1
huser@nn:~$ sudo dladm create-vnic -l net0 dn2

huser@nn:~$ dladm show-link


Next we will create a zfs dataset for all zones.
huser@nn:~$ sudo zfs create -o mountpoint=/zonefs rpool/zonefs

NOTE: The zfs dataset for zones should not be rpool/ROOT dataset or immediately under the global zone filesystem root ("/") dataset.

Creating secondary namenode zone
huser@nn:~$ sudo zonecfg -z nn2
Use 'create' to begin configuring a new zone.
zonecfg:nn2> create
create: Using system default template 'SYSdefault'
zonecfg:nn2> set autoboot=true
zonecfg:nn2> set zonepath=/zonefs/nn2
zonecfg:nn2> add fs
zonecfg:nn2:fs> set dir=/usr/local/hadoop
zonecfg:nn2:fs> set special=/usr/local/hadoop
zonecfg:nn2:fs> set type=lofs
zonecfg:nn2:fs> set options=[ro,nodevices]
zonecfg:nn2:fs> end
zonecfg:nn2> add net
zonecfg:nn2:net> set physical=nn2
zonecfg:nn2:net> end
zonecfg:nn2> verify
zonecfg:nn2> commit
zonecfg:nn2> exit

Creating datanode zones
huser@nn:~$ sudo zonecfg -z dn1
Use 'create' to begin configuring a new zone.
zonecfg:dn1> create
create: Using system default template 'SYSdefault'
zonecfg:dn1> set autoboot=true
zonecfg:dn1> set zonepath=/zonefs/dn1
zonecfg:dn1> add fs
zonecfg:dn1:fs> set dir=/usr/local/hadoop
zonecfg:dn1:fs> set special=/usr/local/hadoop
zonecfg:dn1:fs> set type=lofs
zonecfg:dn1:fs> set options=[ro,nodevices]
zonecfg:dn1:fs> end
zonecfg:dn1> add net
zonecfg:dn1:net> set physical=dn1
zonecfg:dn1:net> end
zonecfg:dn1> verify
zonecfg:dn1> commit
zonecfg:dn1> exit

huser@nn:~$ sudo zonecfg -z dn2
Use 'create' to begin configuring a new zone.
zonecfg:dn2> create
create: Using system default template 'SYSdefault'
zonecfg:dn2> set autoboot=true
zonecfg:dn2> set zonepath=/zonefs/dn2
zonecfg:dn2> add fs
zonecfg:dn2:fs> set dir=/usr/local/hadoop
zonecfg:dn2:fs> set special=/usr/local/hadoop
zonecfg:dn2:fs> set type=lofs
zonecfg:dn2:fs> set options=[ro,nodevices]
zonecfg:dn2:fs> end
zonecfg:dn2> add net
zonecfg:dn2:net> set physical=dn2
zonecfg:dn2:net> end
zonecfg:dn2> verify
zonecfg:dn2> commit
zonecfg:dn2> exit

All the zones that are created can be listed using the below command.
huser@nn:~$ zoneadm list -cv


Installing zones
huser@nn:~$ sudo zoneadm -z nn2 install
huser@nn:~$ sudo zoneadm -z dn1 install
huser@nn:~$ sudo zoneadm -z dn2 install


It takes time to install the zones depending on your hardware setup.

We can see the list of all installed zones in the below screenshot.


Booting and configuring zones
huser@nn:~$ sudo zoneadm -z nn2 boot
huser@nn:~$ sudo zlogin -C nn2
huser@nn:~$ sudo zoneadm -z dn1 boot
huser@nn:~$ sudo zlogin -C dn1
huser@nn:~$ sudo zoneadm -z dn2 boot
huser@nn:~$ sudo zlogin -C dn2


NOTE: After executing the boot command, we can access the console of each zone using the “zlogin” command. Here we perform the system configuration as shown in the “Lab Setup” table. Additionally, we specify the “user” account on the user-configuration screen, without DNS and authentication, just as we configured the global zone during installation. After the system configuration is done, we log in as “user” on all nodes.


Create the "huser" user and "hadoop" group on all nodes as done earlier in the global zone. If the installation and configuration are done differently than in the global zone, make sure the "huser" user-id and "hadoop" group-id on the datanodes are identical to those on the namenode.

HOSTS FILE CONFIGURATION
It is necessary to populate the hosts file on all nodes with the IP addresses of all the other nodes.
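Based on the lab setup table above, the entries would look roughly like this on every node:

```text
# Hadoop cluster nodes (from the lab setup table)
10.0.2.15  nn
10.0.2.16  nn2
10.0.2.17  dn1
10.0.2.18  dn2
```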


PASSWORDLESS SSH CONFIGURATION
Generate an SSH key pair as “huser” on the namenode (ssh-keygen -t rsa) and append the public key to ~/.ssh/authorized_keys for “huser” on nn2, dn1 and dn2, as shown in the Ubuntu tutorial above. Verify that you can ssh from the namenode to every zone without a password.
OPENJDK 7 INSTALLATION
I will be using openjdk-7-jdk package which installs easily.
huser@nn:~$ sudo pkg install --accept developer/java/jdk-7


NOTE: OpenJDK installation has to be done on all nodes.

USER ENVIRONMENT CONFIGURATION
As we have already logged in as “huser” user we will edit the .bashrc file in the home directory of “huser” to set the user environment and paths.
huser@nn:~$ vi .bashrc
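A sketch of the variables, using the Solaris java location that also appears later in hadoop-env.sh:

```shell
###HADOOP ENVIRONMENT###
export JAVA_HOME=/usr/java
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
```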


Execute the .bashrc file
huser@nn:~$ exec bash

NOTE: User environment has to be configured on all nodes to declare the essential variables for both java and hadoop mainly.

HADOOP CONFIGURATION
Here we will begin our hadoop configuration. All the configuration files are present in the “/usr/local/hadoop/conf” directory.

NOTE: All configurations have to be done only on namenode because we have already shared and mounted the global zone "/usr/local/hadoop" mountpoint onto all other guest zones.

Hadoop Environment - hadoop-env.sh
huser@nn:~$ vi /usr/local/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/java
export HADOOP_LOG_DIR=/var/log/hadoop-log

Create "hadoop-log" directory as specified in "HADOOP LOG DIR" variable in the hadoop-env.sh configuration file
huser@nn:~$ sudo mkdir /var/log/hadoop-log
huser@nn:~$ sudo chown -R huser:hadoop /var/log/hadoop-log


NOTE: /var/log/hadoop-log directory has to be created on all nodes with required “huser” user and group ownership and permissions.

Default Filesystem - core-site.xml
The core-site.xml file points the datanodes to the namenode and tells them which port to listen on.
huser@nn:~$ vi /usr/local/hadoop/conf/core-site.xml
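A sketch of the file, using the port 10001 mentioned in the note below (the tmp directory path is my assumption, mirroring the Ubuntu tutorial):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default filesystem URI: global zone namenode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://nn:10001</value>
  </property>
  <!-- Base for hadoop's temporary directories (path assumed) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>
```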


NOTE: We can use either the IP address or the hostname of the namenode, and any free port number. The commonly used port is 9000, but I have chosen 10001. The hadoop.tmp.dir directory has to be created only on the namenode global zone.

MapReduce Framework – mapred-site.xml
This file is used to point all the task-trackers to the job-tracker. The parameter mapred.job.tracker sets the hostname (or IP) and port of the job-tracker.
huser@nn:~$ vi /usr/local/hadoop/conf/mapred-site.xml
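A minimal sketch, using the port 10002 mentioned in the note below:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hostname and port of the job-tracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>nn:10002</value>
  </property>
</configuration>
```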


NOTE: Port number 10002 is my choice, it is not mandatory to use the same port number.

HDFS Configuration – hdfs-site.xml
The hdfs-site.xml file contains various parameters that configure HDFS.
huser@nn:~$ vi /usr/local/hadoop/conf/hdfs-site.xml
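Since the configuration directory is shared across all zones, a single file carries both storage directories here; each daemon reads only the parameter it needs. The replication factor of 2, one copy per datanode, is my assumption:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Namenode metadata directory (used by the global zone) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/hdfs/name</value>
  </property>
  <!-- Datanode block storage directory (used by the datanode zones) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/hdfs/data</value>
  </property>
  <!-- Number of replicas kept per block (assumed: 2) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```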


As specified in the hdfs-site.xml file, we need to create the "hdfs" and "name" directories on the namenode with appropriate “huser” user and group ownership.
Namenode Configuration
huser@nn:~$ sudo mkdir -p /hdfs/name
huser@nn:~$ sudo chown -R huser:hadoop /hdfs

All Datanodes Configuration
huser@dn1:~$ sudo mkdir -p /hdfs/data
huser@dn1:~$ sudo chown -R huser:hadoop /hdfs

NOTE: Repeat the datanodes configuration steps in all datanode zones.

conf/masters
The masters file contains the location where the secondary namenode daemon would start.
huser@nn:~$ vi /usr/local/hadoop/conf/masters
nn2

NOTE: We have configured separate machine for the secondary namenode hence we are defining secondary namenode explicitly here.

conf/slaves
The slaves file contains the list of datanodes where the datanode and task-tracker daemons will run.
huser@nn:~$ vi /usr/local/hadoop/conf/slaves
dn1
dn2

NOTE: The file should contain one entry per line. We can also mention “namenode” and “secondary namenode” hostname in this file if we want to run datanode and task-tracker daemons on “namenode” and “secondary namenode” too.

STARTING HADOOP CLUSTER
Formatting HDFS via namenode
Before we start our cluster we will format HDFS via the namenode. Formatting the namenode initializes the directory specified in the “dfs.name.dir” parameter of the hdfs-site.xml file; after formatting, directories such as “current” and “image” are created under it. The data directories on the datanodes are initialized when the daemons first start.

huser@nn:~$ hadoop namenode -format

NOTE: We need to format the namenode only the first time we setup hadoop cluster. Formatting a running cluster will destroy all the existing data. After executing the format command, it will prompt for confirmation where we need to type “Y” as it is case-sensitive.

Starting the Multi-Node Hadoop Cluster
Starting a hadoop cluster can be done by a single command mentioned below.
huser@nn:~$ start-all.sh

This command first starts the HDFS daemons: the namenode daemon on the namenode, the datanode daemon on all datanodes, and the secondary namenode daemon on the secondary namenode. In the second phase it starts the MapReduce daemons: the job-tracker on the namenode and task-trackers on the datanodes.

Execute the “jps” command on all nodes to see the java processes running:
“NameNode” and “JobTracker” on the namenode,
“SecondaryNameNode” in the secondary namenode zone, and
“DataNode” and “TaskTracker” in the datanode zones.


MONITORING HADOOP
Both HDFS and MapReduce provide Web-UI Management websites to browse the HDFS, monitor the logs and jobs.

HDFS: http://nn:50070
The DFSHealth site allows you to browse the HDFS, monitor namenode logs, view nodes, check the space used, etc.


MapReduce
Job-Tracker: http://nn:50030

The job-tracker site provides the information regarding running map and reduce tasks, running and completed jobs and much more.


Here we have completed creating a Multi-Node Hadoop Cluster using Oracle Solaris 11 Zones.