Federation Concepts
“HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS cluster to new implementations and use cases.”
- An excerpt from a Hortonworks blog by Suresh Srinivas. Find more on Apache Hadoop Federation.
Setup Summary
I have implemented Hadoop Federation inside an ESXi server to create a fully distributed cluster on Apache Hadoop 2.6. The setup consists of two namenodes and two datanodes, each allotted 1GB RAM and 20GB HDD. I have also configured an HDFS client on a separate machine with the same hardware configuration. All the nodes run Ubuntu 14.10. It is also recommended to use the latest Oracle JDK on the namenodes and datanodes. This setup only demonstrates how to build a federated cluster, hence YARN is not configured.
The hostnames and IP addresses below are just an example, meant to provide a distinctive view for understanding the cluster setup.
namenode1    fed-nn01      192.168.56.101
namenode2    fed-nn02      192.168.56.102
datanode1    fed-dn01      192.168.56.103
datanode2    fed-dn02      192.168.56.104
client       fed-client    192.168.56.105
Downloads
Download the below packages and place the tarballs on all namenode, datanode and client machines.
Apache Hadoop 2.6 Download
Oracle JDK8 Download
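If you prefer the command line, the Hadoop tarball can be fetched from the Apache archive; a sketch, assuming the archive path is still valid (the Oracle JDK must be downloaded from Oracle's site after accepting the license, so no direct URL is given):
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ scp hadoop-2.6.0.tar.gz fed-nn01:~/    # repeat for every node in the cluster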
Note: Before moving directly to the hadoop installation and configuration -
1. Disable the firewall on all namenode, datanode and client machines.
2. Comment out the 127.0.1.1 address line in the /etc/hosts file on all namenode, datanode and client machines.
3. Add the correct hostnames and their respective IP addresses of all namenode, datanode and client machines to the /etc/hosts file on each of them (a sketch follows this list).
4. Install ssh on all namenodes and datanodes. It is not required on the client machine.
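For reference, a minimal /etc/hosts sketch for this cluster, assuming the example addresses above (the commented 127.0.1.1 line corresponds to note 2):
127.0.0.1       localhost
#127.0.1.1      fed-nn01
192.168.56.101  fed-nn01
192.168.56.102  fed-nn02
192.168.56.103  fed-dn01
192.168.56.104  fed-dn02
192.168.56.105  fed-client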
INSTALLATION & CONFIGURATION
Hadoop installation and configuration includes setting up a hadoop user, installing Java, configuring passwordless ssh, and finally installing, configuring and monitoring Hadoop.
User Configuration
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
First we will create a group “hadoop” and a user “huser” for all hadoop administrative tasks, and set a password for huser.
$ sudo groupadd hadoop
$ sudo useradd -m -d /home/huser -g hadoop huser
$ sudo passwd huser
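Since the steps below run administrative commands through sudo as huser, also make sure huser has sudo rights. A sketch, assuming Ubuntu's default sudo group (the uid/gid values shown are illustrative):
$ sudo usermod -aG sudo huser
$ id huser
uid=1001(huser) gid=1001(hadoop) groups=1001(hadoop),27(sudo)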
Note: From here onwards we shall use the newly created huser for all hadoop tasks that we perform.
Java Installation
We will go with the recommended Oracle JDK installation and configuration. At the time of writing this document, JDK8 was the latest available.
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02
Make a directory named java inside /usr.
huser:~$ sudo mkdir /usr/java
Copy the tarball.
huser:~$ sudo cp /path/to/jdk-8u25-linux-x64.tar.gz /usr/java
Extract the tarball inside /usr/java.
huser:~$ sudo tar -xzvf /usr/java/jdk-8u25-linux-x64.tar.gz -C /usr/java
Set the environment for jdk8.
huser:~$ sudo vi /etc/profile.d/java.sh
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
huser:~$ sudo chmod +x /etc/profile.d/java.sh
huser:~$ source /etc/profile.d/java.sh
Testing Java Installation
huser:~$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Note: The Java installation on the client can be skipped if it already has OpenJDK installed. If not, the above method can be repeated for the client as well.
Passwordless SSH Configuration
Passwordless ssh is required only by the namenodes, to start the HDFS and MapReduce daemons on the various nodes.
Location: fed-nn01
huser@fed-nn01:~$ ssh-keygen -t rsa
huser@fed-nn01:~$ touch /home/huser/.ssh/authorized_keys
huser@fed-nn01:~$ cat /home/huser/.ssh/id_rsa.pub >> /home/huser/.ssh/authorized_keys
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn02
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02
Location: fed-nn02
huser@fed-nn02:~$ ssh-keygen -t rsa
huser@fed-nn02:~$ touch /home/huser/.ssh/authorized_keys
huser@fed-nn02:~$ cat /home/huser/.ssh/id_rsa.pub >> /home/huser/.ssh/authorized_keys
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn01
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02
Testing passwordless ssh on fed-nn01 & fed-nn02
Run the below commands to test passwordless logins from both fed-nn01 & fed-nn02.
huser:~$ ssh fed-nn01
huser:~$ ssh fed-nn02
huser:~$ ssh fed-dn01
huser:~$ ssh fed-dn02
Note: The client does not require any ssh settings to be configured. The reason is that the client communicates with the namenode over the namenode's configurable TCP port, which is an RPC connection. The client communicates with the datanodes directly for I/O operations (read/write) using the Data Transfer Protocol defined in DataTransferProtocol.java; for performance reasons this is a streaming protocol, not RPC. Before streaming a block to the datanode, the client buffers the data until a full block (64MB/128MB) has been created.
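All the client therefore needs is network reachability to those ports. A quick sketch of such a check using netcat, assuming the default Hadoop 2.x ports (8020 for the namenode RPC, 50010 for datanode data transfer):
huser@fed-client:~$ nc -zv fed-nn01 8020
huser@fed-client:~$ nc -zv fed-dn01 50010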
Hadoop Installation
We will proceed with the latest stable Apache Hadoop 2.6.0 release.
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
We will untar the tarball and place it in the /opt directory.
huser:~$ sudo tar -xzvf hadoop-2.6.0.tar.gz -C /opt/
Changing the ownership of the directory to huser.
huser:~$ sudo chown -R huser:hadoop /opt/hadoop-2.6.0/
Setting the environment variables in the .bashrc file of the “huser” user.
huser:~$ vi ~/.bashrc
###JAVA CONFIGURATION###
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
###HADOOP CONFIGURATION###
HADOOP_PREFIX=/opt/hadoop-2.6.0/
PATH=$HADOOP_PREFIX/bin:$PATH
PATH=$HADOOP_PREFIX/sbin:$PATH
export PATH
Activate the configured environment settings for the “huser” user by running the below command.
huser:~$ exec bash
Testing Hadoop Installation
huser:~$ hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /opt/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
Note: There is no need to set the Java environment variables in the .bashrc file of the client machine if it already has Java installed.
Hadoop Configuration
There are a few files to be configured to get hadoop up and running. For us these configuration files reside in the /opt/hadoop-2.6.0/etc/hadoop/ directory.
hadoop-env.sh
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
huser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_25/
export HADOOP_LOG_DIR=/var/log/hadoop/
Create a log directory at the path specified by the HADOOP_LOG_DIR parameter in the hadoop-env.sh file, and change the ownership of the directory to the “huser” user.
$ sudo mkdir /var/log/hadoop
$ sudo chown -R huser:hadoop /var/log/hadoop
Note: On the client machine there is no need to specify and create a log directory. Similarly, it is needless to declare Java's home directory if Java is pre-installed.
core-site.xml
Location: fed-nn01
huser@fed-nn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01</value>
</property>
</configuration>
Location: fed-nn02
huser@fed-nn02:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn02</value>
</property>
</configuration>
Location: fed-dn01, fed-dn02
huser@fed-dn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01,hdfs://fed-nn02</value>
</property>
</configuration>
Location: fed-client
huser@fed-client:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>viewfs://fedcluster/</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns1</name>
<value>hdfs://fed-nn01:8020</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns2</name>
<value>hdfs://fed-nn02:8020</value>
</property>
</configuration>
Note: The client is configured to use the ViewFS plugin for a more user-friendly and aggregated view of the HDFS namespaces, mounted under the respective /ns1 and /ns2 mountpoints. Here we have given the mount table the name “fedcluster”; if it is not specified, “default” is taken.
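As a quick sanity check, the configured mount table links can be read back on the client with hdfs getconf; the key names mirror the properties set above:
huser@fed-client:~$ hdfs getconf -confKey fs.viewfs.mounttable.fedcluster.link./ns1
hdfs://fed-nn01:8020
huser@fed-client:~$ hdfs getconf -confKey fs.viewfs.mounttable.fedcluster.link./ns2
hdfs://fed-nn02:8020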
hdfs-site.xml
Location: fed-nn01, fed-nn02
huser:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///hdfs/name</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
Create a directory for the namenode to store its persistent metadata and change its ownership to the “huser” user.
$ sudo mkdir -p /hdfs/name
$ sudo chown -R huser:hadoop /hdfs/
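To confirm that both nameservices are picked up from hdfs-site.xml, the configured namenode list can be read back with hdfs getconf (a sketch; it derives the hosts from dfs.nameservices and the rpc-address keys):
huser@fed-nn01:~$ hdfs getconf -namenodes
fed-nn01 fed-nn02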
Location: fed-dn01, fed-dn02
huser:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///hdfs/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
Create a directory for the datanode to store blocks and change its ownership to the “huser” user.
huser:~$
sudo mkdir -p /hdfs/data
huser:~$
sudo chown -R huser:hadoop /hdfs/
Location: fed-client
huser@fed-client:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
slaves
Location: fed-nn01, fed-nn02
huser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/slaves
fed-dn01
fed-dn02
Formatting HDFS Filesystem
Location: fed-nn01, fed-nn02
huser:~$ hdfs namenode -format -clusterID fedcluster
Note: Format both namenodes with the same clusterID so that they join the same federated cluster. If the clusterID is not specified, a random one is chosen by the system.
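After formatting, the clusterID can be verified in the namenode's metadata directory; the VERSION file records it along with the namespace and block pool IDs:
huser@fed-nn01:~$ grep clusterID /hdfs/name/current/VERSION
clusterID=fedcluster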
Starting HDFS
Run the below command on either the fed-nn01 or fed-nn02 namenode to start the dfs.
$ start-dfs.sh
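To verify that the daemons came up, jps (bundled with the JDK) can be run on each node; a sketch of the expected processes (the PIDs will differ, and a SecondaryNameNode may also appear depending on configuration):
huser@fed-nn01:~$ jps
2035 NameNode
2215 Jps
huser@fed-dn01:~$ jps
1987 DataNode
2101 Jps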
At this point, listing files directly on fed-nn01 or fed-nn02 will show only the files present in their respective configured namespaces, i.e. ns1 and ns2. Listing files on the client machine, on the other hand, will show both namespace mountpoints mounted under /. List and copy/retrieve operations against the other namenode's namespace can still be performed by specifying the full HDFS URI in the command.
Examples:
Listing files in fed-nn01 namenode
huser@fed-nn01:~$ hdfs dfs -ls -h /
Found 2 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:23 /512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-02 16:28 /user
Listing files in fed-nn02 namenode from fed-nn01 namenode
huser@fed-nn01:~$ hdfs dfs -ls hdfs://fed-nn02:8020/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://fed-nn02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://fed-nn02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://fed-nn02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://fed-nn02:8020/user/lgb-junk.1
Listing files from the fed-client machine
huser@fed-client:~$ hdfs dfs -ls /
Found 2 items
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns1
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns2
Listing files on fed-nn02 from fed-client
huser@fed-client:~$ hdfs dfs -ls -h /ns2/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://fed-nn02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://fed-nn02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://fed-nn02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://fed-nn02:8020/user/lgb-junk.1
As you can see from the above examples, the client provides an aggregated view of the HDFS filesystems using viewfs. Similarly, we can copy/retrieve and create files/directories from the client machine, as shown below.
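For example, a file can be written into one namespace through the client's viewfs mounts and copied across to the other (a sketch; the file name is a placeholder):
huser@fed-client:~$ hdfs dfs -put localfile.txt /ns1/user/
huser@fed-client:~$ hdfs dfs -cp /ns1/user/localfile.txt /ns2/user/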
Monitoring
The below HTTP links are also helpful for monitoring and browsing the HDFS filesystem using a web browser, from any of the namenode or client machines.
http://fed-nn01:50070/dfshealth.jsp
http://fed-nn02:50070/dfshealth.jsp
http://fed-nn01:50070/dfsclusterhealth.jsp
http://fed-nn02:50070/dfsclusterhealth.jsp
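The cluster state can also be inspected from the command line; for example, a datanode report for a namespace can be pulled from its namenode:
huser@fed-nn01:~$ hdfs dfsadmin -report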
Related Links
Single-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Solaris 11 using Zones
Fully Distributed Hadoop Federation Cluster
Fully Distributed Hadoop Cluster - Manual Failover HA with NFS
Fully Distributed Hadoop Cluster - Automatic Failover HA Cluster with ZooKeeper & NFS