Federation Concepts
“HDFS Federation improves the existing HDFS architecture through a clear separation of namespace and storage, enabling generic block storage layer. It enables support for multiple namespaces in the cluster to improve scalability and isolation. Federation also opens up the architecture, expanding the applicability of HDFS cluster to new implementations and use cases.”
- An excerpt from a Hortonworks blog by Suresh Srinivas. Find more on Apache Hadoop Federation.
Setup Summary
I have implemented Hadoop Federation inside an ESXi server to create a fully distributed cluster on Apache Hadoop 2.6. The setup consists of two namenodes and two datanodes, each allotted 1GB RAM and 20GB HDD. I have also configured an HDFS client on a separate machine with the same hardware configuration. All the nodes run Ubuntu 14.10. It is also recommended to use the latest Oracle JDK on the namenodes and datanodes. This setup only demonstrates how to build a federated cluster, hence YARN is not configured.
The hostnames and IP addresses below are just an example, meant to provide a distinctive view for understanding the cluster setup.
namenode1    fed-nn01      192.168.56.101
namenode2    fed-nn02      192.168.56.102
datanode1    fed-dn01      192.168.56.103
datanode2    fed-dn02      192.168.56.104
client       fed-client    192.168.56.105
Downloads
Download the below packages and place the tarballs on all namenode, datanode and client machines.
Apache Hadoop 2.6 Download
Oracle JDK8 Download
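If you prefer the command line, the Hadoop tarball can be fetched from the Apache archive; a sketch, assuming the archive path is still valid (the Oracle JDK must be downloaded from Oracle's site after accepting the license, so no direct URL is given):
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ scp hadoop-2.6.0.tar.gz fed-nn01:~/    # repeat for every node in the cluster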
Note: Before moving directly to the hadoop installation and configuration -
1. Disable the firewall on all namenode, datanode and client machines.
2. Comment out the 127.0.1.1 address line in the /etc/hosts file on all namenode, datanode and client machines.
3. Add the correct hostnames and their respective IP addresses of all namenode, datanode and client machines to the /etc/hosts file on each of them (a sketch follows this list).
4. Install ssh on all namenodes and datanodes. It is not required on the client machine.
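For reference, a minimal /etc/hosts sketch for this cluster, assuming the example addresses above (the commented 127.0.1.1 line corresponds to note 2):
127.0.0.1       localhost
#127.0.1.1      fed-nn01
192.168.56.101  fed-nn01
192.168.56.102  fed-nn02
192.168.56.103  fed-dn01
192.168.56.104  fed-dn02
192.168.56.105  fed-client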
INSTALLATION & CONFIGURATION
Hadoop installation and configuration includes setting up a hadoop user, installing Java, configuring passwordless ssh, and finally installing, configuring and monitoring Hadoop.
User Configuration
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
First we will create a group “hadoop” and a user “huser” for all hadoop administrative tasks, and set a password for huser.
$ sudo groupadd hadoop
$ sudo useradd -m -d /home/huser -g hadoop huser
$ sudo passwd huser
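Since the steps below run administrative commands through sudo as huser, also make sure huser has sudo rights. A sketch, assuming Ubuntu's default sudo group (the uid/gid values shown are illustrative):
$ sudo usermod -aG sudo huser
$ id huser
uid=1001(huser) gid=1001(hadoop) groups=1001(hadoop),27(sudo)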
Note: From here onwards we shall use the newly created huser for all hadoop tasks that we perform.
Java Installation
We will go with the recommended Oracle JDK installation and configuration. At the time of writing this document, JDK8 was the latest available.
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02
Make a directory named java inside /usr.
huser:~$ sudo mkdir /usr/java
Copy the tarball.
huser:~$ sudo cp /path/to/jdk-8u25-linux-x64.tar.gz /usr/java
Extract the tarball inside /usr/java.
huser:~$ sudo tar -xzvf /usr/java/jdk-8u25-linux-x64.tar.gz -C /usr/java
Set the environment for jdk8.
huser:~$ sudo vi /etc/profile.d/java.sh
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
huser:~$ sudo chmod +x /etc/profile.d/java.sh
huser:~$ source /etc/profile.d/java.sh
Testing Java Installation
huser:~$ java -version
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Note: The Java installation on the client can be skipped if it already has OpenJDK installed. If not, the above method can be repeated for the client as well.
Passwordless SSH Configuration
Passwordless ssh is required only by the namenodes, to start the HDFS and MapReduce daemons on the various nodes.
Location: fed-nn01
huser@fed-nn01:~$ ssh-keygen -t rsa
huser@fed-nn01:~$ touch /home/huser/.ssh/authorized_keys
huser@fed-nn01:~$ cat /home/huser/.ssh/id_rsa.pub >> /home/huser/.ssh/authorized_keys
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn02
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed-nn01:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02
Location: fed-nn02
huser@fed-nn02:~$ ssh-keygen -t rsa
huser@fed-nn02:~$ touch /home/huser/.ssh/authorized_keys
huser@fed-nn02:~$ cat /home/huser/.ssh/id_rsa.pub >> /home/huser/.ssh/authorized_keys
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-nn01
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn01
huser@fed-nn02:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub huser@fed-dn02
Testing passwordless ssh on fed-nn01 & fed-nn02
Run the below commands to test passwordless logins from both fed-nn01 & fed-nn02.
huser:~$ ssh fed-nn01
huser:~$ ssh fed-nn02
huser:~$ ssh fed-dn01
huser:~$ ssh fed-dn02
Note: The client does not require any ssh settings to be configured. The reason is that the client communicates with the namenode over the namenode's configurable TCP port, which is an RPC connection. The client communicates with the datanodes directly for I/O operations (read/write) using the Data Transfer Protocol defined in DataTransferProtocol.java; for performance reasons this is a streaming protocol, not RPC. Before streaming a block to the datanode, the client buffers the data until a full block (64MB/128MB) has been created.
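All the client therefore needs is network reachability to those ports. A quick sketch of such a check using netcat, assuming the default Hadoop 2.x ports (8020 for the namenode RPC, 50010 for datanode data transfer):
huser@fed-client:~$ nc -zv fed-nn01 8020
huser@fed-client:~$ nc -zv fed-dn01 50010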
Hadoop Installation
We will proceed with the latest stable Apache Hadoop 2.6.0 release.
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
We will untar the tarball and place it in the /opt directory.
huser:~$ sudo tar -xzvf hadoop-2.6.0.tar.gz -C /opt/
Changing the ownership of the directory to huser.
huser:~$ sudo chown -R huser:hadoop /opt/hadoop-2.6.0/
Setting the environment variables in the .bashrc file of the “huser” user.
huser:~$ vi ~/.bashrc
###JAVA CONFIGURATION###
JAVA_HOME=/usr/java/jdk1.8.0_25/
PATH=$JAVA_HOME/bin:$PATH
export PATH JAVA_HOME
export CLASSPATH=.
###HADOOP CONFIGURATION###
HADOOP_PREFIX=/opt/hadoop-2.6.0/
PATH=$HADOOP_PREFIX/bin:$PATH
PATH=$HADOOP_PREFIX/sbin:$PATH
export PATH
Activate the configured environment settings for the “huser” user by running the below command.
huser:~$ exec bash
Testing Hadoop Installation
huser:~$ hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /opt/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
Note: There is no need to set the Java environment variables in the .bashrc file of the client machine if it already has Java installed.
Hadoop Configuration
There are a few files to be configured to get hadoop up and running. For us these configuration files reside in the /opt/hadoop-2.6.0/etc/hadoop/ directory.
hadoop-env.sh
Location: fed-nn01, fed-nn02, fed-dn01, fed-dn02, fed-client
huser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_25/
export HADOOP_LOG_DIR=/var/log/hadoop/
Create a log directory at the path specified by the HADOOP_LOG_DIR parameter in the hadoop-env.sh file, and change the ownership of the directory to the “huser” user.
$ sudo mkdir /var/log/hadoop
$ sudo chown -R huser:hadoop /var/log/hadoop
Note: On the client machine there is no need to specify and create a log directory. Similarly, it is needless to declare Java's home directory if Java is pre-installed.
core-site.xml
Location: fed-nn01
huser@fed-nn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01</value>
</property>
</configuration>
Location: fed-nn02
huser@fed-nn02:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn02</value>
</property>
</configuration>
Location: fed-dn01, fed-dn02
huser@fed-dn01:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://fed-nn01,hdfs://fed-nn02</value>
</property>
</configuration>
Location: fed-client
huser@fed-client:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>viewfs://fedcluster/</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns1</name>
<value>hdfs://fed-nn01:8020</value>
</property>
<property>
<name>fs.viewfs.mounttable.fedcluster.link./ns2</name>
<value>hdfs://fed-nn02:8020</value>
</property>
</configuration>
Note: The client is configured to use the ViewFS plugin for a more user-friendly and aggregated view of the HDFS namespaces, mounted under the respective /ns1 and /ns2 mountpoints. Here we have given the mount table the name “fedcluster”; if it is not specified, “default” is taken.
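As a quick sanity check, the configured mount table links can be read back on the client with hdfs getconf; the key names mirror the properties set above:
huser@fed-client:~$ hdfs getconf -confKey fs.viewfs.mounttable.fedcluster.link./ns1
hdfs://fed-nn01:8020
huser@fed-client:~$ hdfs getconf -confKey fs.viewfs.mounttable.fedcluster.link./ns2
hdfs://fed-nn02:8020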
hdfs-site.xml
Location: fed-nn01, fed-nn02
huser:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///hdfs/name</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
Create a directory for the namenode to store its persistent metadata and change its ownership to the “huser” user.
$ sudo mkdir -p /hdfs/name
$ sudo chown -R huser:hadoop /hdfs/
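To confirm that both nameservices are picked up from hdfs-site.xml, the configured namenode list can be read back with hdfs getconf (a sketch; it derives the hosts from dfs.nameservices and the rpc-address keys):
huser@fed-nn01:~$ hdfs getconf -namenodes
fed-nn01 fed-nn02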
Location: fed-dn01, fed-dn02
huser:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///hdfs/data</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
Create a directory for the datanode to store blocks and change its ownership to the “huser” user.
huser:~$
sudo mkdir -p /hdfs/data
huser:~$
sudo chown -R huser:hadoop /hdfs/
Location: fed-client
huser@fed-client:~$ sudo vi /opt/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>fed-nn01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>fed-nn02:8020</value>
</property>
</configuration>
slaves
Location: fed-nn01, fed-nn02
huser:~$ vi /opt/hadoop-2.6.0/etc/hadoop/slaves
fed-dn01
fed-dn02
Formatting HDFS Filesystem
Location: fed-nn01, fed-nn02
huser:~$ hdfs namenode -format -clusterID fedcluster
Note: Format both namenodes with the same clusterID so that they join the same federated cluster. If the clusterID is not specified, a random one is chosen by the system.
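After formatting, the clusterID can be verified in the namenode's metadata directory; the VERSION file records it along with the namespace and block pool IDs:
huser@fed-nn01:~$ grep clusterID /hdfs/name/current/VERSION
clusterID=fedcluster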
Starting HDFS
Run the below command on either the fed-nn01 or fed-nn02 namenode to start the dfs.
$ start-dfs.sh
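To verify that the daemons came up, jps (bundled with the JDK) can be run on each node; a sketch of the expected processes (the PIDs will differ, and a SecondaryNameNode may also appear depending on configuration):
huser@fed-nn01:~$ jps
2035 NameNode
2215 Jps
huser@fed-dn01:~$ jps
1987 DataNode
2101 Jps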
At this point, listing files directly on fed-nn01 or fed-nn02 will show only the files present in their respective configured namespaces, i.e. ns1 and ns2. Listing files on the client machine, on the other hand, will show both namespace mountpoints mounted under /. List and copy/retrieve operations against the other namenode's namespace can still be performed by specifying the full HDFS URI in the command.
Examples:
Listing files in fed-nn01 namenode
huser@fed-nn01:~$ hdfs dfs -ls -h /
Found 2 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:23 /512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-02 16:28 /user
Listing files in fed-nn02 namenode from fed-nn01 namenode
huser@fed-nn01:~$ hdfs dfs -ls hdfs://fed-nn02:8020/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://fed-nn02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://fed-nn02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://fed-nn02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://fed-nn02:8020/user/lgb-junk.1
Listing files from the fed-client machine
huser@fed-client:~$ hdfs dfs -ls /
Found 2 items
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns1
-r-xr-xr-x - ibm ibm 0 2015-01-03 02:55 /ns2
Listing files on fed-nn02 from fed-client
huser@fed-client:~$ hdfs dfs -ls -h /ns2/user
Found 4 items
-rw-r--r-- 2 huser hadoop 128 M 2015-01-02 16:26 hdfs://fed-nn02:8020/user/512mb-junk
drwxr-xr-x - huser hadoop 0 2015-01-03 02:32 hdfs://fed-nn02:8020/user/dir1
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:25 hdfs://fed-nn02:8020/user/lgb-junk
-rw-r--r-- 2 huser hadoop 1 G 2015-01-02 15:31 hdfs://fed-nn02:8020/user/lgb-junk.1
As you can see from the above examples, the client provides an aggregated view of the HDFS filesystems using viewfs. Similarly, we can copy/retrieve and create files/directories from the client machine, as shown below.
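For example, a file can be written into one namespace through the client's viewfs mounts and copied across to the other (a sketch; the file name is a placeholder):
huser@fed-client:~$ hdfs dfs -put localfile.txt /ns1/user/
huser@fed-client:~$ hdfs dfs -cp /ns1/user/localfile.txt /ns2/user/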
Monitoring
The below HTTP links are also helpful for monitoring and browsing the HDFS filesystem using a web browser, from any of the namenode or client machines.
http://fed-nn01:50070/dfshealth.jsp
http://fed-nn02:50070/dfshealth.jsp
http://fed-nn01:50070/dfsclusterhealth.jsp
http://fed-nn02:50070/dfsclusterhealth.jsp
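The cluster state can also be inspected from the command line; for example, a datanode report for a namespace can be pulled from its namenode:
huser@fed-nn01:~$ hdfs dfsadmin -report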
Related Links
Single-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Ubuntu 14.04
Multi-Node Hadoop Cluster on Solaris 11 using Zones
Fully Distributed Hadoop Federation Cluster
Fully Distributed Hadoop Cluster - Manual Failover HA with NFS
Fully Distributed Hadoop Cluster - Automatic Failover HA Cluster with ZooKeeper & NFS