Monday, May 01, 2017

Cloudera Multi-Node Cluster Installation

Here we are going to setup a multi-node fully distributed Cloudera Hadoop cluster configured with "MySQL" as external database. We will also configure our cluster to authenticate using Kerberos and authorize using OpenLDAP as additional security implementations.


Operating System: CentOS-6.8
Cloudera Manager Version: 5.9.1
CDH Version: 5.9.1

We will create three virtual machines as namenode, datanode1 and datanode2 respectively using VirtualBox. Cloning of virtual machines can also be performed in order to save time. Below is the required overall configuration of virtual machines.

No of network interfaces
Adapter1 (Private)
Adapter2 (Public)
IP Address (Private)
IP Address (Public)

Note: Internet connectivity is required for namenode and the above mentioned configuration serves the purpose for our setup.

i) Operating System Configuration

OS Partitioning
We are following the below partition table for our lab setup and this is not the recommended partitioning scheme to be followed in standard installation environments. The partition table layout varies upon the requirement.

OS Local Repository Configuration
Location: Namenode (nn.cdh.demo)

Starting the HTTP service and persisting it across reboots.
# service httpd start
# chkconfig httpd on

Mount the CentOS-6.8 ISO image in the namenode.

Create a directory in "/var/www/html" directory
# mkdir /var/www/html/CentOS-6.8

Copy the contents of the DVD to "/var/www/html" directory
# rsync -arvp --progress /media/CentOS_6.8_Final/* /var/www/html/CentOS-6.8/

Location: All nodes
Creating the repo file
# vi /etc/yum.repos.d/centos68local.repo

Location: All nodes
Move the previous OS repo files to a backup location
# mkdir /etc/yum.repos.d/repobkp
# mv /etc/yum.repos.d/CentOS-* /etc/yum.repos.d/repobkp

# yum repolist

Open a browser on any node and check that httpd server is working for CentOS-6.8 local repository.

Dependency Installation
Location: All nodes

The following dependencies need to be installed before we proceed further. The following packages are installed based upon our requirement for the installation of CDH. It is not a standard practice to install the same packages and it may vary according to the individual or customer requirement.
# yum install openssh* openssl* httpd elinks epel-release pssh createrepo wget ntp ntpdate ntp-doc yum-utils mod_ssl
# yum groupinstall "Development Tools"
# yum update -y

Disabling Firewall
Location: All nodes

It is recommended to disable firewall. Run the below command to start setup utility.

# setup

# service iptables stop
# service ip6tables stop
# chkconfig iptables off
# chkconfig ip6tables off

Disabling Network Manager
# service NetworkManager stop
# chkconfig NetworkManager off

Disabling Selinux
Location: All nodes

It is recommended to disable SElinux
# vi /etc/sysconfig/selinux

Configuring File Descriptor & ulimits
Location: All nodes

The recommended value of open file descriptors is "10000" or more. If the values are not greater than or equal to 10000, run the below to set the value to "10000".

Verifying the ulimit values.
# ulimit -Sn
# ulimit -Hn

Configuring the ulimit value.
# ulimit -n 10000

Configuring hosts file
Location: All nodes

The /etc/hosts file should be edited in below format.
<ip-address>  <FQDN>  <Short Name>

A sample /etc/hosts file is shown below.
# vi /etc/hosts   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6  nn.cdh.demo     nn  dn1.cdh.demo    dn1  dn2.cdh.demo    dn2

NTP Configuration
Location: All nodes

The clocks on all the nodes in the cluster must synchronize with each other for time synchronization across the cluster.
# service ntpd start
# chkconfig ntpd on
# ntpdate

Set the date using below example.
# date -s "01 MAY 2017 00:20:00"
# hwclock -w
# hwclock -r
# date

SSH Configuration (Optional)
Location: Namenode (nn.cdh.demo)

Everytime we login to a server, it asks for permission to login to the new host. To avoid answering "yes/no", we can modify "StrictHostKeyChecking" parameter value to "no" in ssh_config file as shown below.
# vi /etc/ssh/ssh_config
StrictHostKeyChecking no

OS Kernel Tuning
Location: All nodes

Disable Host Swappiness
This linux parameter controls how aggressively memory pages are swapped to disk. The value can be between 0-100. The higher the value, the more aggressive is kernel in swapping out inactive memory pages to disk which can lead to issues like lengthy garbage collection pauses for important system daemons because swap space is much slower than memory as it is backed by disk instead of RAM. Cloudera recommends setting this parameter to "0", but it has been found that in recent kernels setting this to "0" causes out-of-memory issues
(Ref: Link)

To change the swappiness value to "10" edit the "sysctl.conf" file as mentioned below.
# vi /etc/sysctl.conf

Disable Transparent Huge Pages Compaction
THP is known to cause 30% CPU overhead and can seriously degrade system performance. It is recommended by both Cloudera and Hortonworks to disable THP to reduce the amount of system CPU utilization  on the worker nodes.
(Ref: Link1/Link2)

Run the below mentioned command to disable transparent huge pages.
# vi /etc/rc.local
# vi /etc/rc.local
test -f /sys/kernel/mm/transparent_hugepage/enabled; then echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

test -f /sys/kernel/mm/transparent_hugepage/defrag; then echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Verification of defined kernel specific parameters
# sysctl -p | grep vm.swappiness
# cat /sys/kernel/mm/transparent_hugepage/defrag

Improve Virtual Memory Usage
The vm.dirty_background_ratio and vm.dirty_ratio parameters control the percentage of system memory that can be filled with memory pages that still need to be written to disk. Ratios too small force frequent IO operations, and too large leave too much data stored in volatile memory, so optimizing this ration is a careful balance between optimizing IO operations and reducing risk of data loss.

# vi /etc/sysctl.conf

Configure CPU Performance Scaling
CPU Scaling is configurable and defaults commonly to favor power saving over performance. For Hadoop clusters, it is important that we configure then for better performance over other options.
# cpufreq-set -r -g performance
# vi /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

ii) Installing VirtualBox Guest Additions

Location: All nodes

The VirtualBox Guest Additions consist of device drivers and system applications that optimize the operating system for better performance and usability. Proceeding without installing VirtualBox Guest Additions is safe and does not hamper our proceedings but still we recommend to do so for convenient usage purpose of the user.
# cd /media/VBOXADDITIONS_5.1.14_112924/
# ./

iii) VM Cloning

Note down the MAC address
# vi /etc/sysconfig/network
# vi /etc/sysconfig/hosts
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
Change ip and comment UUID & HWADDR
# vi /etc/udev/rules.d/70-persistent-net.rules
Comment the old MAC addresses and rename the interface names to eth0 & eth1

If the machines are not cloned and installed separately, it might be tedious work copying files and configuring parameters on each and every machine exclusively. Though we can use a for loop for the same, but the parallel shell comes handy with the operating system and can be used for the purpose conveniently as an alternate.

# pscp.pssh -h hosts -l root /etc/hosts /etc
# pssh -h hosts -i -l root hostname

iv) Passwordless-SSH Configuration

Location: Namenode (nn.cdh.demo)

After the cloning of virtual machines, we need to setup a passwordless-SSH environment exclusively for "root" user on the namenode. Configuring passwordless-SSH environment is a mandatory requirement of namenode to start various hadoop daemons across the cluster. Run the below mentioned commands to setup passwordless-SSH.

Generate SSH key for root user
# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/
The key fingerprint is:
45:9e:72:ac:72:f8:57:44:c1:bc:8e:42:c9:64:97:19 root@cdh1.demo.lab
The key's randomart image is:
+--[ RSA 2048]----+
|          E*o.   |
|        o++oo    |
|       +.o* ..   |
|       .+= ..    |
|      o.S  o.    |
|       +. ...    |
|        ...      |
|         .       |
|                 |

Copy the SSH key to all other nodes
# ssh-copy-id -i /root/.ssh/ root@dn1.cdh.demo
# ssh-copy-id -i /root/.ssh/ root@dn2.cdh.demo

Self passwordless-SSH for namenode
# ssh-copy-id -i /root/.ssh/ root@nn.cdh.demo

Testing Passwordless-SSH Environment
Try logging into the datanodes from namenode machine. Also check the self passwordless-SSH into namenode. All the listed below commands should login into the respective machines without asking for a password.
[root@nn1 ~]# ssh nn1
[root@nn1 ~]# ssh dn1
[root@nn1 ~]# ssh dn2

v) Java Installation Using Alternatives

Location: All nodes

Check the java prerequisites for your cdh version from this link. Currently our version supports minimum version Oracle JDK1.8_31 and Oracle JDK1.7_55. The excluded versions are Oracle JDK1.8_40/JDK1.8_45. Cloudera Manager can self install Oracle JDK1.7_67 during installation and upgrade.

Download java from the this link and place the java tarball in /opt directory.
# cd /opt
# tar -xzvf jdk-8u121-linux-x64.tar.gz
alternatives --install /usr/bin/java java /opt/jdk1.8.0_121/bin/java 300000

Select the latest installed java version as default java using below command.
# alternatives --config java

Setting paths
# vi /etc/profile.d/
export JAVA_HOME=/opt/jdk1.8.0_121/
export PATH=$PATH:$JAVA_HOME/bin

Execute the java profile
# source /etc/profile

# echo $JAVA_HOME

Verifying Java Installation
Run the commands mentioned in the screenshot to identify the Java version.

vi) Identifying Python Installation

A minimum version of Python 2.4 is required by cloudera manager. But many other components of Hadoop are dependent upon Python 2.6. Currently the cloudera manager does not support Python 3.0 and higher.

Issue the command listed in the below screenshot to verify the python version.


Location: Namenode (nn.cdh.demo)

i) Installation Using Internet based YUM

Cloudera Manager & other big data components like HIVE, OOZIE & HUE, all require SQL based datastore to store their metadata. For production cluster it is recommended to use external database instead of embedded database. For database version and compatibility matrix you need to check Cloudera official version specific documents. It is also recommended to have one DB instance per cluster, but in production if you have HA configured for your DB instances you can use that DB for multiple cluster setups. Requirement varies based on scenario.

If OS default MySQL is not supported then download required version of MySQL rpm bundle from here. Whereas you can also install MySQL server packages via YUM or RPM.

Before going forth, check the version of MySQL that is supported by CDH 5.9.1. Click here to find the compatible version. The compatible versions of MySQL that are supported by CDH 5.9.1 are 5.7/5.6/5.5/5.1. By default MySQL 5.1 comes from the operatin system repository. Below is the Cloudera supported database matrix for further information.

Download the relevant YUM repository rpm for CentOS 6.8 from this link or alternatively using wget command as mentioned below.
# wget -c

Install the repository rpm
# rpm -ivh mysql57-community-release-el6-9.noarch.rpm

Now we have to make sure that the MySQL 5.6 repository is properly configured.
Edit the "mysql-community.repo" file in /etc/yum.repos.d/ location. Ensure that "MySQL Connectors Community", "MySQL Tools Community" & "MySQL 5.6 Community Server" repositories are only enabled and rest all are disabled.  Below shown is a sample file for reference.

# cat /etc/yum.repos.d/mysql-community.repo
name=MySQL Connectors Community

name=MySQL Tools Community

# Enable to use MySQL 5.5
name=MySQL 5.5 Community Server

# Enable to use MySQL 5.6
name=MySQL 5.6 Community Server

name=MySQL 5.7 Community Server

name=MySQL 8.0 Community Server

name=MySQL Tools Preview

The YUM repository is now updated and should show the MySQL 5.6 repository.
# yum repolist

# yum install mysql-community-server
# service mysqld start
# chkconfig mysqld on

ii) Configuration of MySQL Database

Next we will configure MySQL as an external database for our setup.

The below command output is trucated intentionally.
# mysql_secure_installation
Enter current password for root (enter for none):
Set root password? [Y/n] y
New password:
Re-enter new password:
Password updated successfully!
Reloading privilege tables..
 ... Success!
Remove anonymous users? [Y/n] y
 ... Success!
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y
 ... Success!
All done!  If you've completed all of the above steps, your MySQL
installation should now be secure.
Thanks for using MySQL!
Cleaning up…

Login into the MySQL database and create a test database.
# mysql -u root -p
mysql> create database test DEFAULT CHARACTER SET utf8;
mysql> grant all on test.* TO 'test'@'%' IDENTIFIED BY 'test';

Creating Database for Activity Monitor
Database Name: amon
Username: amon
Password: amon

mysql> create database amon DEFAULT CHARACTER SET utf8;
mysql> grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon';

Creating Database for Reports Manager
Database Name: rman
Username: rman
Password: rman

mysql> create database rman DEFAULT CHARACTER SET utf8;
mysql> grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'rman';

Creating Database for Hive Metastore Server
Database Name: metastore
Username: hive
Password: hive

mysql> create database metastore DEFAULT CHARACTER SET utf8;
mysql> grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive';

Creating Database for Sentry Server
Database Name: sentry
Username: sentry
Password: sentry

mysql> create database sentry DEFAULT CHARACTER SET utf8;
mysql> grant all on sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry';

Creating Database for Cloudera Navigator Audit Server
Database Name: nav
Username: nav
Password: nav

mysql> create database nav DEFAULT CHARACTER SET utf8;
mysql> grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'nav';

Creating Database for Cloudera Navigator Metadata Server
Database Name: navms
Username: navms
Password: navms

mysql> create database navms DEFAULT CHARACTER SET utf8;
mysql> grant all on navms.* TO 'navms'@'%' IDENTIFIED BY 'navms';

Creating Database for Oozie
Database Name: oozie
Username: oozie
Password: oozie

mysql> create database oozie;
mysql> grant all privileges on oozie.* to 'oozie'@'nn.cdh.demo' identified by 'oozie';
mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';

Creating Database for Hue
Database Name: hue
Username: hue
Password: hue

mysql> create database hue;
mysql> grant all privileges on hue.* to 'hue'@'nn.cdh.demo' identified by 'hue';

Run the below commands to verify the MySQL 5.6 installation.

Login into MySQL shell
# mysql -u root -p

mysql> show tables;
mysql> show databases;

iii) Installation of MySQL JDBC Driver

Location: Namenode (nn.cdh.demo)

The MySQL JDBC driver needs to be installed on all nodes. The MySQL 5.6 requires driver version 5.1.26 or higher. Hence, we shall install driver version 5.1.40.

Download the driver from this link. Alternatively we can download the driver directly using the below wget command.
# wget -c
# tar -xzvf mysql-connector-java-5.1.40.tar.gz
# cd mysql-connector-java-5.1.40
# ln -s mysql-connector-java-5.1.40-bin.jar /usr/share/java/mysql-connector-java.jar


Cloudera Manager can be installed using both local yum repository as well as internet based repositories. We will install using local repository.

i) Cloudera Manager Local YUM Repository Configuration

Location: Namenode (nn.cdh.demo)

Generally a local yum repository is configured to save both bandwidth and time.

Create a directory in "/var/www/html" location and download the rpms.
# mkdir -p /var/www/html/CDH591/cm/5.9.1
# cd /var/www/html/CDH591/cm/5.9.1

Download the rpms for Cloudera Manager 5.9.1 using the below commands.
# wget -c

# wget -c

# wget -c

# wget -c

# wget -c
# wget -c
# wget -c
# wget -c
# createrepo .

Open a browser and check that httpd server is working for Cloudera Manager 5.9.1 local repository.

Location: All nodes

Create repo file on all nodes that will be used for Cloudera Manager daemons installation.

# vi /etc/yum.repos.d/cm591.repo

Repository Verification
# yum repolist

Preparing Parcels Repository for later use
Location: Namenode (nn.cdh.demo)

# mkdir /var/www/html/cloudera591/parcels
# cd /var/www/html/cloudera591/parcels
# wget -c
# wget -c
# wget -c

Open a browser and verify the http server working

ii) Cloudera Manager 5.9.1 Installation

We will begin installing Cloudera Manager version 5.9.1.
Location: Namenode (nn.cdh.demo)

# cd /var/www/html/CDH591/cm/5.9.1
# yum install cloudera-manager-server

iii) Cloudera Manager Configuration

Create SCM Database in MySQL Database

Location: Namenode (nn.cdh.demo)

# /usr/share/cmf/schema/ mysql -h nn.cdh.demo -u root -p --scm-host nn.cdh.demo scm scm scm

(Note: Skipping this step throws "cloudera-scm-server dead but pid file exists" error)

Starting Cloudera Manager Server Daemon
# service cloudera-scm-server start

Once the Cloudera scm server daemon has started, you will notice the 7180 port running. Run the below command to verify.
# netstat -ntulp | grep 7180

iv) CDH 5.9.1 Installation

Finally, point your browser to the below mentioned address

Enter the login credentials as mentioned below.
Username: admin
Password: admin

Click on "Continue" tab.

Select "Cloudera Enterprise Data Hub Edition Trial" and click on "Continue".

Click on "Continue".

Specify the hosts and click on "Search".

After the namenode and datanodes are detected, select all nodes and click on "Continue".

We will install CDH using local parcel repository.

Click on "More Options" to configure parcel repository paths are related settings. Click "Save Changes" and click on "Continue".

If you still haven't installed Oracle JDK, the installation wizard will automatically install it in all nodes. Since, we have already installed the latest version of Oracle JDK we will click on "Continue". It is always recommended to install the latest Oracle JDK prior to installing CDH.

We can configure CDH to run its various components like HDFS, HBase, Hive, etc using a single user mode. But we will go for distinct users for various CDH components. Click on "Continue" to proceed further.

The next screen asks to provide the SSH login credentials and in our case we will provide "root" user credentials. Click "Continue" to proceed.

Installation of CDH begins in this step. Check for the logs in "/var/log/".

Click "Continue" to go forth.

After the selected parcels have been successfully downloaded, distributed and activated in all nodes, click on "Continue".

Verify that there are no pending tasks on the Validations page. You can correct the errors and click on "Run Again" to recheck the validations.

Go through the versions of various Hadoop components that are going to be installed and click on "Finish".

Choose the services that you want to install. We have selected "Custom Services" where we will install only HDFS, MRv2, ZooKeeper, Oozie & Hive. We can always go for installation of additional components as per requirement. Click on "Continue".

Customize the roles for each node as per your requirement and click on "Continue".

Setup the databases for the respective Hadoop components.

Review the configuration and click on "Continue".

Now the Hadoop services will start to run for the first time on the cluster.

After the “First Run Command” has successfully completed, the Cloudera Manager Admin Console will open up for administration activities like configuring, managing and monitoring CDH.

Now the installation of multi-node Cloudera cluster is complete.

Related Posts:

1 comment:

  1. Hey, Wow all the posts are very informative for the people who visit this site. Good work! We also have a Website. Please feel free to visit our site. Thank you for sharing. Well written article Thank You for Sharing with Us pmp training institute in chennai | pmp training class in chennai | pmp training near me |project management training in chennai | project management certification online | project management course online |