Rapid deployment of Hadoop clusters using Cloudera Manager

Introduction

This blog post is about my experience of using Cloudera Manager for installing and provisioning Hadoop clusters. When installing Hadoop there are many factors to consider and many decisions to be made, starting with the type of cluster you want, the number of master servers to run master daemons such as the NameNode, ResourceManager and HiveServer2, and the choice of a database to store metadata for the Cloudera Management Service, the Hive metastore and so on. This post documents the choices made and the challenges faced.

Why Cloudera Manager?

The Hadoop framework consists of several modules (HDFS, YARN, MapReduce, Spark, Hive, Impala, etc.), each with its own set of configuration files, and it is up to administrators to ensure that these are kept in sync across the nodes in the Hadoop cluster. In addition, if you expand the cluster later on, the new machines may well have a different hardware spec that requires different configuration. Configuration management of a Hadoop cluster is therefore a complex task, and you will most likely need a configuration management tool: either a generic one such as Puppet or Chef, or a dedicated, purpose-built Hadoop cluster management tool such as Cloudera Manager or Apache Ambari.

Assigning roles to the machines

Apache Hadoop components are typically designed with a master-slave architecture: the masters do the coordination and ensure consistency, and the slaves are the workhorses of the cluster. The cluster can survive the loss of a worker node, but losing a master means full downtime for the service. So it is generally advisable to have 3 nodes classed as masters, for load balancing and general high availability of the master daemons, and the rest as slaves/workers. The following table lists the master and worker daemons for commonly used Hadoop components.

Component   Master                           Worker
HDFS        NameNode                         DataNode
YARN        ResourceManager                  NodeManager
Hive        HiveServer2, Hive Metastore      Client Gateway
Impala      Catalog and StateStore daemons   Impala daemon
HBase       Master                           RegionServers

 

Processing frameworks like MapReduce and Spark typically have utilities such as the Job History Server, which again you can run on the masters to ensure high availability. For a typical production Hadoop cluster you should have 3 machines running master daemons and the rest as workers.

Typical production cluster:

3 machines for running master daemons
1 machine for running Cloudera Manager
the rest as workers / DataNodes

Typical test cluster:

1 machine for running master daemons
1 machine for running Cloudera Manager
the rest as workers / DataNodes

For this test case, I have

cdhcm – running Cloudera Manager
cdhnn – running master daemons
cdhd[1-5] – running worker daemons

Preparation of the machines

Hadoop can run on a wide range of Linux flavors; the full list of supported operating systems can be found in the Cloudera documentation.

Configure Swap

You can configure swap on CentOS 7 as below; change the location and size as per your needs.

# create a 4 GiB swap file (4096 blocks of 1 MiB) and restrict access to root
dd if=/dev/zero of=/swapfile count=4096 bs=1MiB
chmod 600 /swapfile
ls -lh /swapfile
# format the file as swap space and enable it
mkswap /swapfile
swapon /swapfile
# verify that swap is active
swapon -s;free -m
# make the swap file persistent across reboots
echo "/swapfile swap  swap  sw 0  0" >> /etc/fstab

Swappiness

vm.swappiness controls how aggressively memory pages are swapped to disk. It is recommended to set this value to between 10 and 30.

# check the current value
cat /proc/sys/vm/swappiness
# set it to 10 for the running system; add vm.swappiness=10 to /etc/sysctl.conf to persist it across reboots
sysctl -w vm.swappiness=10

Configure filesystem

The underlying filesystem for HDFS can be ext3, ext4 or XFS.

# format the data disk (ext4 in this example) and create the DataNode data directory
mkfs -t ext4 /dev/vdb
mkdir -p /df/dn
# mount with noatime/nodiratime to avoid unnecessary metadata writes on every read
echo "/dev/vdb /df/dn ext4 rw,noatime,nodiratime 0 0" >> /etc/fstab
mount /df/dn

Install Java

For the CDH distribution you must install Oracle Java 1.7 or 1.8, as below:

yum install oracle-j2sdk1.7

Others

If you enable iptables then you need to ensure that all the ports required for the Hadoop components are open, and you also need to open a range of ports for the Apache Spark driver to communicate with its executors.
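
As a rough sketch (this is an example only: the exact ports depend on your CDH version and the services you run, the 33000-33999 range is an arbitrary choice that you would also have to configure on the Spark side via spark.driver.port and spark.blockManager.port, and it assumes the iptables service rather than firewalld is in use):

# example only: open the ports you actually use, plus a fixed range for the Spark driver
iptables -A INPUT -p tcp --dport 8020 -j ACCEPT          # HDFS NameNode RPC
iptables -A INPUT -p tcp --dport 50070 -j ACCEPT         # HDFS NameNode web UI
iptables -A INPUT -p tcp --dport 8088 -j ACCEPT          # YARN ResourceManager web UI
iptables -A INPUT -p tcp --dport 10000 -j ACCEPT         # HiveServer2
iptables -A INPUT -p tcp --dport 33000:33999 -j ACCEPT   # range reserved for the Spark driver/block manager
service iptables save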

Installing packages with Puppet

Cloudera Manager provides several ways to install a Hadoop cluster: a) you can use parcels, b) you can use packages and let Cloudera Manager do both the installation and the configuration, or c) you can install the packages on the master and worker nodes manually or using Puppet/Chef and let Cloudera Manager perform only the configuration of the cluster. From my experience I found, and recommend, that installing the Hadoop packages with Puppet (or manually) and configuring the cluster with Cloudera Manager provides better control and stability. This approach clearly defines the role of Cloudera Manager, and in future you can more easily move away from it if required. For a typical cluster with HDFS, YARN, ZooKeeper, Hive and Spark you should install the following packages manually or using Puppet.

namenode: cdhnn

yum install hadoop-hdfs-namenode
yum install hadoop-yarn-resourcemanager
yum install hadoop-yarn-proxyserver
yum install hadoop-mapreduce-historyserver
yum install hive-server2 hive-metastore hive
yum install spark-core spark-python spark-history-server
yum install zookeeper

datanodes: cdhd[1-5]

yum install hadoop-hdfs-datanode
yum install hadoop-yarn-nodemanager
yum install hadoop-mapreduce
yum install hive
yum install spark-core spark-python spark-history-server
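
If you are not using Puppet, a minimal bash sketch along the following lines (assuming passwordless root SSH to the workers; the hostnames are the ones from the layout above) can push the worker packages out:

# install the worker packages on cdhd1..cdhd5 over ssh
for h in cdhd{1..5}; do
  ssh root@"$h" 'yum install -y hadoop-hdfs-datanode hadoop-yarn-nodemanager hadoop-mapreduce hive spark-core spark-python spark-history-server'
done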

Configuring the Hadoop cluster using Cloudera Manager

Installing Cloudera Manager

You need to install cloudera-manager-server on cdhcm and cloudera-manager-agent on all the Hadoop nodes.

On the Cloudera Manager server

yum install cloudera-manager-server

On all the Hadoop nodes

yum install cloudera-manager-agent
Modify server_host and server_port in /etc/cloudera-scm-agent/config.ini to point to the management server and port 7182 respectively.
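
A minimal sketch of that change (cdhcm is the Cloudera Manager host from the layout above; 7182 is the default agent-to-server port):

# point the agent at the Cloudera Manager server and restart it
sed -i 's/^server_host=.*/server_host=cdhcm/; s/^server_port=.*/server_port=7182/' /etc/cloudera-scm-agent/config.ini
/etc/init.d/cloudera-scm-agent restart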

Preparing the database for metadata

You need external database schemas in Oracle/MySQL/PostgreSQL to store metadata for the following components:

Cloudera Manager server (cdh_mgr)

Cloudera Management Service:

Activity Monitor (cdh_amon)
Reports Manager (cdh_rman) – requires enterprise license
Audit Server (cdh_aud) – requires enterprise license
Metadata Server (cdh_meta) – requires enterprise license

Hive Metastore (cdh_hive)

I chose Oracle to host these schemas as our group also runs an Oracle service. The main criterion for choosing the database type should be availability: the loss of the metastore directly impacts the availability of Hive and Spark. The required schemas are created as below.

create user cdh_mgr identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_mgr;
grant CREATE TABLE to cdh_mgr;
grant CREATE SEQUENCE to cdh_mgr;
grant CREATE INDEX to cdh_mgr;
grant ALTER TABLE to cdh_mgr;
grant ALTER INDEX to cdh_mgr;
-- you need this additional privilege, it's not mentioned in the documentation
GRANT EXECUTE ON SYS.DBMS_LOB TO cdh_mgr;
alter user cdh_mgr quota 500m on TOOLS;

create user cdh_amon identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_amon;
grant CREATE TABLE to cdh_amon;
grant CREATE SEQUENCE to cdh_amon;
grant CREATE INDEX to cdh_amon;
grant ALTER TABLE to cdh_amon;
grant ALTER INDEX to cdh_amon;
alter user cdh_amon quota 500m on TOOLS;

create user cdh_rman identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_rman;
grant CREATE TABLE to cdh_rman;
grant CREATE SEQUENCE to cdh_rman;
grant CREATE INDEX to cdh_rman;
grant ALTER TABLE to cdh_rman;
grant ALTER INDEX to cdh_rman;
alter user cdh_rman quota 500m on TOOLS;

create user cdh_aud identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_aud;
grant CREATE TABLE to cdh_aud;
grant CREATE SEQUENCE to cdh_aud;
grant CREATE INDEX to cdh_aud;
grant ALTER TABLE to cdh_aud;
grant ALTER INDEX to cdh_aud;
alter user cdh_aud quota 500m on TOOLS;
GRANT EXECUTE ON sys.dbms_crypto TO cdh_aud;
GRANT CREATE VIEW TO cdh_aud;

create user cdh_meta identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_meta;
grant CREATE TABLE to cdh_meta;
grant CREATE SEQUENCE to cdh_meta;
grant CREATE INDEX to cdh_meta;
grant ALTER TABLE to cdh_meta;
grant ALTER INDEX to cdh_meta;
alter user cdh_meta quota 500m on TOOLS;

create user cdh_hive identified by XXXXXXXXX profile APP_PROFILE default tablespace TOOLS;
grant CREATE SESSION to cdh_hive;
grant CREATE TABLE to cdh_hive;
grant CREATE SEQUENCE to cdh_hive;
grant CREATE INDEX to cdh_hive;
grant ALTER TABLE to cdh_hive;
grant ALTER INDEX to cdh_hive;
alter user cdh_hive quota 500m on TOOLS;

Configuring Cloudera Management Service

Run the following to prepare the database:

/usr/share/cmf/schema/scm_prepare_database.sh -h db_host_name -P db_port oracle db_service_name db_user db_password
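
For example, for the cdh_mgr schema created above (the host name oradb and the service name cdhdb are placeholders for your own Oracle database; 1521 is the default listener port):

/usr/share/cmf/schema/scm_prepare_database.sh -h oradb -P 1521 oracle cdhdb cdh_mgr XXXXXXXXX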

Ensure that the Java security policy is properly configured, otherwise the script simply hangs and you will receive a 'connection refused' error message.

In $JAVA_HOME/jre/lib/security/java.security change securerandom.source=file:/dev/urandom to securerandom.source=file:///dev/urandom.
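
A quick way to apply that change (a sketch, assuming JAVA_HOME points at the Oracle JDK installed earlier):

# switch securerandom.source to the non-blocking urandom form
sed -i 's|securerandom.source=file:/dev/urandom|securerandom.source=file:///dev/urandom|' $JAVA_HOME/jre/lib/security/java.security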

Before proceeding further, ensure that there are no error messages from the Cloudera Management Service.
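
A quick check is to tail the Cloudera Manager server log (default location for a package install) while it starts up:

# watch for errors while the Cloudera Manager server starts
tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log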

Cloudera Manager Installation wizard

Log in to the Cloudera Manager web interface – http://clouderamanager:7180 – with the default username and password of admin/admin. This brings up the installation wizard, and from here it is a matter of choosing the right options and assigning the right roles/services to your cluster nodes. This is best depicted with screenshots as they are self explanatory.

As you can see, if you have got your initial setup and configuration right, then installing a Hadoop cluster with Cloudera Manager is click-click-click.

Adding new components

Again, adding new components is straightforward.

First install the packages on the cluster machines using Puppet or manually; in my case I installed the Impala packages as below.

On the namenode (cdhnn):

yum install impala-state-store impala-catalog

On the datanodes (cdhd[1-5]):

yum install impala impala-shell

Then you can configure and start the service using Cloudera Manager, which again is best explained using screenshots.

Issues encountered

My installation and configuration experience was not without trouble; I faced a fair number of issues while setting up the cluster. I list them below for the benefit of others.

Issue 1

Immediately after installing the Cloudera Manager agent I saw the following error message in the log file:

ValueError: too many values to unpack

Solution:

This error message is due to Cloudera Manager's inability to handle the output of the alternatives command.
The following commands identify the alternatives with malformed output:

# dump every configured alternative and its target paths
alternatives --list | awk '{print $1}' | xargs -n1 alternatives --display |grep "^/" > all_alternatives.txt
# flag any entry whose output does not have the expected 4 columns (these are the ones the agent chokes on)
awk -F ' ' -v NCOLS=4 'NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}' all_alternatives.txt
cat all_alternatives.txt

and then remove whatever is flagged in all_alternatives.txt, or uninstall those packages if they are not required:

alternatives --remove jre_openjdk /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64/jre
alternatives --remove java /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64/jre/bin/java
alternatives --remove jre_1.7.0 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64/jre
alternatives --remove jre_1.7.0_openjdk /usr/lib/jvm/jre-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64
/etc/init.d/cloudera-scm-agent restart
tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log

More information here on this issue

Issue 2

If you reinstall the Cloudera Manager server then the agents cannot connect, and you see the following error message:

Error, CM server guid updated, expected 4f6cbd92-c545-4e0b-828b-a2e939e14910, received 6b9b1881-5441-40fc-8cab-5fe8648f7856

Solution:

# remove the stale server GUID cached by the agent, then restart it and watch the log
rm -f /var/lib/cloudera-scm-agent/cm_guid
/etc/init.d/cloudera-scm-agent restart
tail -f /var/log/cloudera-scm-agent/cloudera-scm-agent.log

Issue 3

You may encounter 'No CDH version detected' while provisioning the cluster using the wizard.

Solution: check the Cloudera Manager agent log (/var/log/cloudera-scm-agent/cloudera-scm-agent.log) and ensure that there are no error messages.

Issue 4

Sometimes the Cloudera Manager agent may have trouble detecting the Java installation on your machine, as you can see in the screenshot below.

[screenshot: cdh_p1]

In this case you can explicitly declare JAVA_HOME by going to the host -> Configuration -> Java Home Directory.

Issue 5

If you use Oracle for storing Hive metadata then you must use the JDBC driver that is certified for that particular database version. For example, if you are going to use Oracle 11.2.0.4 then you must download the ojdbc6.jar driver built for 11.2.0.4; also note that Oracle 12c is not supported yet. I spent quite some time debugging this issue, as every time the Hive metastore service crashed with OOM errors as below:

javax.jdo.JDOUserException: Exception thrown while loading remaining rows of query
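
As a sketch (the /usr/lib/hive/lib path assumes a package-based install like the one above), the certified driver needs to be placed where the Hive metastore can load it, after which the service is restarted from Cloudera Manager:

# example only: make the 11.2.0.4-certified driver available to the Hive metastore
cp ojdbc6.jar /usr/lib/hive/lib/
chmod 644 /usr/lib/hive/lib/ojdbc6.jar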

Conclusion

In this blog post we have concentrated on getting the right setup for Cloudera Manager for installing Hadoop clusters: a) separating the installation of packages from the configuration of the services, b) using an external database to store metadata for Cloudera Manager and Hive, and c) getting the initial setup right to ease the installation and configuration of the Hadoop services. On top of this you can also add features such as Kerberos security, high availability for master daemons like the NameNode, and rolling upgrades. In my opinion Cloudera Manager shines in standardizing and automating Hadoop cluster deployment and configuration management, with the downside of having to learn yet another tool and the associated costs involved.
