Rapid deployment of Hadoop clusters using Cloudera Manager

This blog post describes my experience of using Cloudera Manager to install and provision Hadoop clusters. Installing Hadoop involves many factors to consider and many decisions to make, starting with the type of cluster you want, the number of master servers to run master daemons such as the NameNode, ResourceManager, HiveServer2 and the …

Offline analysis of HDFS metadata

HDFS is part of the core Hadoop ecosystem and serves as the storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture in which the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the NameNode …

Using Tiered Storage in Alluxio

Alluxio is an open source, memory-speed virtual distributed storage system. A brief overview of Alluxio was given in a previous blog post. This post covers one of Alluxio's most powerful features: its tiered storage capabilities. Tiered storage allows an Alluxio volume to be expanded beyond memory alone. …

Experiences of Using Alluxio with Spark

Alluxio refers to itself as an "Open Source Memory Speed Virtual Distributed Storage" platform. It sits between the storage and processing framework layers in the distributed computing ecosystem and claims to significantly improve performance when multiple jobs read from or write to the same data. This post will cover some of the basic features of Alluxio and will …

Real-time visualisation of Hadoop resources

At CERN we run multiple Hadoop clusters to satisfy the demanding requirements of our experiment and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users look to Hadoop to process and archive the vast amounts of data coming out of the LHC. Sometimes we, as Hadoop administrators, are faced with …

Integrating Hadoop and Elasticsearch – Part 2 – Querying and Writing to Elasticsearch from Apache Spark

In part 2 of the 'Integrating Hadoop and Elasticsearch' blog post series we look at bridging Apache Spark and Elasticsearch. I assume that you have access to Hadoop and Elasticsearch clusters and are faced with the challenge of bridging these two distributed systems. As Spark code can be written in Scala, Python and Java, …