Connecting Apache Spark and SQL databases

This blog post demonstrates how to connect to SQL databases using the Apache Spark JDBC data source. This allows us to process data from HDFS and SQL databases such as Oracle and MySQL in a single Spark SQL query. Apache Spark SQL includes a JDBC data source that can read from (and write to) SQL databases. In all the examples …


Rapid deployment of Hadoop clusters using Cloudera Manager

This blog post is about my experience of using Cloudera Manager to install and provision Hadoop clusters. When installing Hadoop there are many factors to consider and many decisions to make, starting with the type of cluster you want, the number of master servers to run master daemons like NameNode, ResourceManager, HiveServer2 and the …

Offline analysis of HDFS metadata

HDFS is part of the core Hadoop ecosystem and serves as a storage layer for Hadoop computational frameworks like Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture where the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the NameNode …

Using Tiered Storage in Alluxio

Alluxio is an open source, memory-speed virtual distributed storage system. A brief overview of Alluxio was given in a previous blog post. This post covers one of the most powerful features of Alluxio: its tiered storage capabilities. Tiered storage allows the Alluxio volume to be expanded beyond just memory. …

Experiences of Using Alluxio with Spark

Alluxio refers to itself as an "Open Source Memory Speed Virtual Distributed Storage" platform. It sits between the storage and processing framework layers in the distributed computing ecosystem and claims to significantly improve performance when multiple jobs are reading from and writing to the same data. This post will cover some of the basic features of Alluxio and will …

Real-time visualisation of Hadoop resources

At CERN we run multiple Hadoop clusters to satisfy demanding requirements from our experiment and accelerator communities. The usage and criticality of the clusters are increasing dramatically as more users look to Hadoop to process and archive the vast amounts of data coming out of the LHC. Sometimes, we as Hadoop administrators are faced with …