Connecting Apache Spark and SQL databases

Introduction: This blog post demonstrates how to connect to SQL databases using the Apache Spark JDBC datasource. This allows us to process data from HDFS and from SQL databases such as Oracle and MySQL in a single Spark SQL query. Apache Spark SQL includes a JDBC datasource that can read from (and write to) SQL databases. In all the examples …


Analysis and plotting of UK house prices

Introduction: This blog post attempts to analyse UK house price data from 2016 with multiple objectives in mind: a fun way to get started with the pandas library as well as the plotting libraries seaborn and cartopy; trying to identify trends; opportunities for value investing; and input to buying decisions in an unknown place. Load data into pandas dataframe and …

Rapid deployment of Hadoop clusters using Cloudera Manager

Introduction: This blog post is about my experience of using Cloudera Manager to install and provision Hadoop clusters. When installing Hadoop there are many factors to consider and many decisions to make, starting with the type of cluster you want and the number of master servers needed to run master daemons such as the namenode, resource manager, hiveserver2 and the …

Offline analysis of HDFS metadata

Introduction: HDFS is part of the core Hadoop ecosystem and serves as the storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture where the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the namenode …

Using Tiered Storage in Alluxio

Introduction: Alluxio is an open source, memory-speed, virtual distributed storage system. A brief overview of Alluxio was given in a previous blog post. This post covers one of the most powerful features of Alluxio: its tiered storage capabilities. Tiered storage allows the Alluxio volume to be expanded beyond memory alone. …

Experiences of Using Alluxio with Spark

Introduction: Alluxio refers to itself as an "Open Source Memory Speed Virtual Distributed Storage" platform. It sits between the storage and processing framework layers in the distributed computing ecosystem and claims to greatly improve performance when multiple jobs are reading from and writing to the same data. This post will cover some of the basic features of Alluxio and will …