Introduction
This blog post demonstrates how to connect to SQL databases using the Apache Spark JDBC datasource. This allows us to process data from HDFS and from SQL databases such as Oracle and MySQL in a single Spark SQL query. Apache Spark SQL includes a jdbc datasource that can read from (and write to) SQL databases. In all the examples
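
As a rough illustration of the JDBC datasource described above, the sketch below builds the option map and hands it to Spark's built-in `jdbc` reader. The helper names `jdbc_options` and `read_jdbc_table`, and the host, database, table and credentials, are hypothetical placeholders and do not come from the original post. The matching JDBC driver jar is assumed to be on Spark's classpath.

```python
from typing import Dict, Optional


def jdbc_options(url: str, table: str, user: str, password: str,
                 driver: Optional[str] = None) -> Dict[str, str]:
    """Build the option map for Spark's built-in "jdbc" data source."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if driver is not None:
        # JDBC driver class name; the driver jar must be on Spark's classpath.
        opts["driver"] = driver
    return opts


def read_jdbc_table(spark, opts: Dict[str, str]):
    """Load a database table as a DataFrame.

    `spark` is an existing SparkSession created by the caller.
    """
    return spark.read.format("jdbc").options(**opts).load()


# Hypothetical connection details, for illustration only.
mysql_opts = jdbc_options(
    url="jdbc:mysql://dbhost:3306/sales",
    table="customers",
    user="spark_user",
    password="change-me",
    driver="com.mysql.cj.jdbc.Driver",
)
```

With a running SparkSession, `read_jdbc_table(spark, mysql_opts).createOrReplaceTempView("customers")` would register the table so that it can be joined with HDFS-backed tables in one Spark SQL query.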


Introduction
This blog post attempts to analyse UK house price data from 2016 with multiple objectives in mind: a fun way to get started with the pandas library as well as the plotting libraries seaborn and cartopy; an attempt to identify trends; opportunities for value investing; and input to buying decisions in an unknown place. Load data into pandas dataframe and

Introduction
This blog post is about my experience of using Cloudera Manager for installing and provisioning Hadoop clusters. While installing Hadoop, there are many factors to consider and many decisions to be made, starting with the type of cluster you want, the number of master servers to run master daemons like namenode, resource manager, hiveserver2 and the

Introduction
HDFS is part of the core Hadoop ecosystem and serves as a storage layer for Hadoop computational frameworks such as Spark and MapReduce. Like other distributed file systems, HDFS is based on an architecture where the namespace is decoupled from the data. The namespace contains the file system metadata, which is maintained by a dedicated server called the namenode

Introduction
Alluxio is an open source, memory-speed virtual distributed storage system. A brief overview of Alluxio was covered in a previous blog post. This post covers one of the most powerful features of Alluxio: its tiered storage capabilities. Tiered storage allows the Alluxio volume to be expanded beyond just memory.

Introduction
Alluxio refers to itself as an "Open Source Memory Speed Virtual Distributed Storage" platform. It sits between the storage and processing framework layers in the distributed computing ecosystem and claims to greatly improve performance when multiple jobs are reading from and writing to the same data. This post will cover some of the basic features of Alluxio and will