Introduction to HDFS

What is HDFS?

  • HDFS is a distributed file system that is fault-tolerant, scalable, and easy to expand.
  • HDFS is the primary distributed storage for Hadoop applications.
  • Hadoop is written in Java and is supported on all major platforms.
  • HDFS is designed to ‘just work’, but a working knowledge helps with diagnostics and improvements.

Components of HDFS

There are two (and a half) types of machines in an HDFS cluster:

  • NameNode – the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g. which blocks make up a file and on which DataNodes those blocks are stored.
  • DataNode – where HDFS stores the actual data; there are usually quite a few of these.

A single NameNode is a single point of failure in an HDFS cluster. There are HA options available in the form of shared storage for the NameNode using NFS, and the Quorum Journal Manager; I will discuss these in a separate blog post.

Unique features of HDFS

HDFS also has a bunch of unique features that make it ideal for distributed systems:

  • Failure tolerant – data is replicated across multiple DataNodes to protect against machine failures. The default replication factor is 3 (every block is stored on three machines).
  • Scalability – data transfers happen directly with the DataNodes, so read/write capacity scales fairly well with the number of DataNodes.
  • Space – need more disk space? Just add more DataNodes and re-balance.
  • Industry standard – other distributed applications are built on top of HDFS (HBase, MapReduce).
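
The replication factor can also be changed per file after the fact with `hdfs dfs -setrep`. A minimal sketch, assuming a running cluster; the path and target factor here are purely illustrative:

```shell
# Raise the replication factor of a single file to 4, waiting (-w) until
# the extra replicas have been created. Path and factor are illustrative.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -setrep -w 4 /user/cloudera/important.csv
else
  echo "hdfs CLI not found - run this on a cluster node"
fi
```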

HDFS is designed to process large data sets with write-once-read-many semantics; it is not intended for low-latency access.

HDFS – Data Organization

  • Each file written into HDFS is split into data blocks
  • Each block is stored on one or more nodes
  • Each copy of a block is called a replica
  • Default block placement policy:

    1. First replica is placed on the local node (if the writer runs on a DataNode; otherwise a random node)
    2. Second replica is placed in a different rack
    3. Third replica is placed in the same rack as the second replica
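
To get a feel for the numbers involved, the arithmetic for a hypothetical 200 MB file under the defaults described below (64 MB blocks, replication factor 3) works out as follows:

```shell
# A 200 MB file with a 64 MB block size splits into ceil(200/64) = 4 blocks;
# at replication factor 3 the cluster stores 4 * 3 = 12 block replicas.
filesize_mb=200
blocksize_mb=64
replication=3
blocks=$(( (filesize_mb + blocksize_mb - 1) / blocksize_mb ))
echo "$blocks blocks, $(( blocks * replication )) replicas"
# → 4 blocks, 12 replicas
```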

HDFS Configuration

HDFS Defaults

  • Block Size – 64 MB (128 MB from Hadoop 2.x onwards)
  • Replication Factor – 3
  • Web UI Port – 50070

HDFS configuration file – /etc/hadoop/conf/hdfs-site.xml:

  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>

  <property>
    <name>dfs.namenode.http-address</name>
    <value>host_name:50070</value>
  </property>
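
Note that dfs.blocksize is specified in bytes, so the value above sets a 256 MB block size rather than the default:

```shell
# 268435456 bytes converted to megabytes:
echo $(( 268435456 / 1024 / 1024 ))
# → 256
```

Individual writes can also override the block size on the command line with a generic option, e.g. `hdfs dfs -D dfs.blocksize=134217728 -put file /path` for a 128 MB block size on that one file.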

Interfaces to HDFS

  • Java API (FileSystem)
  • C wrapper (libhdfs)
  • HTTP protocol
  • WebDAV protocol
  • Shell Commands

However, the command line is one of the simplest and most familiar.
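
The HTTP interface can likewise be exercised with nothing more than curl via the WebHDFS REST API. A sketch, assuming WebHDFS is enabled and using the quickstart host name from later in this post (substitute your own NameNode host and port):

```shell
# List the HDFS root directory as JSON over WebHDFS; fall back to a
# message when no NameNode is reachable from this machine.
curl -s "http://quickstart.cloudera:50070/webhdfs/v1/?op=LISTSTATUS" \
  || echo "NameNode not reachable"
```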

HDFS – Shell Commands

There are two types of shell commands:

User Commands

  • hdfs dfs – runs filesystem commands on HDFS
  • hdfs fsck – runs an HDFS filesystem checking command

Administration Commands

  • hdfs dfsadmin – runs HDFS administration commands

HDFS – User Commands

    List directory contents

    hdfs dfs -ls /
    hdfs dfs -ls /user
    hdfs dfs -ls -R /var
    

    Display the disk space used by files

    hdfs dfs -du -h /
    hdfs dfs -du /hbase/data/hbase/namespace/
    hdfs dfs -du -h /hbase/data/hbase/namespace/
    hdfs dfs -du -s /hbase/data/hbase/namespace/
    

    Copy data to HDFS

    hdfs dfs -mkdir tdataset
    hdfs dfs -ls
    hdfs dfs -put DEC_00_SF3_P077_with_ann.csv tdataset
    hdfs dfs -ls -R
    echo "blah blah blah" | hdfs dfs -put - tdataset/tfile.txt
    hdfs dfs -ls -R
    hdfs dfs -cat tdataset/tfile.txt
    

    List file attributes and acls

    hdfs dfs -getfacl tdataset/tfile.txt
    hdfs dfs -getfattr -d tdataset/tfile.txt
    

    Removing a file

    hdfs dfs -rm tdataset/tfile.txt
    hdfs dfs -ls -R
    

    List the blocks of a file and their locations

    hdfs fsck /user/cloudera/tdataset/DEC_00_SF3_P077_with_ann.csv -files -blocks -locations
    

    Print missing blocks and the files they belong to

    hdfs fsck / -list-corruptfileblocks
    

HDFS – Administration Commands

    Comprehensive status report of HDFS cluster

    hdfs dfsadmin -report
    

    Prints a tree of racks and their nodes

    hdfs dfsadmin -printTopology
    

    Get the information for a given datanode (like ping)

    hdfs dfsadmin -getDatanodeInfo 10.0.2.15:50020
    

    Refreshes the set of DataNodes that are allowed to connect to the NameNode

    hdfs dfsadmin -refreshNodes
    

Other Interfaces to HDFS

    HTTP Interface

    http://quickstart.cloudera:50070
    

    MountableHDFS – FUSE

    mkdir /home/cloudera/hdfs
    sudo hadoop-fuse-dfs dfs://quickstart.cloudera:8020 /home/cloudera/hdfs
    

    Once mounted, all operations on HDFS can be performed using standard Unix utilities such as ‘ls’, ‘cd’, ‘cp’, ‘mkdir’, ‘find’, and ‘grep’.

I hope this gives you a good introduction to HDFS and some useful HDFS commands to get started with your Big Data project. This is the first in a series of blog posts on the Hadoop ecosystem; bookmark this blog if you are just getting started with it.
