What is HDFS?
- HDFS is a distributed file system that is fault-tolerant, scalable, and easy to expand.
- HDFS is the primary distributed storage for Hadoop applications.
- Hadoop is written in Java and is supported on all major platforms.
- HDFS is designed to ‘just work’; however, a working knowledge of how it operates helps with diagnostics and improvements.
Components of HDFS
There are two (and a half) types of machines in an HDFS cluster:
- NameNode – the heart of an HDFS filesystem; it maintains and manages the file system metadata, e.g. which blocks make up a file and on which DataNodes those blocks are stored.
- DataNode – where HDFS stores the actual data; there are usually quite a few of these.
A single NameNode is a single point of failure in an HDFS cluster. HA options are available in the form of shared storage for the NameNode using NFS, and the Quorum Journal Manager. I will discuss these in a separate blog post.
Unique features of HDFS
HDFS also has a bunch of unique features that make it ideal for distributed systems:
- Failure tolerant – data is replicated across multiple DataNodes to protect against machine failures. The default replication factor is 3 (every block is stored on three machines).
- Scalability – data transfers happen directly with the DataNodes, so read/write capacity scales fairly well with the number of DataNodes.
- Space – need more disk space? Just add more DataNodes and re-balance.
- Industry standard – other distributed applications are built on top of HDFS (HBase, MapReduce).
HDFS is designed to process large data sets with write-once-read-many semantics; it is not built for low-latency access.
HDFS – Data Organization
- Each file written into HDFS is split into data blocks
- Each block is stored on one or more nodes
- Each copy of a block is called a replica
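To make the numbers concrete, here is a small sketch of how many blocks and replicas a file produces under the defaults discussed in this post (the 200 MB file size is a made-up example, not from the post):

```shell
# Hypothetical example: how a 200 MB file is stored with 64 MB blocks
FILE_MB=200       # example file size (assumption for illustration)
BLOCK_MB=64       # default HDFS block size
REPLICATION=3     # default replication factor

# Number of blocks = ceiling(file size / block size)
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))
echo "blocks: $BLOCKS"                               # 4 (three full, one 8 MB)
echo "replicas stored: $(( BLOCKS * REPLICATION ))"  # 12 block replicas
echo "raw disk used: $(( FILE_MB * REPLICATION )) MB"
```

Note that the last block only holds the remaining 8 MB; HDFS blocks do not pad out to the full block size on disk.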
Block placement policy
- The first replica is placed on the local node (or on a random node if the client is not running on a DataNode)
- The second replica is placed on a node in a different rack
- The third replica is placed on a different node in the same rack as the second replica
Some important defaults:
- Block size – 64 MB
- Replication factor – 3
- NameNode Web UI port – 50070
HDFS conf file – /etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>host_name:50070</value>
</property>
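Note that dfs.blocksize is specified in bytes. The value above works out to 256 MB, so this particular configuration overrides the 64 MB default mentioned earlier:

```shell
# dfs.blocksize is in bytes: 268435456 bytes = 256 MB
echo $(( 268435456 / 1024 / 1024 ))   # prints 256
```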
Interfaces to HDFS
- Java API (FileSystem)
- C wrapper (libhdfs)
- HTTP protocol
- WebDAV protocol
- Shell Commands
However, the command line is one of the simplest and most familiar.
HDFS – Shell Commands
There are two types of shell commands: user commands and administration commands.
HDFS – User Commands
List directory contents
hdfs dfs -ls /
hdfs dfs -ls /user
hdfs dfs -ls -R /var
Display the disk space used by files
hdfs dfs -du -h /
hdfs dfs -du /hbase/data/hbase/namespace/
hdfs dfs -du -h /hbase/data/hbase/namespace/
hdfs dfs -du -s /hbase/data/hbase/namespace/
Copy data to HDFS
hdfs dfs -mkdir tdataset
hdfs dfs -ls
hdfs dfs -put DEC_00_SF3_P077_with_ann.csv tdataset
hdfs dfs -ls -R
echo "blah blah blah" | hdfs dfs -put - tdataset/tfile.txt
hdfs dfs -ls -R
hdfs dfs -cat tdataset/tfile.txt
List file ACLs and extended attributes
hdfs dfs -getfacl tdataset/tfile.txt
hdfs dfs -getfattr -d tdataset/tfile.txt
Removing a file
hdfs dfs -rm tdataset/tfile.txt
hdfs dfs -ls -R
List the blocks of a file and their locations
hdfs fsck /user/cloudera/tdataset/DEC_00_SF3_P077_with_ann.csv -files -blocks -locations
Print corrupt or missing blocks and the files they belong to
hdfs fsck / -list-corruptfileblocks
HDFS – Administration Commands
Comprehensive status report of HDFS cluster
hdfs dfsadmin -report
Prints a tree of racks and their nodes
hdfs dfsadmin -printTopology
Get the information for a given datanode (like ping)
hdfs dfsadmin -getDatanodeInfo 10.0.2.15:50020
Re-read the host include/exclude files to refresh the set of DataNodes that are allowed to connect to the NameNode
hdfs dfsadmin -refreshNodes
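The -refreshNodes command re-reads the files pointed to by the dfs.hosts (allowed DataNodes) and dfs.hosts.exclude (DataNodes to decommission) properties. A minimal sketch of those properties in hdfs-site.xml; the file paths here are placeholders, not from this post:

```xml
<property>
  <name>dfs.hosts</name>
  <!-- plain-text file listing permitted DataNodes, one host per line -->
  <value>/etc/hadoop/conf/dfs.hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <!-- hosts listed here are decommissioned on the next refresh -->
  <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
</property>
```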
Other Interfaces to HDFS
MountableHDFS – FUSE
mkdir /home/cloudera/hdfs
sudo hadoop-fuse-dfs dfs://quickstart.cloudera:8020 /home/cloudera/hdfs
Once mounted, all operations on HDFS can be performed using standard Unix utilities such as ‘ls’, ‘cd’, ‘cp’, ‘mkdir’, ‘find’, and ‘grep’.
I hope this gives you a good introduction to HDFS and some useful HDFS commands to get started with your Big Data project. This is the first in a series of blog posts on the Hadoop ecosystem; bookmark this blog if you are getting started with Hadoop.