HDFS: Big data analytics' weakest link

For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark, you need to pull a lot of data into memory very quickly, so a file system that supports high burst read rates, up to network saturation, is a good thing. However, Hadoop’s eponymous file system (the Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be.

What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and how that block maps to a file. (One implementation is a file allocation table, or FAT, of sorts.) In a distributed file system, the blocks are “distributed” among disks attached to multiple computers. Additionally, like RAID or most SAN systems, the blocks are replicated, so that if a node drops off the network, no data is lost.
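To make the idea concrete, here is a toy sketch, not HDFS’s actual data structures, of a block map that tracks which nodes hold replicas of each block of a file. The node names and the replication factor of three are assumptions for illustration only:

```java
import java.util.*;

// Illustrative only: a toy "block map" showing the idea behind a distributed
// file system's metadata, not HDFS's real data structures.
public class ToyBlockMap {
    static final int REPLICATION = 3; // assumed replication factor

    // file name -> ordered list of block IDs
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // block ID -> the nodes holding a copy of that block
    private final Map<Long, Set<String>> blockToNodes = new HashMap<>();
    private long nextBlockId = 0;

    // Record a new block for a file, replicated across several nodes.
    public long addBlock(String file, List<String> nodes) {
        long blockId = nextBlockId++;
        fileToBlocks.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
        int copies = Math.min(REPLICATION, nodes.size());
        blockToNodes.put(blockId, new HashSet<>(nodes.subList(0, copies)));
        return blockId;
    }

    // When a node dies, its replicas are gone, but the other copies survive.
    public void nodeLost(String node) {
        for (Set<String> replicas : blockToNodes.values()) {
            replicas.remove(node);
        }
    }

    // Where can a reader fetch the blocks of this file?
    public List<Set<String>> locate(String file) {
        List<Set<String>> locations = new ArrayList<>();
        for (long blockId : fileToBlocks.getOrDefault(file, List.of())) {
            locations.add(blockToNodes.get(blockId));
        }
        return locations;
    }

    public static void main(String[] args) {
        ToyBlockMap map = new ToyBlockMap();
        map.addBlock("/data/events.log", List.of("node1", "node2", "node3"));
        map.addBlock("/data/events.log", List.of("node2", "node3", "node4"));
        map.nodeLost("node2"); // lose a node; every block still has copies
        System.out.println(map.locate("/data/events.log"));
    }
}
```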

What’s wrong with HDFS?

In HDFS, the role of the “file allocation table” is taken by the namenode. You can have more than one namenode (for redundancy), but essentially the namenode is both a point of failure and a type of bottleneck. A namenode can fail over, but that takes time. It also means keeping the standby in sync with the active namenode, which introduces more latency. On top of that, the namenode does a fair amount of internal threading and locking, and it is written in garbage-collected Java. Garbage collection, especially Java garbage collection, requires a lot of memory (generally at least 10x as much to be as efficient as native memory management).
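To see where the namenode sits in every read, here is a minimal sketch using Hadoop’s standard FileSystem client API; the namenode address and file path are placeholders. Both metadata calls below are answered by the namenode before the client ever talks to a datanode:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamenodeLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode address and file path, for illustration only.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/events.log");
        // Both of these are metadata lookups resolved by the namenode.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Only after this point does a reader contact datanodes directly.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```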

Moreover, when developing applications for distributed computing, we often figure that whatever inefficiency we inject through language choice will be outweighed by I/O. Meaning, so what if it took 1,000 operations to open a file and hand you some data, when the I/O itself took 10 times as long? Simplistically speaking, the higher level the language, the more operations, or “work,” are executed per line of code.
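As a back-of-the-envelope illustration of that argument, with assumed (not measured) per-operation costs:

```java
public class OverheadVsIo {
    public static void main(String[] args) {
        // Assumed numbers, purely to illustrate the argument above.
        long opsToOpenFile = 1_000;     // "work" done by high-level code
        double nanosPerOp = 100;        // assumed cost of one operation
        double cpuNanos = opsToOpenFile * nanosPerOp;  // 100 microseconds of overhead
        double ioNanos = 10 * cpuNanos;                // I/O takes 10x that, per the argument

        System.out.printf("CPU overhead: %.0f us%n", cpuNanos / 1_000);
        System.out.printf("I/O wait:     %.0f us%n", ioNanos / 1_000);
        System.out.printf("CPU share of total: %.0f%%%n",
                100 * cpuNanos / (cpuNanos + ioNanos));
        // With I/O at 10x the CPU cost, language overhead is only about 9%
        // of the total, which is why it is often waved away.
    }
}
```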