How much data can our Hadoop instance hold, and how can I make it hold more?
Hadoop is a lot of things, and one of them is a distributed, abstracted file system. It’s called HDFS (for “Hadoop Distributed File System”), and it has its uses.
HDFS isn’t a file system in the interacts-with-OS sense. It’s more of a file system on top of file systems: the underlying (normal) file systems each run on one computer, while HDFS spans several computers. Within HDFS, files are divided into blocks; blocks are scattered across multiple machines, usually stored on more than one for redundancy.
There’s one NameNode (a computer) that knows where everything is, and several core nodes (Amazon’s term) that hold and serve data. You can log in to any of these nodes and run ordinary filesystem commands like ls and df, but those reflect only the local filesystem, which knows nothing about files in HDFS. The distributed file system is a layer above; to query it, you have to go through hadoop. A whole ‘nother file manager, with its own hierarchy of what’s where.
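As a sketch of what that looks like from a node’s shell (the paths here are hypothetical), the same question — “what files are here?” — gets two different answers depending on which layer you ask:

```shell
# Local filesystem: ordinary commands, local answers only.
ls /data
df -h /data

# HDFS: the same ideas, but routed through the hadoop command.
hadoop fs -ls /     # list the root of the distributed file system
hadoop fs -du -h /  # space used per path, summed across the cluster
```

These need a live cluster, of course; the point is that hadoop fs is its own file manager, not a view of the local disk.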
Why? The main purpose is to stream one file faster: several machines can read and process the same file at the same time, because its parts are scattered across machines. HDFS also replicates blocks to multiple machines. That gives redundancy in storage, and also in access: if one machine is busy, a reader can pull the same block from another. In the end, we use it at Outpace because it can store files that are too big to fit in any one place.
Negatives? HDFS files are write-once or append-only. That sounds great (immutable, right?) until I do need to make a small change, and copy-on-modify means copying hundreds of gigabytes. We don’t have the space for that!
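A hedged sketch of what that means in practice (file names hypothetical; append support depends on the cluster’s configuration): appending is a supported operation, but there is no in-place edit, so changing one line means round-tripping the whole file.

```shell
# Appending new records to an existing HDFS file is fine:
hadoop fs -appendToFile new-records.txt /data/huge-file.txt

# But to change one line, you copy the whole file out,
# modify it locally, and write the whole thing back:
hadoop fs -get /data/huge-file.txt ./huge-file.txt
sed -i 's/old-value/new-value/' ./huge-file.txt
hadoop fs -put -f ./huge-file.txt /data/huge-file.txt
```

For a multi-hundred-gigabyte file, that get/put cycle is the copy we don’t have room for.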
How much space do we have?
(number of core nodes × space per node) / replication factor
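For example (all numbers hypothetical), a cluster of 4 core nodes with 800 GB of HDFS-usable space each, at HDFS’s usual default replication factor of 3:

```shell
nodes=4          # hypothetical core node count
per_node_gb=800  # hypothetical HDFS-usable space per node, in GB
replication=3    # HDFS's usual default replication factor
echo "$(( nodes * per_node_gb / replication )) GB"   # prints "1066 GB"
```

Tripling the replication factor would cut usable capacity to a third; so would losing two-thirds of the nodes.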
I can find the number of core nodes and the space on each, along with the total disk space HDFS sees as available, by logging in to the NameNode (master node, in Amazon terms) and running
hadoop dfsadmin -report
Here, one uses hadoop as a top-level command, then dfsadmin as a subcommand, and then -report to tell dfsadmin what to do. This seems to be typical of dealing with hadoop.
This prints a summary for the whole cluster, and then details for each node. The summary looks like: