Posted to user@hive.apache.org by Avrilia Floratou <fl...@cs.wisc.edu> on 2012/02/01 00:53:15 UTC

RCFile and Hadoop Counters

Hi,

I have a question about the Hadoop counters reported when RCFile is used.
I have 16 TB of (uncompressed) data stored in compressed RCFile format. The compressed RCFile data is approximately 3 TB in total.
I ran a simple scan query over this table. Each split is 256 MB (the HDFS block size).
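
As a rough sanity check on the setup (just a sketch; 3 TB is the approximate compressed size quoted above, and I'm taking 256 MB as the HDFS block size):

    # Expected number of splits, and hence map tasks, for this table
    compressed_bytes = 3 * 10**12        # ~3 TB of compressed RCFile data
    split_bytes = 256 * 1024 * 1024      # one 256 MB HDFS block per split

    print(compressed_bytes / split_bytes)   # ~11,176 splits / map tasks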

From the counters of each individual map task I see the following:

HDFS_BYTES_READ : 91,235,561
Map input bytes: 268,191,006

Then I looked at the aggregate counters produced by the MR job. I see:

HDFS_BYTES_READ :  1,049,781,904,232
Map input bytes:  3,088,881,678,946
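
Dividing the aggregate counters by the per-task ones (a quick consistency check in Python, assuming the single task shown above is typical) points to the same number of map tasks either way, and shows that the per-task Map input bytes is almost exactly one full 256 MB split:

    # Implied number of map tasks from each counter pair
    hdfs_per_task = 91_235_561
    input_per_task = 268_191_006
    hdfs_total = 1_049_781_904_232
    input_total = 3_088_881_678_946

    print(hdfs_total / hdfs_per_task)      # ~11,506 tasks
    print(input_total / input_per_task)    # ~11,517 tasks -- consistent
    print(input_per_task / (256 * 1024 * 1024))   # ~0.999: one split per task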

The total job time was 4980 sec. During the job I ran iostat to check the bandwidth I was getting from my disks: about 40 MB/sec at each of my 16
nodes, i.e. a total of 40 * 16 = 640 MB/sec across the cluster.

If the raw data read were 1,049,781,904,232 bytes, as the HDFS_BYTES_READ counter suggests, then at 640 MB/sec the job should have finished in about 1640 sec (~1 TB / 640 MB/sec), yet it took 4980 sec.
What is wrong here?
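
To make the mismatch concrete, here is the back-of-the-envelope arithmetic in Python (treating the 640 MB/sec from iostat as decimal megabytes, which reproduces my 1640 sec figure):

    bandwidth = 640 * 10**6           # aggregate disk bandwidth from iostat
    hdfs_total = 1_049_781_904_232    # HDFS_BYTES_READ for the whole job
    input_total = 3_088_881_678_946   # Map input bytes for the whole job

    print(hdfs_total / bandwidth)     # ~1640 sec, what HDFS_BYTES_READ predicts
    print(input_total / bandwidth)    # ~4826 sec, close to the measured 4980 sec

So the elapsed time seems to track Map input bytes rather than HDFS_BYTES_READ.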

I'm wondering what these two counters, HDFS_BYTES_READ and Map input bytes, actually represent when compressed RCFiles are used
as the storage layer, and how they relate to the raw bandwidth I can get from iostat.

Thanks,
Avrilia