Posted to common-user@hadoop.apache.org by Songting Chen <ke...@yahoo.com> on 2008/09/25 20:20:57 UTC

compressed files on HDFS

Does HDFS guarantee that all the blocks of a particular compressed file exist on the same datanode? (same question for any of its replicas)

This is important because in Hadoop, each Map is required to process the entire compressed file. If the blocks of the compressed
file exist on different datanodes, that could introduce significant network transfer cost.

Any thoughts on this? Thanks,
-Songting Chen

Re: compressed files on HDFS

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Sep 25, 2008, at 11:20 AM, Songting Chen wrote:

> Does HDFS guarantee that all the blocks of a particular compressed  
> file exist on the same datanode? (same question for any of its  
> replicas)
>

No. HDFS does not understand data formats - it just stores raw data in  
multiple blocks across DataNodes.

> This is important because in Hadoop, each Map is required to process
> the entire compressed file. If the blocks of the compressed
> file exist on different datanodes, that could introduce significant
> network transfer cost.
>

You might want to use SequenceFiles with RECORD/BLOCK compression.
These ensure that the compressed data remains splittable for processing
via Map-Reduce.

Arun
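
To illustrate why block-wise compression keeps data splittable, here is a
minimal sketch. It is not the SequenceFile format and does not use the Hadoop
API; it relies only on java.util.zip, and the class and method names
(BlockCompressionSketch, compressInBlocks, readBlock) are made up for this
example. The idea it shows is that when each group of records is compressed as
an independent stream, a reader can decode any one group without reading the
groups before it, which is what lets separate map tasks work on separate parts
of a single file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only: independent gzip streams stand in for
// the compressed blocks of a block-compressed container file.
class BlockCompressionSketch {

    // Compress one group of records as a self-contained gzip stream.
    static byte[] compressBlock(List<String> records) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (String r : records) {
                gz.write((r + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return buf.toByteArray();
    }

    // Cut the records into groups of recordsPerBlock and compress each
    // group on its own, so no group depends on any other.
    static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock)
            throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            blocks.add(compressBlock(records.subList(i, end)));
        }
        return blocks;
    }

    // Decompress a single block without touching any other block.
    static List<String> readBlock(byte[] block) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            String text = new String(gz.readAllBytes(), StandardCharsets.UTF_8);
            List<String> out = new ArrayList<>();
            for (String line : text.split("\n")) {
                if (!line.isEmpty()) {
                    out.add(line);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> records = List.of("r0", "r1", "r2", "r3", "r4");
        List<byte[]> blocks = compressInBlocks(records, 2);
        // A reader handed only the second block can decode it directly.
        System.out.println(readBlock(blocks.get(1))); // prints [r2, r3]
    }
}
```

By contrast, a file compressed as one single gzip stream can only be decoded
from its first byte, which is why one Map ends up reading the whole file in
the case described in the question.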