Posted to common-user@hadoop.apache.org by Hrishikesh Agashe <hr...@persistent.co.in> on 2009/11/14 17:25:06 UTC

DFS block size

Hi,

The default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS, it will not be divided any further?
I have lots and lots of XMLs and I would like to process them directly. Currently I am converting them to Sequence files (10 XMLs per sequence file) and then putting them on HDFS. However, creating the sequence files is a very time-consuming process. So if I just ensure that all XMLs are smaller than 64 MB (or the value of dfs.block.size), they will not be split, and I can safely process them in map/reduce using a SAX parser?
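
For reference, my conversion step is essentially the following sketch (simplified and illustrative, not my exact code): each record is keyed by file name, with the raw XML bytes as the value.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Simplified sketch: pack a batch of XML files into one SequenceFile,
    // keyed by file name, with the raw file bytes as the value.
    public class XmlToSequenceFile {
      public static void pack(String[] xmlPaths, String seqPath) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(seqPath), Text.class, BytesWritable.class);
        try {
          for (String xml : xmlPaths) {
            Path p = new Path(xml);
            int len = (int) fs.getFileStatus(p).getLen();
            byte[] buf = new byte[len];
            FSDataInputStream in = fs.open(p);
            try {
              IOUtils.readFully(in, buf, 0, len);  // read the whole XML file
            } finally {
              in.close();
            }
            writer.append(new Text(p.getName()), new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }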

If this is not possible, is there a way to speed up the sequence file creation process?



DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: DFS block size

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
>
> Cloudera has a pretty detailed blog on this.
>

Indeed. See http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/.
The post is getting a bit long in the tooth but should contain some useful
information for you.

Regards,
Jeff

Re: DFS block size

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Replies inline.



On 11/14/09 9:55 PM, "Hrishikesh Agashe" <hr...@persistent.co.in> wrote:

Hi,

The default DFS block size is 64 MB. Does this mean that if I put a file smaller than 64 MB on HDFS, it will not be divided any further?

--Yes, the file will be stored in a single block per replica.
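You can confirm the block layout of a file with fsck, for example (the path here is illustrative):

    hadoop fsck /user/hrishikesh/sample.xml -files -blocks

The output lists each file's blocks and their sizes, so a sub-64 MB file should show exactly one block.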

I have lots and lots of XMLs and I would like to process them directly. Currently I am converting them to Sequence files (10 XMLs per sequence file) and then putting them on HDFS. However, creating the sequence files is a very time-consuming process. So if I just ensure that all XMLs are smaller than 64 MB (or the value of dfs.block.size), they will not be split, and I can safely process them in map/reduce using a SAX parser?

--True, but too many small files are generally not recommended, since they eat into NN resources and add overhead to map-reduce jobs, along with other issues discussed previously in this forum. Cloudera has a pretty detailed blog on this. Alternatively, you can also define the split size to be used in your map-reduce code via the configuration parameter mapred.min.split.size (doesn't work with all formats :| ). For XML, there is a streamxml format (presumably StreamXmlRecordReader in Hadoop Streaming) or something similarly named that you may want to consider.
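
For example, a minimal sketch of raising the split size with the old JobConf API (the 128 MB value and the MyXmlJob class name are illustrative):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyXmlJob.class);
    // Request splits of at least 128 MB. Note that with the standard
    // FileInputFormat each file still yields at least one split, so this
    // mainly keeps large files from being chopped into many small maps;
    // it does not combine several small files into one split.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);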

Thanks,
Amogh

If this is not possible, is there a way to speed up the sequence file creation process?


