Posted to common-user@hadoop.apache.org by Erik Forsberg <fo...@opera.com> on 2009/06/26 10:24:49 UTC

Performance hit by not splitting .bz2?

Hi!

I have a case where we need to analyse logfiles. They are currently
compressed using bzip2, and an example logfile is roughly 105 MB
compressed, 720 MB uncompressed.

I'm considering using a Hadoop version with .bz2 support - probably
Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are
not split. 

I expect that for most jobs, the number of log files will exceed the
number of cores in my hadoop cluster.

Is it possible to estimate if I'll get a performance hit
because of the lack of splitting under these circumstances?

Thanks,
\EF
-- 
Erik Forsberg <fo...@opera.com>
Developer, Opera Mini - http://www.opera.com/mini/

Re: Performance hit by not splitting .bz2?

Posted by Zhong Wang <wa...@gmail.com>.

Hi Erik,

On Fri, Jun 26, 2009 at 4:24 PM, Erik Forsberg<fo...@opera.com> wrote:

> I'm considering using a Hadoop version with .bz2 support - probably
> Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are
> not split.

Yes. bzip2-compressed files are not splittable in current versions;
support may be introduced in a later release. You may be interested in
this patch:
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

> I expect that for most jobs, the number of log files will exceed the
> number of cores in my hadoop cluster.
>
> Is it possible to estimate if I'll get a performance hit
> because of the lack of splitting under these circumstances?

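Because the bzip2 files are not split, each file is processed by a
single map task, so the effective split size is the whole file (about
105 MB compressed, 720 MB uncompressed) rather than one HDFS block.
Even if the number of log files exceeds the number of cores in your
cluster, such large, indivisible tasks make load balancing coarser:
the last wave of tasks can leave most cores idle.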
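As a rough way to estimate the hit, here is a minimal sketch in Python
(a back-of-the-envelope model, not Hadoop code). It compares the job
finish time when every file is one indivisible task against the case
where each file is cut into fixed-size splits. Only the 105 MB file size
comes from this thread; the split size, slot count, throughput and file
count are hypothetical values you would replace with your own, and the
model ignores task startup cost and data locality.

```python
import heapq


def makespan(task_sizes_mb, slots, mb_per_sec):
    """Greedy schedule: each task goes to the slot that frees up first.

    Returns the time at which the last task finishes.
    """
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for size in task_sizes_mb:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + size / mb_per_sec)
    return max(free_at)


def split_sizes(file_mb, split_mb):
    """Sizes of the pieces a splittable file would be cut into."""
    full, rest = divmod(file_mb, split_mb)
    return [split_mb] * int(full) + ([rest] if rest else [])


FILE_MB = 105      # compressed size of one log file (from the post)
SPLIT_MB = 64      # assumed split size
SLOTS = 16         # hypothetical number of concurrent map tasks
MB_PER_SEC = 5.0   # hypothetical per-task throughput on compressed input
N_FILES = 20       # hypothetical number of log files in one job

# One task per file (no splitting).
unsplit = makespan([FILE_MB] * N_FILES, SLOTS, MB_PER_SEC)

# Several smaller tasks per file (splitting available).
split = makespan(split_sizes(FILE_MB, SPLIT_MB) * N_FILES, SLOTS, MB_PER_SEC)

print("unsplit: %.1f s, split: %.1f s" % (unsplit, split))
```

In this model the gap is largest when the file count is only slightly
above the slot count, because the last wave runs a few long tasks while
the other slots sit idle. With many more files than slots, the two
figures move closer together.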

-- 
Zhong Wang