Posted to general@hadoop.apache.org by Jonathan Holloway <jo...@gmail.com> on 2009/09/14 19:33:36 UTC

Hadoop and Small Files

Hi all,

I'm new to Hadoop and am currently looking at it for a project where
around a few TB of data needs to be stored in a format suitable for
MapReduce jobs.  The problem is that I'm dealing with small text files
(including metadata) of around 10 KB in size (ranging up to a few MB)
that need to be stored in some format.  The files need to be accessed
randomly with very low latency.  I've been through the docs and previous
posts on the mailing list, and looked at the following options:

* HDFS - not suitable "as is" because of the 64 MB default block size
* HAR (Hadoop Archives) - not sure about random access to files within the
archive
* Sequence Files - slow to convert into this format, and the files can't be
accessed randomly
* CombineFileInputFormat - assuming you still can't access the files
randomly: https://issues.apache.org/jira/browse/HADOOP-4565
* MapFile - looks good... but not sure about latency (a sketch of the kind
of lookup I'd need follows this list):
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
* HBase (or similar distributed key-value store) - not sure about latency;
has this improved with the 0.20 release?
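
For concreteness, the kind of lookup I'd need to do against a MapFile would
look roughly like this (the directory name and key below are invented).  As
I understand it, MapFile.Reader loads the index into memory, so get() is a
binary search over the index followed by a seek into the data file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical MapFile directory holding (file name -> file bytes)
    // pairs; keys must have been appended in sorted order.
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/files.map", conf);
    try {
      Text key = new Text("doc-00042.txt");    // hypothetical file name key
      BytesWritable value = new BytesWritable();
      // get() binary-searches the in-memory index, then seeks into the
      // data file to read the record.
      if (reader.get(key, value) != null) {
        System.out.println("read " + value.getLength() + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}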

Please correct me if I'm wrong about the assumptions above.  Which is the
most appropriate option here?

Many thanks...
Jon.

Re: Hadoop and Small Files

Posted by Jonathan Holloway <jo...@gmail.com>.
Hi all...

Many thanks for your help and the responses, currently investigating HBase
0.20 as a potential option...

2009/9/14 Amr Awadallah <aa...@cloudera.com>

> Yes, latency is significantly better with 0.20; see the presentation from
> the Hadoop Summit for the results:
>
> http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/


-- 
Design and Tech-noogly
Web: http://www.oogly.co.uk
Mail:  jonathan.holloway@oogly.co.uk
IM: jonathan_philip_holloway@hotmail.com

Re: Hadoop and Small Files

Posted by Amr Awadallah <aa...@cloudera.com>.
 > The files need to be accessed randomly with very low latency

Then use:

* HBase (or similar distributed key-value store) - not sure about latency,
has this improved with the 0.20 release?

Yes, latency is significantly better with 0.20; see the presentation from
the Hadoop Summit for the results:

http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/
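
For what it's worth, a random read against the new 0.20 client API is just
a single Get.  A rough sketch (the table name, column family, and row key
below are made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FileFetch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "files");          // hypothetical table
    // Row key = original file name, file bytes stored in a hypothetical
    // "f:content" column.
    Get get = new Get(Bytes.toBytes("doc-00042.txt"));
    Result result = table.get(get);
    byte[] content = result.getValue(Bytes.toBytes("f"),
                                     Bytes.toBytes("content"));
    if (content != null) {
      System.out.println("read " + content.length + " bytes");
    }
  }
}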

-- amr


Re: Hadoop and Small Files

Posted by Sam Baskinger <sa...@networkedinsights.com>.
Hey Jon,

I don't know how many seconds would constitute low latency for your application, but I would guess that Hadoop simply will not cut it. I would recommend something closer to Grid-SQL.

If you absolutely must process all the files using MapReduce, perhaps you can pack them into 1 GB container files (roughly as sketched below) and process each one as a fraction of the larger problem? Just a thought.
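
Packing could look something like the sketch below, writing each small file
as a (name, bytes) record into a SequenceFile (the paths are made up).  Note
that this alone doesn't give you random access back; you'd still need
something like a MapFile index for that:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path("/incoming");               // hypothetical small files
    Path out = new Path("/packed/part-0000.seq");  // hypothetical output
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(in)) {
        if (status.isDir()) continue;
        byte[] buf = new byte[(int) status.getLen()];
        FSDataInputStream stream = fs.open(status.getPath());
        try {
          stream.readFully(0, buf);
        } finally {
          stream.close();
        }
        // Key = original file name, value = raw file contents.
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}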

Sam




Sam Baskinger
Software Engineer
Networked Insights, Inc. <http://www.networkedinsights.com/>
