Posted to common-dev@hadoop.apache.org by Daren Hasenkamp <dh...@berkeley.edu> on 2010/06/01 22:56:44 UTC

record-aware file splitting

Hi,

I am interested in implementing record-aware file splitting for Hadoop. I
am looking for someone who knows the Hadoop internals well and is willing
to discuss the details of how to accomplish this.

By "record-aware file splitting", I mean that I want to be able to put
files into hadoop with a custom InputFormat implementation, and hadoop
will split the files into blocks such that no record is split between
blocks.
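
To make this concrete, here is a rough, untested sketch of what I have in
mind at the InputFormat level, assuming fixed-size records (the class name
and record-size constant are made up). Note that this only aligns map splits
to record boundaries; the HDFS block boundaries underneath are still
record-oblivious, which is the part I would like to change:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hypothetical InputFormat for fixed-size records (e.g. one global
    // climate snapshot per record). Splits are rounded down to a whole
    // number of records, so no record straddles two splits.
    public class FixedRecordInputFormat
            extends FileInputFormat<LongWritable, BytesWritable> {

        // Record size in bytes; in practice this would come from the job
        // configuration rather than a constant.
        public static final long RECORD_SIZE = 64L * 1024 * 1024;

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus file : listStatus(job)) {
                Path path = file.getPath();
                long length = file.getLen();
                long blockSize = file.getBlockSize();
                // Target split size: the largest multiple of RECORD_SIZE
                // that fits in one block (or a single record if records
                // are larger than a block).
                long splitSize = Math.max(RECORD_SIZE,
                        (blockSize / RECORD_SIZE) * RECORD_SIZE);
                long offset = 0;
                while (offset < length) {
                    long size = Math.min(splitSize, length - offset);
                    splits.add(new FileSplit(path, offset, size, new String[0]));
                    offset += size;
                }
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            // A real reader would return one whole record per call to
            // nextKeyValue(); omitted from this sketch.
            throw new UnsupportedOperationException("reader omitted from sketch");
        }
    }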

I believe that record-aware file splitting could offer considerable
speedup when dealing with large records--say, tens or hundreds of megabytes
per record--since it eliminates the need to stream part of a record from one
datanode to another when that record straddles a block boundary.

(The motivation here is that large records occur commonly when dealing
with scientific datasets. Imagine, for example, a set of climate
simulation data, where each "record" consists of climate data over the
entire globe at a given time step. This is a huge amount of data per
record. Essentially, I want to modify Hadoop to work faster with large
scientific datasets.)

If you are interested in discussing this with me, I would love to talk
more with you.

Thanks!
Daren Hasenkamp
Computer Science/Applied Mathematics, UC Berkeley
Student Assistant, Lawrence Berkeley National Lab


Re: record-aware file splitting

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Daren,

Your idea has some pedigree in the Hadoop universe: it was proposed in early
2006 at https://issues.apache.org/jira/browse/HADOOP-106 and closed as
"won't fix". The suggestion there is to pad out the rest of the block for
very large records, since the complexity that splitting blocks on record
boundaries would add to the file system is considered too high.
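
For illustration, the padding idea looks roughly like this on the write path
(untested sketch, invented class name; a real implementation would also need
the reader to recognize and skip the filler bytes):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Pads out the current HDFS block with filler bytes whenever the next
    // record would otherwise straddle a block boundary, so each record
    // lands entirely inside one block.
    public class BlockPaddingWriter {

        private static final byte FILLER = 0;

        private final FSDataOutputStream out;
        private final long blockSize;

        public BlockPaddingWriter(Configuration conf, Path path) throws IOException {
            FileSystem fs = path.getFileSystem(conf);
            this.out = fs.create(path);
            this.blockSize = fs.getDefaultBlockSize();
        }

        public void writeRecord(byte[] record) throws IOException {
            long pos = out.getPos();
            long remainingInBlock = blockSize - (pos % blockSize);
            // If the record does not fit in the rest of the current block
            // (and can fit in a block at all), pad to the boundary so the
            // whole record goes into the next block.
            if (record.length > remainingInBlock && record.length <= blockSize) {
                for (long i = 0; i < remainingInBlock; i++) {
                    out.write(FILLER);
                }
            }
            out.write(record);
        }

        public void close() throws IOException {
            out.close();
        }
    }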

That said, if you feel strongly about your direction, feel free to open a
new JIRA issue, link it to the old one, and make your case there. You may
also be interested in authoring a HEP (see
http://www.cloudera.com/blog/2010/06/the-second-apache-hadoop-hdfs-and-mapreduce-contributors-meeting)
describing your intent.

One last recommendation: I note that you have not made large changes to the
Hadoop code base yet. You may want to start with a slightly smaller project
to get your feet wet. Lots of folks would be happy to guide you to an
appropriate project.

Regards,
Jeff
