Posted to common-user@hadoop.apache.org by Tom Melendez <to...@supertom.com> on 2011/07/27 08:28:36 UTC

questions regarding data storage and inputformat

Hi Folks,

I have a bunch of binary files which I've stored in a sequencefile.
The name of the file is the key, the data is the value and I've stored
them sorted by key.  (I'm not tied to using a sequencefile for this).
The current test data is only 50MB, but the real data will be 500MB -
1GB.

Each unit of input to my M/R job is a group of several records from the
sequence file, and the key determines which records belong to a group.
The sorting mentioned above keeps each group packed together.

1. Any reason not to use a sequence file for this?  Perhaps a mapfile?
Since I've sorted it, I don't need "random" access, but I do need
to be aware of the keys, as I need to be sure that all of the
relevant keys are sent to a given mapper.

2. Looks like I want a custom inputformat for this, extending
SequenceFileInputFormat.  Do you agree?  I'll gladly take some
opinions on this, as I ultimately want to split the file based on
what's in it, which might be a little unorthodox.

3. Another idea might be to create a separate seq file for each chunk
of records and make the files non-splittable, ensuring that each one
goes to a single mapper.  Assuming I can get away with this, see any
pros/cons with that approach?

Thanks,

Tom

-- 
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs

Re: questions regarding data storage and inputformat

Posted by Joey Echeverria <jo...@cloudera.com>.
You could either use a custom RecordReader or you could override the
run() method on your Mapper class to do the merging before calling the
map() method.
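
As a rough illustration of the merging step (a sketch only, not code from
this thread and not the Hadoop API): the class below models what an
overridden run() loop would have to do, namely buffer consecutive records
that belong to the same group and hand each complete group on in one call.
The Record class, the groupOf() rule (key text before the first '_') and
the sample file names are all hypothetical. In a real job the loop would
pull records from the Mapper's Context rather than from a List.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupingSketch {
    /** One key/value record, standing in for a SequenceFile entry. */
    static final class Record {
        final String key;
        final byte[] value;

        Record(String key, byte[] value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical grouping rule: the part of the key before the first '_'. */
    static String groupOf(String key) {
        int i = key.indexOf('_');
        return i < 0 ? key : key.substring(0, i);
    }

    /**
     * Walks records that are already sorted by key and collects consecutive
     * records of the same group into one list. A run() override would do the
     * same buffering and then make a single map() call per group.
     */
    static List<List<Record>> mergeByGroup(List<Record> sorted) {
        List<List<Record>> groups = new ArrayList<>();
        List<Record> current = new ArrayList<>();
        String currentGroup = null;
        for (Record r : sorted) {
            String g = groupOf(r.key);
            if (currentGroup != null && !g.equals(currentGroup)) {
                groups.add(current);          // previous group is complete
                current = new ArrayList<>();
            }
            currentGroup = g;
            current.add(r);
        }
        if (!current.isEmpty()) {
            groups.add(current);              // flush the final group
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Record> input = new ArrayList<>();
        input.add(new Record("sceneA_1.bin", new byte[] {1}));
        input.add(new Record("sceneA_2.bin", new byte[] {2}));
        input.add(new Record("sceneB_1.bin", new byte[] {3}));
        for (List<Record> group : mergeByGroup(input)) {
            System.out.println(groupOf(group.get(0).key) + ": "
                    + group.size() + " record(s)");
        }
    }
}
```

This only works if a group never straddles two input splits, which is why
the question of where the splits fall matters so much here.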

-Joey

On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez <to...@supertom.com> wrote:
>>
>>> 3. Another idea might be to create a separate seq file for each chunk
>>> of records and make the files non-splittable, ensuring that each one
>>> goes to a single mapper.  Assuming I can get away with this, see any
>>> pros/cons with that approach?
>>
>> Separate sequence files would require the least amount of custom code.
>>
>
> Thanks for the response, Joey.
>
> So, if I were to do the above, I would still need a custom record
> reader to put all the keys and values together, right?
>
> Thanks,
>
> Tom



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: questions regarding data storage and inputformat

Posted by Tom Melendez <to...@supertom.com>.
>
>> 3. Another idea might be to create a separate seq file for each chunk
>> of records and make the files non-splittable, ensuring that each one
>> goes to a single mapper.  Assuming I can get away with this, see any
>> pros/cons with that approach?
>
> Separate sequence files would require the least amount of custom code.
>

Thanks for the response, Joey.

So, if I were to do the above, I would still need a custom record
reader to put all the keys and values together, right?

Thanks,

Tom


Re: questions regarding data storage and inputformat

Posted by Joey Echeverria <jo...@cloudera.com>.
> 1. Any reason not to use a sequence file for this?  Perhaps a mapfile?
>  Since I've sorted it, I don't need "random" accesses, but I do need
> to be aware of the keys, as I need to be sure that I get all of the
> relevant keys sent to a given mapper

MapFile *may* be better here (see my answer for 2 below).

> 2. Looks like I want a custom inputformat for this, extending
> SequenceFileInputFormat.  Do you agree?  I'll gladly take some
> opinions on this, as I ultimately want to split the file based on
> what's in it, which might be a little unorthodox.

If you need to split based on where certain keys are in the file, then
a SequenceFile isn't a great solution. It would require that your
InputFormat scan through all of the data just to find split points.
Assuming you know what keys to split on ahead of time, you could use
MapFiles and find the exact split point more quickly.
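
To make the difference concrete, here is a toy model (a sketch under
stated assumptions, not the Hadoop MapFile API): in-memory arrays stand in
for the sorted data file and for a sparse index that records every Nth key
with its position. scanForSplit() reads keys from the start, as a plain
SequenceFile would force you to, while seek() binary-searches the small
index and then scans only a short stretch. The class name, method names
and sample keys are all hypothetical.

```java
import java.util.Arrays;

public class SplitPointSketch {
    /** Sparse index: every Nth key of the data plus its position. */
    static final class SparseIndex {
        final String[] keys;
        final int[] positions;

        SparseIndex(String[] keys, int[] positions) {
            this.keys = keys;
            this.positions = positions;
        }
    }

    /** Records the key at positions 0, interval, 2*interval, ... */
    static SparseIndex buildIndex(String[] sortedKeys, int interval) {
        int n = (sortedKeys.length + interval - 1) / interval;
        String[] keys = new String[n];
        int[] positions = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = sortedKeys[i * interval];
            positions[i] = i * interval;
        }
        return new SparseIndex(keys, positions);
    }

    /** No index: examine keys from the start until one is >= target. */
    static int scanForSplit(String[] sortedKeys, String target) {
        for (int i = 0; i < sortedKeys.length; i++) {
            if (sortedKeys[i].compareTo(target) >= 0) {
                return i;
            }
        }
        return sortedKeys.length;
    }

    /**
     * With an index: binary-search for the last index entry <= target,
     * jump to its position, then scan forward from there only.
     * Assumes keys are unique, as file names would be.
     */
    static int seek(String[] sortedKeys, SparseIndex index, String target) {
        int p = Arrays.binarySearch(index.keys, target);
        int slot = p >= 0 ? p : -(p + 1) - 1;   // -1 if target sorts first
        int start = slot < 0 ? 0 : index.positions[slot];
        for (int i = start; i < sortedKeys.length; i++) {
            if (sortedKeys[i].compareTo(target) >= 0) {
                return i;
            }
        }
        return sortedKeys.length;
    }

    public static void main(String[] args) {
        String[] keys = {"a_1", "a_2", "b_1", "b_2", "c_1", "c_2", "d_1"};
        SparseIndex index = buildIndex(keys, 3);
        // Both calls find the first record of group "c"; seek() starts
        // its scan at the nearest indexed position instead of at zero.
        System.out.println(scanForSplit(keys, "c_"));
        System.out.println(seek(keys, index, "c_"));
    }
}
```

Both methods return the same position; the index only changes how much of
the data has to be read to get there.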

> 3. Another idea might be to create a separate seq file for each chunk
> of records and make the files non-splittable, ensuring that each one
> goes to a single mapper.  Assuming I can get away with this, see any
> pros/cons with that approach?

Separate sequence files would require the least amount of custom code.

-Joey
