Posted to common-user@hadoop.apache.org by Ali Safdar Kureishy <sa...@gmail.com> on 2012/04/23 12:53:21 UTC

Reading data output by MapFileOutputFormat

Hi,

If I use a *MapFileOutputFormat* to output some data, I see that each
reducer's output is a folder ("part-00000", for example), and inside that
folder are two files: "data" and "index".

However, there is no corresponding MapFileInputFormat to read back this
folder ("part-00000"). Instead, *SequenceFileInputFormat* seems to read the
data. So, I have some questions:
- does SequenceFileInputFormat actually read *all* the data that was output
by MapFileOutputFormat? Or is some relationship data between the data and
index files lost in this process that would have been better handled by
another InputFormat class? In other words, is SequenceFileInputFormat the
right InputFormat to read data written by MapFileOutputFormat?
- how is it that SequenceFileInputFormat works to read outputs from
*both* MapFileOutputFormat and SequenceFileOutputFormat? That would imply
that MapFileOutputFormat and SequenceFileOutputFormat output the same data,
OR that SequenceFileInputFormat internally handles both differently. What is
the reality?

Thanks,
Safdar

Re: Reading data output by MapFileOutputFormat

Posted by Ali Safdar Kureishy <sa...@gmail.com>.

Thanks Harsh! This is very helpful.

Regards,
Ali

On Mon, Apr 23, 2012 at 2:08 PM, Harsh J <ha...@cloudera.com> wrote:
> Ali,
>
> MapFiles are explained at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
> - please give it a read; it should answer about half of your questions.
> In short, a MapFile is two files: a raw SequenceFile that holds the
> records, and an index file built on top of it.
>
> The reason MR does not provide a MapFileInputFormat is that you don't
> need to use the index file in MR jobs (no lookups for input-driven
> jobs). Hence SequenceFileInputFormat suffices to read the data (it
> ignores the index file and reads only the sequence file that carries
> the data).
>
> If you wish to make use of MapFile's index abilities for lookups/etc.,
> use the MapFile.Reader class directly in your implementation.
>
>
> --
> Harsh J

Re: Reading data output by MapFileOutputFormat

Posted by Harsh J <ha...@cloudera.com>.

Ali,

MapFiles are explained at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
- please give it a read; it should answer about half of your questions.
In short, a MapFile is two files: a raw SequenceFile that holds the
records, and an index file built on top of it.

The reason MR does not provide a MapFileInputFormat is that you don't
need to use the index file in MR jobs (no lookups for input-driven
jobs). Hence SequenceFileInputFormat suffices to read the data (it
ignores the index file and reads only the sequence file that carries
the data).

If you wish to make use of MapFile's index abilities for lookups/etc.,
use the MapFile.Reader class directly in your implementation.
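
To make the data/index split above concrete, here is a minimal in-memory
sketch. It is not Hadoop code: ToyMapFile, its methods, and the index
interval of 2 are all hypothetical names and values chosen for
illustration. It only models the idea that a full scan needs the sorted
records alone, while a point lookup uses a sparse index to skip ahead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of the data + index layout. Hypothetical, not Hadoop code. */
public class ToyMapFile {
    // Stand-in for the "data" file: records kept in ascending key order.
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    // Stand-in for the "index" file: every Nth key mapped to its position in the data.
    private final TreeMap<String, Integer> index = new TreeMap<>();
    private final int indexInterval;

    public ToyMapFile(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("indexInterval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    /** Appends a record. This toy requires strictly ascending keys. */
    public void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys must be appended in ascending order");
        }
        if (keys.size() % indexInterval == 0) {
            index.put(key, keys.size()); // only a fraction of the keys is indexed
        }
        keys.add(key);
        values.add(value);
    }

    /** Full scan, as an input-driven job would do: the index is never consulted. */
    public List<String> scan() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            out.add(keys.get(i) + "=" + values.get(i));
        }
        return out;
    }

    /** Point lookup: jump to the nearest indexed key at or before the target, then scan forward. */
    public String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // key sorts before the first record
        }
        for (int i = start.getValue(); i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break; // passed the position where the key would have been
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyMapFile f = new ToyMapFile(2);
        f.append("apple", "1");
        f.append("banana", "2");
        f.append("cherry", "3");
        f.append("date", "4");
        System.out.println(f.scan());      // every record, no index needed
        System.out.println(f.get("date")); // found via the sparse index
    }
}
```

In this sketch scan() plays the role of what a job reading through
SequenceFileInputFormat sees, and get() plays the role of a lookup through
MapFile.Reader. The real on-disk formats are of course more involved.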


-- 
Harsh J