Posted to mapreduce-user@hadoop.apache.org by Xiaobo Gu <gu...@gmail.com> on 2011/07/06 12:03:19 UTC

File format question when write map-reduce applications

Hi,
Does every block of a file in HDFS have to be in the same file format when
writing map-reduce applications? A more specific question: when
dealing with CSV files, can we have a header in the file? I have seen
Mahout applications using the UCI repository file format, which is
similar to CSV without a header. Is that because all map-reduce tasks
must run the same way, and having a header would make one map task
different from the others?

Regards,

Xiaobo Gu

Re: File format question when write map-reduce applications

Posted by Sean Owen <sr...@gmail.com>.

I think it's just CSV, but I don't know.

On Wed, Jul 6, 2011 at 11:32 AM, Xiaobo Gu <gu...@gmail.com> wrote:

> OK, that's why Mahout needs a file descriptor. And what's the difference
> between CSV and UCI?
>
>

Re: File format question when write map-reduce applications

Posted by Xiaobo Gu <gu...@gmail.com>.

OK, that's why Mahout needs a file descriptor. And what's the difference
between CSV and UCI?

On Wed, Jul 6, 2011 at 6:28 PM, Sean Owen <sr...@gmail.com> wrote:
> Yes, but, my point is that it doesn't quite make sense to do such a thing in
> MapReduce. Only one mapper will see the header, but, presumably all mappers
> need that info. If it's a bit of metadata, pass it in the Configuration
> object as a String. If it's a lot, put it in the DistributedCache (or on
> HDFS and pass the location for mappers to read).
>
> On Wed, Jul 6, 2011 at 11:23 AM, Xiaobo Gu <gu...@gmail.com> wrote:
>
>> Hi Sean,
>>
>>     Thanks for your reply. So we must write specific code to
>> handle the CSV header if we have it in the file, right?
>>
>> Xiaobo Gu
>>
>>
>

Re: File format question when write map-reduce applications

Posted by Sean Owen <sr...@gmail.com>.

Yes, but, my point is that it doesn't quite make sense to do such a thing in
MapReduce. Only one mapper will see the header, but, presumably all mappers
need that info. If it's a bit of metadata, pass it in the Configuration
object as a String. If it's a lot, put it in the DistributedCache (or on
HDFS and pass the location for mappers to read).
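
As a rough sketch of the first suggestion, written in plain Python rather
than against the real Hadoop API: the dictionary `conf` below only stands in
for the job's Configuration object, and the key `csv.header` and both
function names are made up for illustration. The driver stores the column
names once as a single string, and every mapper parses them from there, so
no mapper depends on finding the header line in its own split.

```python
def make_conf(column_names):
    # Driver side: stand-in for setting one String value on the job
    # configuration. The key name "csv.header" is hypothetical.
    return {"csv.header": ",".join(column_names)}


def map_records(conf, lines):
    # Mapper side: recover the column names from the configuration,
    # not from the input split.
    header_line = conf["csv.header"]
    columns = header_line.split(",")
    for line in lines:
        if line == header_line:
            # Only the mapper whose split starts the file sees this line.
            continue
        # Naive comma split: quoted fields are not handled in this sketch.
        yield dict(zip(columns, line.split(",")))


conf = make_conf(["id", "color"])
first_split = list(map_records(conf, ["id,color", "1,red"]))
later_split = list(map_records(conf, ["2,green"]))
```

Both calls produce rows keyed by column name, whether or not the header was
in the lines they were given. Note that comparing each line to the header
would also drop a data row that happens to be identical to it.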

On Wed, Jul 6, 2011 at 11:23 AM, Xiaobo Gu <gu...@gmail.com> wrote:

> Hi Sean,
>
>     Thanks for your reply. So we must write specific code to
> handle the CSV header if we have it in the file, right?
>
> Xiaobo Gu
>
>

Re: File format question when write map-reduce applications

Posted by Ted Dunning <te...@gmail.com>.

You need to handle it one way or another.

Note, however, that none of the UCI data sets is large enough to be split in
a map-reduce program.  If you are producing your own data, I would recommend
using something like Avro that is self-describing (like CSV), but which is
much more flexible.

On Wed, Jul 6, 2011 at 3:23 AM, Xiaobo Gu <gu...@gmail.com> wrote:

>     Thanks for your reply. So we must write specific code to
> handle the CSV header if we have it in the file, right?
>

Re: File format question when write map-reduce applications

Posted by Xiaobo Gu <gu...@gmail.com>.

Hi Sean,

     Thanks for your reply. So we must write specific code to
handle the CSV header if we have it in the file, right?

Xiaobo Gu



On Wed, Jul 6, 2011 at 6:11 PM, Sean Owen <sr...@gmail.com> wrote:
> A block is a piece of a file. It does not (necessarily) have a meaning, or a
> "file format", by itself. You would not address HDFS blocks individually
> from this level. So I suppose the first answer is, no, they do not have
> different formats, though the question is not well-formed.
>
> You can have whatever you like in whatever HDFS file you want. Your
> application (be it Mahout, or any MapReduce application) just needs to be
> prepared to read it. If your input is a CSV file with a header line, one
> mapper will read that first chunk with the header line. You don't know which
> mapper that will be. Only one will read it, so no, you would not construct a
> MapReduce app that depends on all mappers seeing some header line, because
> they don't.
>
> Yes, so, you would not observe any Mahout job doing this, because it doesn't
> work.
>
> On Wed, Jul 6, 2011 at 11:03 AM, Xiaobo Gu <gu...@gmail.com> wrote:
>
>> Hi,
>> Does every block of a file in HDFS have to be in the same file format when
>> writing map-reduce applications? A more specific question: when
>> dealing with CSV files, can we have a header in the file? I have seen
>> Mahout applications using the UCI repository file format, which is
>> similar to CSV without a header. Is that because all map-reduce tasks
>> must run the same way, and having a header would make one map task
>> different from the others?
>>
>> Regards,
>>
>> Xiaobo Gu
>>
>

Re: File format question when write map-reduce applications

Posted by Ted Dunning <te...@gmail.com>.

Of course, this is only true of the TextInputFormat.

You can write a CsvInputFormat in which every mapper reads the first line as
well as its assigned split.  This would cause some delay at the beginning
as all of the first round of mappers whack against the beginning of the
file, but that delay should be very short and the convenience of being able
to read standard CSV input would be significant.
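
A minimal sketch of that idea, simulated in plain Python over an in-memory
byte string instead of a real Hadoop InputFormat. The function name
`read_csv_split`, the sample data, and the 20-byte split size are
assumptions for illustration only. Each reader first reads the header from
offset 0, then reads only the rows that begin inside its own byte range.

```python
import io

DATA = (
    b"id,color,weight\n"
    b"1,red,10\n"
    b"2,green,20\n"
    b"3,blue,30\n"
    b"4,black,40\n"
)


def read_csv_split(data, start, length):
    """Yield one dict per data row owned by the range [start, start + length)."""
    stream = io.BytesIO(data)
    # Every reader reads the header line from the beginning of the file.
    columns = stream.readline().decode("utf-8").rstrip("\n").split(",")
    end = start + length
    if start > 0:
        # Back up one byte and discard through the next newline: a row that
        # begins exactly at `start` is kept, a partial row is dropped.
        stream.seek(start - 1)
        stream.readline()
    # A row belongs to this split if it begins before `end`, even when it
    # finishes past it.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        yield dict(zip(columns, line.decode("utf-8").rstrip("\n").split(",")))


splits = [(0, 20), (20, 20), (40, 17)]
rows_per_split = [list(read_csv_split(DATA, s, n)) for s, n in splits]
```

Every entry of `rows_per_split` is keyed by the column names, and each data
row shows up in exactly one split. The extra read of the first line is the
short startup cost mentioned above.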

On Wed, Jul 6, 2011 at 3:11 AM, Sean Owen <sr...@gmail.com> wrote:

>  If your input is a CSV file with a header line, one
> mapper will read that first chunk with the header line. You don't know
> which
> mapper that will be. Only one will read it, so no, you would not construct a
> MapReduce app that depends on all mappers seeing some header line, because
> they don't.
>

Re: File format question when write map-reduce applications

Posted by Sean Owen <sr...@gmail.com>.

A block is a piece of a file. It does not (necessarily) have a meaning, or a
"file format", by itself. You would not address HDFS blocks individually
from this level. So I suppose the first answer is, no, they do not have
different formats, though the question is not well-formed.

You can have whatever you like in whatever HDFS file you want. Your
application (be it Mahout, or any MapReduce application) just needs to be
prepared to read it. If your input is a CSV file with a header line, one
mapper will read that first chunk with the header line. You don't know which
mapper that will be. Only one will read it, so no, you would not construct a
MapReduce app that depends on all mappers seeing some header line, because
they don't.
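
One way to see this, as a plain Python simulation and not real Hadoop code:
the sketch below cuts a small CSV into byte-range splits and reads each one
with the usual line-oriented convention (a reader that does not start at
offset 0 skips its partial first line, and every reader finishes the line
that crosses its end). The name `read_split`, the sample data, and the
20-byte split size are assumptions for illustration.

```python
import io

DATA = (
    b"id,color,weight\n"
    b"1,red,10\n"
    b"2,green,20\n"
    b"3,blue,30\n"
    b"4,black,40\n"
)


def read_split(data, start, length):
    """Return the lines owned by the byte range [start, start + length)."""
    stream = io.BytesIO(data)
    end = start + length
    if start > 0:
        # Back up one byte and discard through the next newline, so a line
        # beginning exactly at `start` is kept and a partial line is dropped.
        stream.seek(start - 1)
        stream.readline()
    lines = []
    # A line belongs to this split if it begins before `end`.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


splits = [(0, 20), (20, 20), (40, 17)]
per_split = [read_split(DATA, s, n) for s, n in splits]
saw_header = ["id,color,weight" in lines for lines in per_split]
```

Only the first entry of `saw_header` is true: the split that starts at
offset 0 gets the header line, and the readers for the other two ranges
never see it.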

Yes, so, you would not observe any Mahout job doing this, because it doesn't
work.

On Wed, Jul 6, 2011 at 11:03 AM, Xiaobo Gu <gu...@gmail.com> wrote:

> Hi,
> Does every block of a file in HDFS have to be in the same file format when
> writing map-reduce applications? A more specific question: when
> dealing with CSV files, can we have a header in the file? I have seen
> Mahout applications using the UCI repository file format, which is
> similar to CSV without a header. Is that because all map-reduce tasks
> must run the same way, and having a header would make one map task
> different from the others?
>
> Regards,
>
> Xiaobo Gu
>
