Posted to common-user@hadoop.apache.org by Pierre ANCELOT <pi...@gmail.com> on 2010/05/10 14:21:45 UTC

Fully distribute TextInputFormat...

Hi folks :)
I have one big file. I read it with FileInputFormat, but this generates only
one task, so of course the work doesn't get distributed across the cluster
nodes.
Should I use another input class, or do I have a bug in my implementation?

The desired behavior is one task per line.

Thanks.



-- 
http://www.neko-consulting.com
Ego sum quis ego servo
"Je suis ce que je protège"
"I am what I protect"

Re: Fully distribute TextInputFormat...

Posted by Edward Capriolo <ed...@gmail.com>.
If you're curious, I found out this morning that NLineInputFormat has not
been ported to the new mapreduce API yet (it might be in trunk), so using
NLineInputFormat forces you into the older mapred API.

Edward

Re: Fully distribute TextInputFormat...

Posted by Ted Yu <yu...@gmail.com>.
NLineInputFormat seems to fit your need.
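
For readers unfamiliar with it: NLineInputFormat creates one input split per N lines of the file (N defaults to 1), and each split goes to its own map task. The sketch below is not Hadoop code. It is a plain-Python illustration of that grouping, using hypothetical names (`n_line_splits`, `lines_per_split`), and it pairs each line with its byte offset, which is how line-oriented Hadoop readers key their records.

```python
from typing import Iterator, List, Tuple

Record = Tuple[int, bytes]  # (byte offset of the line, line without its newline)


def n_line_splits(data: bytes, lines_per_split: int = 1) -> Iterator[List[Record]]:
    """Group the lines of `data` into splits of at most `lines_per_split` lines.

    Illustration only: each yielded list stands for the input of one map task.
    """
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be at least 1")

    split: List[Record] = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        split.append((offset, raw.rstrip(b"\r\n")))
        offset += len(raw)  # advance past the line and its terminator
        if len(split) == lines_per_split:
            yield split
            split = []
    if split:  # trailing split with fewer lines than requested
        yield split


if __name__ == "__main__":
    sample = b"first\nsecond\nthird\n"
    for task_id, records in enumerate(n_line_splits(sample, lines_per_split=1)):
        print(task_id, records)
```

With `lines_per_split=1` this gives the one-task-per-line behavior asked for; larger values mean fewer tasks with more lines each.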

Re: Fully distribute TextInputFormat...

Posted by Pierre ANCELOT <pi...@gmail.com>.
The idea is that I want to share the lines of the file equally between nodes...
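
One way to approximate an equal share, assuming the number of lines is known in advance and each line costs about the same to process, is to pick the number of lines per split so that the number of splits roughly matches the number of workers. The helper names below (`lines_per_split`, `split_sizes`) are hypothetical, and this is only the arithmetic: which node actually runs each task is up to the scheduler.

```python
from typing import List


def lines_per_split(total_lines: int, num_workers: int) -> int:
    """Smallest N such that N-line splits need no more than num_workers tasks."""
    if total_lines < 0 or num_workers < 1:
        raise ValueError("need total_lines >= 0 and num_workers >= 1")
    # Integer ceiling division, with a floor of 1 line per split.
    return max(1, -(-total_lines // num_workers))


def split_sizes(total_lines: int, n: int) -> List[int]:
    """Line counts of the splits produced when every split holds up to n lines."""
    full, rest = divmod(total_lines, n)
    return [n] * full + ([rest] if rest else [])


if __name__ == "__main__":
    n = lines_per_split(total_lines=1000, num_workers=8)
    print(n, split_sizes(1000, n))
```

When the line count does not divide evenly, the last split is shorter than the others, so the share is only approximately equal.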



Re: Fully distribute TextInputFormat...

Posted by Pierre ANCELOT <pi...@gmail.com>.
Plain raw ASCII text. Each line is one unit of work to process.



Re: Fully distribute TextInputFormat...

Posted by Alex Baranov <al...@gmail.com>.
I meant splitting a very large file to distribute it over multiple map tasks.

Alex.

http://sematext.com

Re: Fully distribute TextInputFormat...

Posted by himanshu chandola <hi...@yahoo.com>.
Actually, would you ever have a case where no splitting is needed? Just curious.

It seems that you would either use LZO or use no compression at all.

H

Re: Fully distribute TextInputFormat...

Posted by Alex Baranov <al...@gmail.com>.
If I'm not mistaken, LZO compression is the better choice when splitting is
needed, not gzip.

Alex Baranau

http://sematext.com
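
Why the compression format matters for splitting can be shown outside Hadoop. The sketch below uses only Python's standard library and hypothetical helper names (`first_full_line`, `can_decompress_from`). It contrasts plain text, where a reader dropped at an arbitrary byte offset can resynchronize at the next newline, with a gzip stream, which this naive attempt cannot decode from the middle. It works on in-memory data and says nothing about how any particular Hadoop codec is implemented.

```python
import gzip
import zlib

# Sample data: 1000 short ASCII lines, plus a gzip-compressed copy.
text = b"".join(b"line %d\n" % i for i in range(1000))
compressed = gzip.compress(text)


def first_full_line(data: bytes, offset: int) -> bytes:
    """Return the first complete line that starts at or after `offset`.

    Plain text can be entered anywhere: skip ahead to the next line start.
    """
    start = 0 if offset == 0 else data.index(b"\n", offset - 1) + 1
    end = data.index(b"\n", start)
    return data[start:end]


def can_decompress_from(blob: bytes, offset: int) -> bool:
    """Try to gzip-decode `blob` starting at `offset`; report whether it worked."""
    try:
        zlib.decompress(blob[offset:], 31)  # wbits=31 expects a gzip header
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    print(first_full_line(text, len(text) // 2))
    print(can_decompress_from(compressed, 0))
    print(can_decompress_from(compressed, len(compressed) // 2))
```

A reader that cannot start in the middle has to process the file from the beginning, which is why a single gzip file ends up in a single map task.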

Re: Fully distribute TextInputFormat...

Posted by Jeff Zhang <zj...@gmail.com>.
What's the format of this file? gzip cannot be split.



-- 
Best Regards

Jeff Zhang