Posted to common-user@hadoop.apache.org by maha <ma...@umail.ucsb.edu> on 2011/02/08 00:38:12 UTC

Quick Question: LineSplit or BlockSplit

Hi,

  I would appreciate your thoughts on whether efficiency is affected if:

  1) Mappers were per line in a document
 
  or 

  2) Mappers were per block of lines in a document.


 The obvious difference I can see is that (1) has more mappers. Does that mean (1) will be slower because of scheduling time?

Thank you,
Maha
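
A rough way to reason about the scheduling question is a back-of-the-envelope cost model. The sketch below is illustrative only: the class MapperCostModel, the slot count, the per-task startup cost, and the per-line cost are all hypothetical values chosen for the example, not measurements from any cluster. It assumes tasks run in waves across a fixed number of parallel slots and that each task pays a fixed startup cost before doing its per-line work.

```java
final class MapperCostModel {

    // Estimated wall-clock time in milliseconds. Tasks run in waves of
    // `slots` parallel tasks; every task is assumed to be full, so the
    // result is a slight overestimate when the last task is short.
    static long estimateMillis(long totalLines, long linesPerTask, int slots,
                               long taskOverheadMs, long perLineMs) {
        long tasks = (totalLines + linesPerTask - 1) / linesPerTask; // ceiling
        long waves = (tasks + slots - 1) / slots;                    // ceiling
        return waves * (taskOverheadMs + linesPerTask * perLineMs);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 10,000 lines, 10 slots,
        // 1 s startup per task, 1 ms of work per line.
        long perLine = estimateMillis(10_000, 1, 10, 1_000, 1);
        long perBlock = estimateMillis(10_000, 1_000, 10, 1_000, 1);
        System.out.println("one mapper per line:        " + perLine + " ms");
        System.out.println("one mapper per 1,000 lines: " + perBlock + " ms");
    }
}
```

Under these assumed numbers, the per-line layout spends almost all of its time on task startup rather than on the lines themselves, which is the usual argument for handing each mapper a block of lines.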
 

Re: Quick Question: LineSplit or BlockSplit

Posted by maha <ma...@umail.ucsb.edu>.
Thanks, Ted. Then I will have to write my own InputFormat to read a block of lines per mapper.

 NLineInputFormat didn't work for me; a working example of it would be appreciated.

Thanks again,

Maha
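
As a point of reference for the block-of-lines idea, the sketch below models in plain Java what a block-of-lines input format has to do: cut the input lines into consecutive groups of N, one group per mapper. It is not Hadoop code, and the names LineBlocks and blocksOf are made up for illustration. In Hadoop's newer org.apache.hadoop.mapreduce API, the corresponding setting should be NLineInputFormat.setNumLinesPerSplit(job, n); that is worth checking against the Hadoop version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineBlocks {

    // Cuts `lines` into consecutive blocks of at most `linesPerBlock` lines.
    // The last block is shorter when the line count is not a multiple of N.
    static List<List<String>> blocksOf(List<String> lines, int linesPerBlock) {
        if (linesPerBlock < 1) {
            throw new IllegalArgumentException("linesPerBlock must be >= 1");
        }
        List<List<String>> blocks = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerBlock) {
            int end = Math.min(start + linesPerBlock, lines.size());
            blocks.add(new ArrayList<>(lines.subList(start, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // Each printed block stands for the input of one mapper.
        for (List<String> block : blocksOf(lines, 2)) {
            System.out.println(block);
        }
    }
}
```

With N set to 1 the same grouping degenerates to option (1), one mapper per line.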





On Feb 7, 2011, at 6:32 PM, Mark Kerzner wrote:

> Thanks!
> Mark


Re: Quick Question: LineSplit or BlockSplit

Posted by Mark Kerzner <ma...@gmail.com>.
Thanks!
Mark

On Mon, Feb 7, 2011 at 8:28 PM, Ted Dunning <td...@maprtech.com> wrote:

> That is quite doable.  One way to do it is to make the max split size quite
> small.

Re: Quick Question: LineSplit or BlockSplit

Posted by Ted Dunning <td...@maprtech.com>.
That is quite doable.  One way to do it is to make the max split size quite
small.
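
To see why a small max split size produces more mappers, it helps to look at how the split size is usually derived. The sketch below assumes the rule used by the newer-API FileInputFormat, where the split size is max(minSize, min(maxSize, blockSize)), and it ignores details such as the slack allowed on the final split. SplitMath and its method names are hypothetical; in a real job the maximum would be set with something like FileInputFormat.setMaxInputSplitSize(job, bytes).

```java
final class SplitMath {

    // Split size rule assumed here: clamp the block size between min and max.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (and so map tasks) for one file.
    static long splitCount(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 128 * mb;   // hypothetical input file
        long blockSize = 64 * mb;     // hypothetical HDFS block size

        long defaultSize = splitSize(blockSize, 1, Long.MAX_VALUE);
        long smallSize = splitSize(blockSize, 1, 1 * mb);

        System.out.println("no max set: " + splitCount(fileLength, defaultSize) + " splits");
        System.out.println("1 MB max:   " + splitCount(fileLength, smallSize) + " splits");
    }
}
```

Splits are byte ranges rather than line counts, so when lines differ in length a small max split size gives roughly, not exactly, one line per mapper.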

On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner <ma...@gmail.com> wrote:

> Ted,
>
> I am also interested in this answer.
>
> I put the name of a zip file on a line in an input file, and I want one
> mapper to read this line, and start working on it (since it now knows the
> path in HDFS). Are you saying it's not doable?
>
> Thank you,
> Mark

Re: Quick Question: LineSplit or BlockSplit

Posted by Mark Kerzner <ma...@gmail.com>.
Ted,

I am also interested in this answer.

I put the name of a zip file on a line in an input file, and I want one
mapper to read this line, and start working on it (since it now knows the
path in HDFS). Are you saying it's not doable?

Thank you,
Mark

On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning <td...@maprtech.com> wrote:

> Option (1) isn't the way that things normally work.  Besides, a mapper's map
> method is called many times for each mapper object that is constructed.

Re: Quick Question: LineSplit or BlockSplit

Posted by Ted Dunning <td...@maprtech.com>.
Option (1) isn't the way that things normally work.  Besides, a mapper's map
method is called many times for each mapper object that is constructed.
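
The point about a mapper being constructed once and then called many times can be sketched without Hadoop. The plain-Java model below mimics the usual task lifecycle (construct, set up once, call map once per record, clean up once). MapperLifecycle and CountingMapper are hypothetical names, and the real framework passes keys, values, and a context object rather than bare strings.

```java
import java.util.Arrays;
import java.util.List;

final class MapperLifecycle {

    // Stand-in for a mapper that only counts how often each hook runs.
    static final class CountingMapper {
        int setupCalls;
        int mapCalls;
        int cleanupCalls;

        void setup() { setupCalls++; }
        void map(String line) { mapCalls++; }
        void cleanup() { cleanupCalls++; }
    }

    // Models one map task: one mapper object handles every line of its split.
    static CountingMapper runTask(List<String> splitLines) {
        CountingMapper mapper = new CountingMapper(); // constructed once per task
        mapper.setup();
        for (String line : splitLines) {
            mapper.map(line);                         // called once per record
        }
        mapper.cleanup();
        return mapper;
    }

    public static void main(String[] args) {
        CountingMapper m = runTask(Arrays.asList("line 1", "line 2", "line 3"));
        System.out.println("setup=" + m.setupCalls
                + " map=" + m.mapCalls
                + " cleanup=" + m.cleanupCalls);
    }
}
```

In this model the number of mapper objects follows the number of splits, while the number of map calls follows the number of lines.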

On Mon, Feb 7, 2011 at 3:38 PM, maha <ma...@umail.ucsb.edu> wrote:

> Hi,
>
>  I would appreciate your thoughts on whether efficiency is affected if:
>
>  1) Mappers were per line in a document
>
>  or
>
>  2) Mappers were per block of lines in a document.
>
>
>  The obvious difference I can see is that (1) has more mappers. Does that
> mean (1) will be slower because of scheduling time?
>
> Thank you,
> Maha
>