Posted to mapreduce-user@hadoop.apache.org by Something Something <ma...@gmail.com> on 2011/11/17 05:23:13 UTC

Business logic in cleanup?

Is the idea of writing business logic in the cleanup() method of a Mapper good
or bad?  We think we can make our Mapper run faster if we keep accumulating
data in a HashMap in the Mapper, and then write it out in the cleanup() method.
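
To be concrete, here is a rough sketch of the kind of mapper we have in mind,
assuming the new org.apache.hadoop.mapreduce API (the class and key names are
made up for this example; the real job is more involved):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Word-count-style mapper that buffers partial sums in a HashMap and
    // emits them only once, from cleanup(), instead of on every map() call.
    public class AggregatingMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      private final Map<String, Long> counts = new HashMap<String, Long>();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Accumulate locally instead of calling context.write() per record.
        for (String token : value.toString().split("\\s+")) {
          Long previous = counts.get(token);
          counts.put(token, previous == null ? 1L : previous + 1L);
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        // Flush the buffered results once, after the last map() call.
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
          context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
      }
    }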

1)  Does the Map/Reduce paradigm guarantee that cleanup() will always be called
before the reducer starts?
2)  Is cleanup strictly for cleaning up unneeded resources?
3)  We understand that the HashMap can grow & that could cause memory
issues, but hypothetically let's say the memory requirements
were manageable.

Please let me know.  Thanks.

Re: Business logic in cleanup?

Posted by Harsh J <ha...@cloudera.com>.
Arun,

On 19-Nov-2011, at 12:16 AM, Arun C Murthy wrote:

> 
> On Nov 18, 2011, at 10:44 AM, Harsh J wrote:
>> 
>> If you could follow up on that patch and see it through, it's a wish granted for a lot of us as well, as we move ahead with the newer APIs in future Hadoop releases ;-)
>> 
> 
> The plan is to support both the mapred and mapreduce MR APIs for the foreseeable future.

That is surely good news (and I do know we are going to). But there may be some clarity needed here.

I reckon this is primarily to avoid breakage, among other reasons, but it helps new developers to know which one to choose when they start out (say, enforced via deprecation as we tried before, or via a documentation note?).

I personally think it's good enough if we recommend one and simply support the other (via regular deprecation periods, documentation notes, or other ways you can think of).

One big source of confusion _today_ is when a user determines he's supposed to use the stable MR APIs, and then tries out HBase, where they support only the newer ones as they roll ahead. Some other downstream projects also can't afford to maintain both the way we can.
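
To make the contrast concrete for anyone landing on this thread while choosing,
here is a minimal old-API mapper (a sketch only, not taken from any project;
the new-API counterpart with setup()/map()/cleanup() hooks looks like the
sketch near the top of this thread):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Old ("stable") API: Mapper is an interface, and the lifecycle hooks are
    // configure(JobConf) and close() rather than setup() and cleanup().
    public class OldApiTokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);

      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        for (String token : value.toString().split("\\s+")) {
          output.collect(new Text(token), ONE);
        }
      }

      // close() inherited from MapReduceBase is the old-API counterpart of
      // cleanup(); override it if you need an end-of-task hook here.
    }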



Re: Business logic in cleanup?

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Nov 18, 2011, at 10:44 AM, Harsh J wrote:
> 
> If you could follow up on that patch and see it through, it's a wish granted for a lot of us as well, as we move ahead with the newer APIs in future Hadoop releases ;-)
> 

The plan is to support both the mapred and mapreduce MR APIs for the foreseeable future.

Arun



Re: Business logic in cleanup?

Posted by Harsh J <ha...@cloudera.com>.
I believe it's been discussed on the Avro user lists before, but here's what you want: https://issues.apache.org/jira/browse/AVRO-593

If you could follow up on that patch and see it through, it's a wish granted for a lot of us as well, as we move ahead with the newer APIs in future Hadoop releases ;-)



Re: Business logic in cleanup?

Posted by Something Something <ma...@gmail.com>.
Thanks again.  Will look at Mapper.run to understand better.  Actually,
just a few minutes ago I got the AVROMapper to work (which will read from
AVRO files). This will hopefully improve performance even more.

Interesting, AVROMapper doesn't extend from Mapper, so it doesn't have the
'cleanup' method.  Instead it provides a 'close' method, which seems to
behave the same way.  Honestly, I like the method name 'close' better than
'cleanup'.

Doug - Is there a reason you chose to not extend
from org/apache/hadoop/mapreduce/Mapper?

Thank you all for your help.



Re: Business logic in cleanup?

Posted by Harsh J <ha...@cloudera.com>.
Given that you are sure about it, and you also know why that's the
case, I'd definitely write inside the cleanup(…) hook. No harm at all
in doing that.

Take a look at the mapreduce.Mapper#run(…) method in the source and you'll
understand what I mean by it not being a stage or even an event, but
just a tail call after all the map() calls are done.
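
For reference, the default run() is essentially just the following loop
(paraphrased here from memory, so please check the Mapper source of your
release for the exact body):

    // Rough paraphrase of org.apache.hadoop.mapreduce.Mapper#run(Context):
    // cleanup() is simply the last call made once the record loop finishes,
    // not a separately scheduled framework "stage".
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
      cleanup(context);
    }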

-- 
Harsh J

Re: Business logic in cleanup?

Posted by Something Something <ma...@gmail.com>.
Thanks again for the clarification.  Not sure what you mean by it's not a
'stage'!  Okay, maybe not a stage, but I think of it as an 'Event', such
as 'Mouseover' or 'Mouseout'.  The 'cleanup' is really a 'MapperCompleted'
event, right?

The confusion comes from the name of this method.  The name 'cleanup' makes me
think it should not really be used as 'mapperCompleted', but it appears
there's no harm in using it that way.

Here's our dilemma: when we use (local) caching in the Mapper and write in
the 'cleanup', our job completes in 18 minutes.  When we don't write in
'cleanup', it takes 3 hours!  Knowing this, if you were the one deciding,
would you use 'cleanup' for this purpose?

Thanks once again for your advice.



Re: Business logic in cleanup?

Posted by Harsh J <ha...@cloudera.com>.
Hello,

On Fri, Nov 18, 2011 at 10:44 AM, Something Something
<ma...@gmail.com> wrote:
> Thanks for the reply.  Here's another concern we have.  Let's say the Mapper
> has finished processing 1000 lines from the input file and then the machine
> goes down.  I believe Hadoop is smart enough to re-distribute the input split
> that was assigned to this Mapper, correct?  After re-assignment, will it skip
> the 1000 lines that were processed successfully before and start from line
> 1001, OR would it reprocess ALL lines?

Attempts of any task start afresh. That's the default nature of Hadoop.

So, it would begin from the start again and hence reprocess ALL lines.
Understand that cleanup is just a fancy API call here that's made
after the input reader completes - not a "stage".

-- 
Harsh J

Re: Business logic in cleanup?

Posted by Something Something <ma...@gmail.com>.
Thanks for the reply.  Here's another concern we have.  Let's say the Mapper
has finished processing 1000 lines from the input file and then the machine
goes down.  I believe Hadoop is smart enough to re-distribute the input
split that was assigned to this Mapper, correct?  After re-assignment, will
it skip the 1000 lines that were processed successfully before and start
from line 1001, OR would it reprocess ALL lines?




Re: Business logic in cleanup?

Posted by Harsh J <ha...@cloudera.com>.
I'm sure you understand all the implications here, so I'll just answer your
questions inline.

On Thu, Nov 17, 2011 at 9:53 AM, Something Something
<ma...@gmail.com> wrote:
> Is the idea of writing business logic in the cleanup() method of a Mapper good
> or bad?  We think we can make our Mapper run faster if we keep accumulating
> data in a HashMap in the Mapper, and then write it out in the cleanup() method.

You can certainly write it during the cleanup() call. Streams are only
closed after that's done, so no issues framework-wise.

> 1)  Does the Map/Reduce paradigm guarantee that cleanup() will always be called
> before the reducer starts?

Reducers start reducing only after all Map Tasks have completed (all
tasks, across the whole job). So, yes, this is guaranteed.

> 2)  Is cleanup strictly for cleaning up unneeded resources?

Yes, it was provided for that purpose.

> 3)  We understand that the HashMap can grow & that could cause memory
> issues, but hypothetically let's say the memory requirements
> were manageable.

You are also pushing the whole write load to after all the reads; otherwise
the writes stay interleaved roughly 1:1 with the reads.

P.S. Perhaps try overriding Mapper#run if you'd like complete control
over how a Mapper executes its stages.
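
A rough sketch of what such an override could look like against the new API
(hypothetical class name and types; adjust them to your job):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CustomRunMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      @Override
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        long processed = 0;
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
          processed++;
          if (processed % 100000 == 0) {
            context.setStatus("processed " + processed + " records");
          }
        }
        // Any "mapper completed" business logic can go right here,
        // before (or instead of) whatever you keep in cleanup().
        cleanup(context);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));   // placeholder per-record logic
      }
    }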

-- 
Harsh J