Posted to user@crunch.apache.org by David Ortiz <do...@videologygroup.com> on 2015/07/27 23:58:37 UTC

Force new Map phase

Hey,

     Are there any easy tricks to force a new map stage to kick off?  I know I can force a reduce with GBK operations, but one of our jobs is suffering from data skew: a couple of hot keys join properly, but when the follow-up processing that comes before the next join runs, the reducer hits the GC Overhead Limit.  Based on the dot file, the planner is doing all of the preprocessing for the next join inside the reducer of the first join.  It could easily do that work in the map phase before the next join without any issues, and I think that would also get us past the memory problem.  The only workaround I've come up with so far is to do everything up to the first join, call pipeline.done(), and then add the remaining operations before another pipeline.done() call.
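
Roughly, the pipeline looks like the sketch below (paths, types, and the MyJob / ParseLeftFn / ParseRightFn / PrepForNextJoinFn names are simplified placeholders, not our actual code):

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.lib.Join;
    import org.apache.crunch.types.writable.Writables;

    Pipeline pipeline = new MRPipeline(MyJob.class);

    PTable<String, String> left = pipeline.readTextFile("left-input")
        .parallelDo(new ParseLeftFn(),
            Writables.tableOf(Writables.strings(), Writables.strings()));
    PTable<String, String> right = pipeline.readTextFile("right-input")
        .parallelDo(new ParseRightFn(),
            Writables.tableOf(Writables.strings(), Writables.strings()));

    // First join: a couple of hot keys make the reduce side heavy.
    PTable<String, Pair<String, String>> joined = Join.join(left, right);

    // Preprocessing for the second join. The planner runs this inside the
    // reducer of the first join, which is where we hit the GC Overhead Limit.
    PTable<String, String> prepped = joined.parallelDo(new PrepForNextJoinFn(),
        Writables.tableOf(Writables.strings(), Writables.strings()));

    // Second join, write, run.
    PTable<String, Pair<String, String>> result = Join.join(prepped, right);
    pipeline.writeTextFile(result, "output");
    pipeline.done();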

Thanks,
    Dave
This email is intended only for the use of the individual(s) to whom it is addressed. If you have received this communication in error, please immediately notify the sender and delete the original email.

Re: Force new Map phase

Posted by David Ortiz <dp...@gmail.com>.
*in the morning

On Mon, Jul 27, 2015, 6:45 PM David Ortiz <dp...@gmail.com> wrote:

> I'll give that a try on your morning.  Thanks.
>
> On Mon, Jul 27, 2015, 6:02 PM Josh Wills <jw...@cloudera.com> wrote:
>
>> Hey David,
>>
>> The easiest way is to insert a PCollection.cache() call at the stage
>> between the two joins where you think the reduce phase should end and the
>> next map phase should begin. When the Crunch planner makes the decision of
>> where to split the work between a reducer/mapper, it tries to respect any
>> explicit cache() calls that it encounters.
>>
>> Josh
>>
>> On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz <do...@videologygroup.com>
>> wrote:
>>
>>>  Hey,
>>>
>>>
>>>
>>>      Are there any easy tricks to force a new map stage to kick off?  I
>>> know I can force a reduce with GBK operations, but I am running into an
>>> issue where one of our jobs is having issues with data skew, and from what
>>> I can tell, the issue is we are getting a couple hot keys that join
>>> properly, but then when trying to do the follow up processing that comes
>>> before the next join, the reducer hits the GC Overhead Limit.  Based on the
>>> dot file, it is trying to do all the preprocessing for the next join in the
>>> reducer from the first join, but it could easily do it in the map phase
>>> before the next join in the pipeline without any issues, and I think this
>>> would also get past the issue we’re having with memory.  The only solution
>>> I could think of to try and do this at the moment, is to do everything up
>>> to the first join, call pipeline.done(), then add some more operations
>>> before another pipeline.done() operation.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>     Dave
>>>  *This email is intended only for the use of the individual(s) to whom
>>> it is addressed. If you have received this communication in error, please
>>> immediately notify the sender and delete the original email.*
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>

Re: Force new Map phase

Posted by David Ortiz <dp...@gmail.com>.
I'll give that a try on your morning.  Thanks.

On Mon, Jul 27, 2015, 6:02 PM Josh Wills <jw...@cloudera.com> wrote:

> Hey David,
>
> The easiest way is to insert a PCollection.cache() call at the stage
> between the two joins where you think the reduce phase should end and the
> next map phase should begin. When the Crunch planner makes the decision of
> where to split the work between a reducer/mapper, it tries to respect any
> explicit cache() calls that it encounters.
>
> Josh
>
> On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz <do...@videologygroup.com>
> wrote:
>
>>  Hey,
>>
>>
>>
>>      Are there any easy tricks to force a new map stage to kick off?  I
>> know I can force a reduce with GBK operations, but I am running into an
>> issue where one of our jobs is having issues with data skew, and from what
>> I can tell, the issue is we are getting a couple hot keys that join
>> properly, but then when trying to do the follow up processing that comes
>> before the next join, the reducer hits the GC Overhead Limit.  Based on the
>> dot file, it is trying to do all the preprocessing for the next join in the
>> reducer from the first join, but it could easily do it in the map phase
>> before the next join in the pipeline without any issues, and I think this
>> would also get past the issue we’re having with memory.  The only solution
>> I could think of to try and do this at the moment, is to do everything up
>> to the first join, call pipeline.done(), then add some more operations
>> before another pipeline.done() operation.
>>
>>
>>
>> Thanks,
>>
>>     Dave
>>  *This email is intended only for the use of the individual(s) to whom
>> it is addressed. If you have received this communication in error, please
>> immediately notify the sender and delete the original email.*
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

Re: Force new Map phase

Posted by Josh Wills <jw...@cloudera.com>.
Hey David,

The easiest way is to insert a PCollection.cache() call at the stage
between the two joins where you think the reduce phase should end and the
next map phase should begin. When the Crunch planner makes the decision of
where to split the work between a reducer/mapper, it tries to respect any
explicit cache() calls that it encounters.
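
In your case that would look something like the sketch below (the collection names and PrepForNextJoinFn are just stand-ins for whatever sits between your two joins):

    // First join: the reduce phase you want to end here.
    PTable<String, Pair<String, String>> joined = Join.join(left, right);

    // Mark the joined output to be cached; the planner tries to respect
    // explicit cache() calls when deciding where the reduce phase ends
    // and the next map phase begins...
    joined.cache();

    // ...so the preprocessing for the second join runs in the following
    // map phase instead of in the first join's reducer.
    PTable<String, String> prepped = joined.parallelDo(new PrepForNextJoinFn(),
        Writables.tableOf(Writables.strings(), Writables.strings()));

    PTable<String, Pair<String, String>> result = Join.join(prepped, otherTable);

That also keeps everything in a single Crunch run, as opposed to splitting the work across two pipeline.done() calls.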

Josh

On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz <do...@videologygroup.com>
wrote:

>  Hey,
>
>
>
>      Are there any easy tricks to force a new map stage to kick off?  I
> know I can force a reduce with GBK operations, but I am running into an
> issue where one of our jobs is having issues with data skew, and from what
> I can tell, the issue is we are getting a couple hot keys that join
> properly, but then when trying to do the follow up processing that comes
> before the next join, the reducer hits the GC Overhead Limit.  Based on the
> dot file, it is trying to do all the preprocessing for the next join in the
> reducer from the first join, but it could easily do it in the map phase
> before the next join in the pipeline without any issues, and I think this
> would also get past the issue we’re having with memory.  The only solution
> I could think of to try and do this at the moment, is to do everything up
> to the first join, call pipeline.done(), then add some more operations
> before another pipeline.done() operation.
>
>
>
> Thanks,
>
>     Dave
>  *This email is intended only for the use of the individual(s) to whom it
> is addressed. If you have received this communication in error, please
> immediately notify the sender and delete the original email.*
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>