
Posted to user@crunch.apache.org by Allan Shoup <al...@gmail.com> on 2014/09/26 03:10:48 UTC

Reliably Parallelizing CPU-Intensive DoFns

I have a very CPU-intensive DoFn which is running over a relatively small
input. Running on a Hadoop cluster, the job that it is run in sometimes
executes the function in map tasks and sometimes in reduce tasks. What's
the best way to reliably increase parallelization?

One option may be to force a reduce step and control the number of
reducers. Are there any better options?

Re: Reliably Parallelizing CPU-Intensive DoFns

Posted by "Brush,Ryan" <RB...@CERNER.COM>.

There be dragons, but in years past I solved a similar problem with the MultithreadedMapper [1], and it would be possible to do something similar in a DoFn implementation. Basically, you can read multiple inputs and farm them off to threads, then synchronize and flush after N items are processed, and do a final flush to the emitter in the cleanup(…) method.

There are lots of pitfalls to managing your own threads, of course. You’d need to detach incoming values passed to the DoFn so they don’t get clobbered by other threads, it could fight against Hadoop’s resource management (since Hadoop wants to manage how many threads are running), and writing multi-threaded code is pretty terrible in general. But it’s an option at least.
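The batching pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Crunch code: Sink is a hypothetical stand-in for the emitter, BatchingParallelProcessor is a made-up name, and its process and cleanup methods merely mirror the DoFn methods of the same names. Inside a real DoFn the input would also need to be detached (copied) before being handed to another thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical stand-in for an output sink such as an emitter. */
interface Sink<T> {
    void emit(T value);
}

/**
 * Runs an expensive function on a thread pool and emits results from the
 * calling thread only, flushing after every batchSize inputs.
 */
final class BatchingParallelProcessor<S, T> {
    private final Function<S, T> expensiveFn;
    private final int batchSize;
    private final ExecutorService pool;
    private final List<Future<T>> pending = new ArrayList<>();

    BatchingParallelProcessor(Function<S, T> expensiveFn, int threads, int batchSize) {
        this.expensiveFn = expensiveFn;
        this.batchSize = batchSize;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Plays the role of process(...): queue the work, flush once N items are queued. */
    void process(S input, Sink<T> sink) {
        // Assumption: input is safe to share. A framework that reuses input
        // objects would require a detached copy to be taken here first.
        Callable<T> task = () -> expensiveFn.apply(input);
        pending.add(pool.submit(task));
        if (pending.size() >= batchSize) {
            flush(sink);
        }
    }

    /** Plays the role of cleanup(...): final flush, then stop the pool. */
    void cleanup(Sink<T> sink) {
        try {
            flush(sink);
        } finally {
            pool.shutdown();
        }
    }

    /** Waits for all queued tasks and emits their results in input order. */
    private void flush(Sink<T> sink) {
        try {
            for (Future<T> result : pending) {
                sink.emit(result.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Worker task failed", e.getCause());
        } finally {
            pending.clear();
        }
    }
}
```

Only the calling thread ever touches the sink, so the sink itself does not need to be thread-safe. The pool size and batch size are tuning knobs, and the thread count still competes with whatever Hadoop has allotted to the task.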

[1]
https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.html

On Sep 25, 2014, at 11:03 PM, Allan Shoup <al...@gmail.com> wrote:

I failed to mention that I don't have an opportunity to read the source - my input is a PTable of Avro keys and values.

On Thu, Sep 25, 2014 at 8:48 PM, Josh Wills <jo...@gmail.com> wrote:
NLineSource, to control how many shards the small input is split up into?

On Thu, Sep 25, 2014 at 6:10 PM, Allan Shoup <al...@gmail.com> wrote:
I have a very CPU-intensive DoFn which is running over a relatively small input. Running on a Hadoop cluster, the job that it is run in sometimes executes the function in map tasks and sometimes in reduce tasks. What's the best way to reliably increase parallelization?

One option may be to force a reduce step and control the number of reducers. Are there any better options?

Re: Reliably Parallelizing CPU-Intensive DoFns

Posted by Allan Shoup <al...@gmail.com>.

I failed to mention that I don't have an opportunity to read the source
- my input is a PTable of Avro keys and values.

On Thu, Sep 25, 2014 at 8:48 PM, Josh Wills <jo...@gmail.com> wrote:

> NLineSource, to control how many shards the small input is split up into?
>
> J
>
> On Thu, Sep 25, 2014 at 6:10 PM, Allan Shoup <al...@gmail.com>
> wrote:
>
>> I have a very CPU-intensive DoFn which is running over a relatively small
>> input. Running on a Hadoop cluster, the job that it is run in sometimes
>> executes the function in map tasks and sometimes in reduce tasks. What's
>> the best way to reliably increase parallelization?
>>
>> One option may be to force a reduce step and control the number of
>> reducers. Are there any better options?
>>
>
>

Re: Reliably Parallelizing CPU-Intensive DoFns

Posted by Josh Wills <jo...@gmail.com>.

NLineSource, to control how many shards the small input is split up into?

J

On Thu, Sep 25, 2014 at 6:10 PM, Allan Shoup <al...@gmail.com> wrote:

> I have a very CPU-intensive DoFn which is running over a relatively small
> input. Running on a Hadoop cluster, the job that it is run in sometimes
> executes the function in map tasks and sometimes in reduce tasks. What's
> the best way to reliably increase parallelization?
>
> One option may be to force a reduce step and control the number of
> reducers. Are there any better options?
>