Posted to user@crunch.apache.org by Everett Anderson <ev...@nuna.com> on 2015/07/22 19:42:59 UTC

Crunch Spark pipeline seems to stall?

Hi,

I have a fairly complex Crunch pipeline with many joins and multiple inputs
and outputs. I've been using the MRPipeline on AWS with EMR/Hadoop
successfully, but was curious to try out the SparkPipeline.

I'm using Crunch 0.12.0 and tried Spark 1.4.0 with 25 core instances.
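
For reference, the swap itself was small; roughly the following, where the
master URL, app name, and MyApp are simplified placeholders rather than my
actual code:

    // Rough sketch of the swap; master URL, app name, and MyApp are
    // placeholders.
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;

    // Before, on EMR/Hadoop:
    //   Pipeline pipeline = new MRPipeline(MyApp.class, getConf());
    // After -- the rest of the pipeline code is unchanged:
    Pipeline pipeline = new SparkPipeline("yarn-client", "my-app");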

Spark seemed to run one small part of the pipeline successfully, but then
stalled: the UI showed that all 16 submitted jobs had succeeded, yet no
further jobs were submitted. The run never terminated, and all the workers
appeared idle.

Has anyone seen something like that before? Is there a configuration
parameter that controls how many jobs Crunch will submit to Spark?

Thanks!

- Everett


Re: Crunch Spark pipeline seems to stall?

Posted by Everett Anderson <ev...@nuna.com>.
Thanks, Josh! I liked the articles and will give the suggestions a shot!


Re: Crunch Spark pipeline seems to stall?

Posted by Josh Wills <jw...@cloudera.com>.
Hey Everett,

I would ideally swing by to see the pipeline in person, but I'm traveling
all over the place for the next couple of weeks. In my experience, the most
common cause of job stalls in Spark is a very large shuffle where not
enough tasks are allocated to do the work the job requires. This series of
blog posts is worth a read:

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

To get more tasks going on a Crunch-on-Spark pipeline, I think your best
options are to:

1) Disable input-file combining by setting crunch.disable.combine.file to
true in the conf, and/or
2) Increase the parallelism of your jobs, either by manually increasing
the number of "reducers" in a GBK or by lowering
crunch.bytes.per.reduce.task so that the automatically computed partition
count is higher (see the sketch below).
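
Something like the following sketches both knobs; this is untested, and the
master URL, paths, key-extraction logic, and sizes are placeholders, not
taken from your pipeline:

    import org.apache.crunch.MapFn;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.To;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    Pipeline pipeline = new SparkPipeline("yarn-client", "tuning-sketch");
    Configuration conf = pipeline.getConfiguration();

    // 1) Keep small input files as separate splits/tasks instead of
    //    combining them.
    conf.setBoolean("crunch.disable.combine.file", true);

    // 2) Lower the target bytes per reduce task so the planner picks a
    //    higher partition count on its own...
    conf.setLong("crunch.bytes.per.reduce.task", 256L * 1024L * 1024L);

    // ...and/or pin the partition count explicitly at a given GBK:
    PTable<String, String> keyed = pipeline.readTextFile("in")
        .by(new MapFn<String, String>() {
          @Override
          public String map(String line) {
            return line.split("\t")[0];  // placeholder key extraction
          }
        }, Writables.strings());
    keyed.groupByKey(500)  // explicit number of partitions
        .ungroup()
        .write(To.textFile("out"));
    pipeline.done();

The right partition count depends on your data sizes; the tuning posts
above walk through how to pick it.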

J

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>