Posted to user@crunch.apache.org by Surbhi Mungre <mu...@gmail.com> on 2015/09/02 23:17:54 UTC

JoinStrategy with Spark pipeline

I was trying to determine the effect of changing the JoinStrategy on a Spark
pipeline. My pipeline works fine with DefaultJoinStrategy, but I could not
get it working with MapSideJoinStrategy or BloomFilterJoinStrategy. With
MapSideJoinStrategy I get an exception[1] on the driver itself, and with
BloomFilterJoinStrategy I get exceptions[2] in one of the stages. I have not
made any configuration changes, but I did run tests with datasets of
different sizes to ensure that my PCollection is small enough to fit in
memory. I am running Spark in yarn-client mode with Crunch 0.11.0-cdh5.4.2.

[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff

Thanks,
Surbhi
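
For context on the "small enough to fit in memory" requirement mentioned
above: a map-side join generally loads the smaller input fully into an
in-memory lookup table and streams the larger input past it, so no shuffle of
the large side is needed. The following is a minimal plain-Java sketch of that
idea only. It does not use the Crunch API, and every name in it
(MapSideJoinSketch, Joined, innerJoin) is hypothetical and chosen for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Crunch.
class MapSideJoinSketch {

    /** One joined record: key, value from the large side, value from the small side. */
    static final class Joined {
        final String key;
        final String left;
        final String right;

        Joined(String key, String left, String right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return key + ":" + left + "," + right;
        }
    }

    /**
     * Inner join of two lists of {key, value} pairs.
     * The small side is held entirely in a hash map (this is the part that
     * must fit in memory); the large side is then scanned once against it.
     */
    static List<Joined> innerJoin(List<String[]> large, List<String[]> small) {
        // Build the in-memory lookup table from the small side.
        Map<String, List<String>> lookup = new HashMap<>();
        for (String[] kv : small) {
            lookup.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        // Stream the large side and emit one record per matching small-side value.
        List<Joined> out = new ArrayList<>();
        for (String[] kv : large) {
            List<String> matches = lookup.get(kv[0]);
            if (matches == null) {
                continue; // inner join: unmatched keys are dropped
            }
            for (String m : matches) {
                out.add(new Joined(kv[0], kv[1], m));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> large = Arrays.asList(
                new String[] {"a", "L1"},
                new String[] {"b", "L2"},
                new String[] {"a", "L3"});
        List<String[]> small = Arrays.asList(
                new String[] {"a", "S1"},
                new String[] {"c", "S2"});
        System.out.println(innerJoin(large, small));
    }
}
```

In a distributed setting the lookup table has to be shipped to every task,
which is why this kind of join depends on the small side being materialized
and readable from wherever the tasks run.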

Re: JoinStrategy with Spark pipeline

Posted by Surbhi Mungre <mu...@gmail.com>.
With a minor modification to the patch you posted on CRUNCH-557, the
FileNotFoundException[1] is fixed, but I still get the
CrunchRuntimeException[2].

[1] https://gist.github.com/anonymous/15d6c691b743ad392d42
[2] https://gist.github.com/anonymous/b02a82401a30a69f1cff


On Wed, Sep 2, 2015 at 5:56 PM, Josh Wills <jo...@gmail.com> wrote:

> Posted a patch to fix this here:
> https://issues.apache.org/jira/browse/CRUNCH-557

Re: JoinStrategy with Spark pipeline

Posted by Josh Wills <jo...@gmail.com>.
Posted a patch to fix this here:
https://issues.apache.org/jira/browse/CRUNCH-557
