Posted to user@crunch.apache.org by Nithin Asokan <an...@gmail.com> on 2015/10/08 18:32:03 UTC

SparkPipeline possible avro reuse on cache()

First, I would like to thank everyone for the quick responses and fixes on
most issues. Great job, everyone!

I noticed that calling cache() on a PTable built with SparkPipeline seems to
reuse objects for downstream DoFns. Here is an example that exhibits this
behavior:

https://gist.github.com/nasokan/531b4ff9bf827d0835ab

I would expect this program to output pairs whose key and value are the
same. However, it produces pairs whose key and value differ. I have tested
this with a text file input source and it works as expected. Removing cache()
also produces the expected result. So I suspect this issue is specific
to Avro and cache().

Any thoughts on this behavior?

Thank you!
Nithin

Re: SparkPipeline possible avro reuse on cache()

Posted by Nithin Asokan <an...@gmail.com>.
Thanks Josh. I logged https://issues.apache.org/jira/browse/CRUNCH-569 and
will try submitting a patch for this.

On Thu, Oct 8, 2015 at 1:51 PM Josh Wills <jw...@cloudera.com> wrote:

> Yeah, I could see how that would happen. I think the move would be to
> inject a deep copy inside of the RDD that is underneath a cached
> PCollection. I can probably take a crack at a patch later this weekend; I
> have a busy couple of days w/the baby and new job and whatnot. :)
>
> J

Re: SparkPipeline possible avro reuse on cache()

Posted by Josh Wills <jw...@cloudera.com>.
Yeah, I could see how that would happen. I think the move would be to
inject a deep copy inside of the RDD that is underneath a cached
PCollection. I can probably take a crack at a patch later this weekend; I
have a busy couple of days w/the baby and new job and whatnot. :)

J

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
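
For readers following the thread, here is a minimal sketch of the kind of
object reuse described above and of the deep-copy idea suggested in the reply.
It does not use Crunch, Spark, or Avro: the Record class and the readAndCache
helper are hypothetical stand-ins, and the sketch assumes the reader refills a
single mutable instance per row while the cache stores references. It is an
illustration under those assumptions, not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

public class CacheReuseSketch {

    /** Hypothetical stand-in for a mutable record (not an Avro or Crunch class). */
    static final class Record {
        String key;

        Record(String key) {
            this.key = key;
        }

        /** Deep copy; trivial here because the only field is an immutable String. */
        Record copy() {
            return new Record(key);
        }
    }

    /**
     * Mimics a reader that refills one shared instance for every input row,
     * while a cache keeps whatever reference it is handed.
     */
    static List<Record> readAndCache(String[] rows, boolean deepCopy) {
        Record reused = new Record(null);
        List<Record> cache = new ArrayList<>();
        for (String row : rows) {
            reused.key = row; // overwrite the shared instance in place
            cache.add(deepCopy ? reused.copy() : reused);
        }
        return cache;
    }

    /** Collects the keys currently visible through the cached references. */
    static List<String> keys(List<Record> cached) {
        List<String> out = new ArrayList<>();
        for (Record r : cached) {
            out.add(r.key);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {"a", "b", "c"};
        // Without a copy, every cached entry points at the same object.
        System.out.println(keys(readAndCache(rows, false))); // [c, c, c]
        // With a deep copy before caching, each entry keeps its own value.
        System.out.println(keys(readAndCache(rows, true)));  // [a, b, c]
    }
}
```

Copying before the value enters the cache costs one extra allocation per
record, which is the trade-off of the deep-copy approach.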