Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/04/03 14:44:54 UTC

Re: Strange behavior of RDD.cartesian

You can find a gist that illustrates this issue here:
https://gist.github.com/jrabary/9953562
I got this with Spark built from the master branch.
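
For anyone skimming the thread, a minimal sketch of the pattern in
question (the names, the 10% fraction, and the default-random-seed
sample overload are assumptions, not the gist's exact contents):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local", "cartesian-repro")

// Load and sample, but do not cache: each action below re-evaluates
// this lineage, so the sample may draw different elements every time.
val data: RDD[(Int, String, Array[Double])] = sc.objectFile("data")
val sampled = data.sample(withReplacement = false, fraction = 0.1)
val part1 = sampled.filter(_._2 matches "view1")
val part2 = sampled.filter(_._2 matches "view2")
val pair = part1.cartesian(part2)

// Three separate jobs; without caching, they may disagree.
println(pair.count == part1.count * part2.count)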


On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash <an...@andrewash.com> wrote:

> Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a
> hash collision bug, fixed in 0.9.1, that might cause you to have too
> few results in that join.
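>
> One way to set that, as a sketch (assuming the SparkConf API available
> since 0.9.0; the master URL and app name are illustrative):
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> // Disable shuffle spilling to sidestep the 0.9.0 hash collision bug.
> val conf = new SparkConf()
>   .setMaster("local[2]")
>   .setAppName("no-spill")
>   .set("spark.shuffle.spill", "false")
> val sc = new SparkContext(conf)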
>
> Sent from my mobile phone
> On Mar 28, 2014 8:04 PM, "Matei Zaharia" <ma...@gmail.com> wrote:
>
>> Weird, how exactly are you pulling out the sample? Do you have a small
>> program that reproduces this?
>>
>> Matei
>>
>> On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:
>>
>> I forgot to mention that I don't really use all of my data. Instead I use
>> a sample extracted with randomSample.
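>>
>> A sketch of that step (assuming randomSample refers to RDD.sample, with
>> an illustrative 10% fraction and the default random seed):
>>
>> // Lazy and uncached: the sample can be re-drawn on every action that
>> // touches an RDD derived from it.
>> val sampled = data.sample(withReplacement = false, fraction = 0.1)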
>>
>>
>> On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <ja...@gmail.com>wrote:
>>
>>> Hi all,
>>>
>>> I've noticed that RDD.cartesian behaves strangely with cached versus
>>> uncached data. More precisely, I have a set of data that I load with
>>> objectFile:
>>>
>>> val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")
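>>>
>>> For context, a sketch of how a file like that could have been written
>>> (the records here are illustrative):
>>>
>>> // saveAsObjectFile serializes elements with Java serialization;
>>> // objectFile reads them back as an RDD of the same element type.
>>> val records = sc.parallelize(Seq(
>>>   (1, "view1", Array(0.1, 0.2)),
>>>   (2, "view2", Array(0.3, 0.4))))
>>> records.saveAsObjectFile("data")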
>>>
>>> Then I split it into two sets depending on some criterion:
>>>
>>> val part1 = data.filter(_._2 matches "view1")
>>> val part2 = data.filter(_._2 matches "view2")
>>>
>>> Finally, I compute the cartesian product of part1 and part2
>>>
>>> val pair = part1.cartesian(part2)
>>>
>>> If everything goes well, I should have
>>>
>>> pair.count == part1.count * part2.count
>>>
>>> But this is not the case if I don't cache part1 and part2.
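>>>
>>> For comparison, a sketch of the cached variant, where the equality
>>> does hold:
>>>
>>> // cache() pins part1 and part2 after their first evaluation, so the
>>> // cartesian product and both counts see the same elements.
>>> val part1c = data.filter(_._2 matches "view1").cache()
>>> val part2c = data.filter(_._2 matches "view2").cache()
>>> assert(part1c.cartesian(part2c).count == part1c.count * part2c.count)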
>>>
>>> What am I missing? Is caching data mandatory in Spark?
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>>
>>>
>>>
>>
>>