Posted to user@spark.apache.org by Friso van Vollenhoven <f....@gmail.com> on 2014/03/25 13:53:17 UTC

tuple as keys in pyspark show up reversed

Hi,

I have an example where I use a tuple of (int, int) in Python as the key for
an RDD. When I do a reduceByKey(...), the tuples sometimes turn up with the
two ints reversed in order (which is problematic, as the ordering is part
of the key).

Here is an IPython notebook with some code that demonstrates the issue:
http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/5812021/test-on-cluster.ipynb?create=1

Here's the long story... I am doing collaborative filtering using cosine
similarity on Spark with Python. Not because I need it, but because it seemed
like an appropriately simple yet useful exercise to get started with Spark.

I am seeing a difference in outcomes between running locally and running on
a cluster.

My approach is this (a rough PySpark sketch follows after the list):

   - given a dataset containing tuples of (user, item, rating)
   - group by user
   - for each user flatMap into tuples of ((item, item), (rating, rating))
   for each combination of two items that that user has seen
   - then map into a structure containing ((item, item), (rating * rating,
   left rating ^ 2, right rating ^ 2, 1))
   - then reduceByKey, summing up the values in the tuples column-wise;
   this gives a tuple of (sum of rating product, sum of left rating squares,
   sum of right rating squares, co-occurrence count) which can be used to
   calculate cosine similarity.
   - map into (item,item), (similarity, count)
   - this should result in a dataset that looks like: (item, item),
   (similarity, count)

(In the notebook I leave out the final step of converting the sums into a
cosine similarity.)
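For concreteness, here is a minimal PySpark sketch of the steps above. It is
an illustration under a few assumptions, not the exact code from the notebook:
it assumes an existing SparkContext and an RDD called ratings containing
(user, item, rating) tuples, and it condenses the flatMap and map steps into
one.

    from itertools import combinations

    def item_pairs(user_ratings):
        # Sort by item so every (item, item) key always comes out in the
        # same orientation, regardless of the order of the input.
        rated = sorted(user_ratings, key=lambda x: x[0])  # x is (item, rating)
        for (i1, r1), (i2, r2) in combinations(rated, 2):
            yield ((i1, i2), (r1 * r2, r1 * r1, r2 * r2, 1))

    def sum_columns(a, b):
        # Column-wise sum of two equal-length tuples.
        return tuple(x + y for x, y in zip(a, b))

    sums = (ratings
            .map(lambda t: (t[0], (t[1], t[2])))    # (user, (item, rating))
            .groupByKey()                            # user -> [(item, rating), ...]
            .flatMap(lambda kv: item_pairs(kv[1]))   # ((item, item), (r1*r2, r1^2, r2^2, 1))
            .reduceByKey(sum_columns))
    # Each value is now (sum of rating products, sum of left rating squares,
    # sum of right rating squares, co-occurrence count), from which the cosine
    # similarity per (item, item) pair can be computed.
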

When I do this for an artificial dataset of 100000 users who each rate the
same 70 items (so 7M ratings in total, and the user/item matrix is dense), I
would expect the resulting dataset to have 2415 tuples (== 70*69 / 2, the
number of co-occurrences that exist amongst the 70 items that each user has
rated), and I would expect the co-occurrence count for each item,item pair to
be 100000, as there are 100K users.
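
As a quick sanity check on those expectations (purely illustrative, not taken
from the notebook):

    from itertools import combinations

    n_users, n_items = 100000, 70
    total_ratings = n_users * n_items                 # 7,000,000 ratings
    expected_pairs = n_items * (n_items - 1) // 2     # 2415 unordered item pairs
    assert expected_pairs == len(list(combinations(range(n_items), 2)))
    # With every user rating all 70 items, each (item, item) pair co-occurs
    # once per user, so every pair's count should be exactly 100000.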

When I run my code locally, the above assumptions work out, but when I run
on a small cluster (4 workers, on AWS), the numbers are way off. This
happens because of the reversed tuples.

Where am I going wrong?


Thanks for any pointers, cheers,
Friso

Re: tuple as keys in pyspark show up reversed

Posted by Friso van Vollenhoven <f....@gmail.com>.
OK, forget about this question. It was a nasty, one-character typo in my
own code (sorting by rating instead of item at one point).
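For anyone who runs into the same symptom, the class of bug looks roughly
like this (a hypothetical reconstruction, not the exact line from my
notebook), where user_ratings is one user's list of (item, rating) tuples:

    # Buggy: sorting by rating (index 1). Equal ratings give no stable item
    # order, so the same item pair can come out as (a, b) for one user and
    # (b, a) for another, splitting what should be a single key.
    rated = sorted(user_ratings, key=lambda x: x[1])

    # Intended: sorting by item (index 0), so every (item, item) key has a
    # single, consistent orientation before reduceByKey.
    rated = sorted(user_ratings, key=lambda x: x[0])
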
Best,
Friso

