Posted to reviews@spark.apache.org by jtengyp <gi...@git.apache.org> on 2017/05/08 09:28:42 UTC

[GitHub] spark issue #17898: Optimize the CartesianRDD to reduce repeatedly data fetc...

Github user jtengyp commented on the issue:

    https://github.com/apache/spark/pull/17898
  
    Here is my test:
    Environment: 3 workers, each with 10 cores, 30 GB of memory, and 1 executor.
    Test data: 480,189 users and 17,770 items, each represented by a 10-dimensional vector.
    With the default CartesianRDD, the cartesian step takes 2420.7 s.
    With this proposal, it takes 45.3 s,
    roughly 53x faster than the original method.
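
    A minimal sketch of the kind of benchmark described above, assuming
    ALS-style factor RDDs of random 10-dimensional vectors; the object
    name, partition counts, and timing harness are illustrative and not
    taken from the PR:

        import org.apache.spark.sql.SparkSession
        import scala.util.Random

        object CartesianBench {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("CartesianBench").getOrCreate()
            val sc = spark.sparkContext

            val rank = 10
            // 480,189 user factors and 17,770 item factors, matching the test data above.
            val users = sc.parallelize(0 until 480189, 30)
              .map(id => (id, Array.fill(rank)(Random.nextDouble())))
            val items = sc.parallelize(0 until 17770, 30)
              .map(id => (id, Array.fill(rank)(Random.nextDouble())))

            val start = System.nanoTime()
            // Time the full cartesian product; this is the step whose repeated
            // partition fetches the PR aims to reduce.
            val pairs = users.cartesian(items).count()
            val seconds = (System.nanoTime() - start) / 1e9
            println(s"cartesian pairs = $pairs, time = $seconds s")

            spark.stop()
          }
        }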

