Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/08/21 17:34:32 UTC

[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/2083

    [WIP][SPARK-3098]In some cases, the result of RDD.distinct is inconsistent

    cc @srowen

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark distinct

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2083.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2083
    
----
commit 425c8236d1c4d3b1fb93d2fe4c25d9cba45620fd
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-08-20T16:34:27Z

    The result of RDD.distinct is inconsistent

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53041899
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19083/consoleFull) for   PR 2083 at commit [`7ce4740`](https://github.com/apache/spark/commit/7ce4740b1119190e3fcf5e5f6796e5957d30bcf6).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-52945236
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19058/consoleFull) for   PR 2083 at commit [`425c823`](https://github.com/apache/spark/commit/425c8236d1c4d3b1fb93d2fe4c25d9cba45620fd).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53145524
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19095/consoleFull) for   PR 2083 at commit [`60e8274`](https://github.com/apache/spark/commit/60e827480e31f7773278da2e83b81178edc8ebb7).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53367217
  
    Actually I commented on the JIRA -- I'm not sure this is a bug. distinct() makes no guarantees on the order of results obtained, and in general, our shuffle operations shouldn't make guarantees about a specific order (unless you are explicitly calling sortByKey). So unless values are missing from the resulting set somehow, I wouldn't consider this a bug. Your code can always call sortByKey or call mapPartitions to sort within a partition if you want a specific order.
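
The distinction drawn above (the set of values is guaranteed, the order is not) can be sketched with a small plain-Python model. This is not Spark code: `distinct_in_arrival_order` is a hypothetical helper, and the permutations of `blocks` stand in for shuffle blocks arriving in different orders on different evaluations.

```python
from itertools import permutations

def distinct_in_arrival_order(arrived_blocks):
    """Deduplicate values, emitting each one the first time it is seen."""
    seen, out = set(), []
    for block in arrived_blocks:
        for value in block:
            if value not in seen:
                seen.add(value)
                out.append(value)
    return out

# Hypothetical shuffle blocks; each permutation models one possible arrival order.
blocks = [[3, 1, 3], [2, 1], [5, 4, 2], [6, 6, 0]]
runs = [distinct_in_arrival_order(order) for order in permutations(blocks)]

value_sets = {frozenset(run) for run in runs}        # which values came back
raw_orders = {tuple(run) for run in runs}            # in what order they came back
sorted_orders = {tuple(sorted(run)) for run in runs} # order after an explicit sort

print(len(value_sets), len(raw_orders) > 1, len(sorted_orders))
```

Every run yields the same set of values, the raw order varies with arrival order, and only the explicit sort collapses all runs to a single order. In Spark the corresponding explicit step would be `sortByKey` or a sort inside `mapPartitions`, as the comment suggests.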



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53037112
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19083/consoleFull) for   PR 2083 at commit [`7ce4740`](https://github.com/apache/spark/commit/7ce4740b1119190e3fcf5e5f6796e5957d30bcf6).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54074910
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19565/consoleFull) for   PR 2083 at commit [`df59bea`](https://github.com/apache/spark/commit/df59bea54691ec68b4cde1603f6f1a0db15efb06).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53038130
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19082/consoleFull) for   PR 2083 at commit [`613c641`](https://github.com/apache/spark/commit/613c6412faded8eee41c6e63d61f6e0f73e0f1e2).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by nchammas <gi...@git.apache.org>.

Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54105091
  
    > our shuffle operations shouldn't make guarantees about a specific order (unless you are explicitly calling sortByKey)
    
    As an aside, this matches the semantics of SQL. The only thing that _guarantees_ a certain order in SQL is an explicit call to `ORDER BY`; tables are unordered sets of rows; etc. 
    
    I think it makes sense to have these RDD operations follow similar semantics.
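
For readers who want to see the SQL side of this analogy, here is a minimal sketch using the SQLite engine bundled with Python. The table name `t` and column `v` are made up for illustration. SQLite is used only because it needs no setup, and nothing here is specific to Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [(3,), (1,), (3,), (2,), (1,)])

# Without ORDER BY, the row order is whatever the engine happens to produce.
unordered = [row[0] for row in conn.execute("SELECT DISTINCT v FROM t")]

# With ORDER BY, the order is part of the query's contract.
ordered = [row[0] for row in conn.execute("SELECT DISTINCT v FROM t ORDER BY v")]

conn.close()
print(sorted(unordered) == ordered)
```

Code that consumes `unordered` should rely only on which values are present. Code that needs a position should use `ordered`.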



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-52938313
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19058/consoleFull) for   PR 2083 at commit [`425c823`](https://github.com/apache/spark/commit/425c8236d1c4d3b1fb93d2fe4c25d9cba45620fd).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54068545
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19565/consoleFull) for   PR 2083 at commit [`df59bea`](https://github.com/apache/spark/commit/df59bea54691ec68b4cde1603f6f1a0db15efb06).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by nchammas <gi...@git.apache.org>.

Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54176053
  
    @srowen Thanks for clarifying that. I agree that it would be confusing for people to get different results depending on when they look up an item in an RDD. Perhaps the appropriate solution for the time being is just clear documentation about this behavior--that is, if you want a consistent path to a piece of data, you need a persistent index on the containing data set.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by witgo <gi...@git.apache.org>.

Github user witgo closed the pull request at:

    https://github.com/apache/spark/pull/2083



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2083#discussion_r16692571
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala ---
    @@ -83,4 +84,21 @@ private[hash] object BlockStoreShuffleFetcher extends Logging {
     
         new InterruptibleIterator[T](context, completionIter)
       }
    +
    +  private def randomize[T](data: Seq[T], shuffleId: Int, reduceId: Int): Seq[T] = {
    --- End diff --
    
    Use our Utils.randomize instead. In general it's not a good idea to create your own hash function.
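
The thread shows only the signature of the patch's `randomize`, not its body, so the following is an assumption about its intent: a permutation that is reproducible for a given `shuffleId` and `reduceId`. The sketch is plain Python rather than Spark's `Utils.randomize`. It illustrates the review point by seeding a library PRNG instead of deriving an ordering from a hand-written hash function.

```python
import random

def randomize(data, shuffle_id, reduce_id):
    """Return a reproducible permutation of data for one (shuffle_id, reduce_id) pair.

    Hypothetical model of the patch's helper; the real body is not shown in the thread.
    """
    # A string seed is hashed by the standard library itself, so no custom
    # hash function is needed and the result does not depend on PYTHONHASHSEED.
    rng = random.Random(f"{shuffle_id}:{reduce_id}")
    out = list(data)  # copy so the caller's sequence is left untouched
    rng.shuffle(out)
    return out

blocks = list(range(10))
first = randomize(blocks, shuffle_id=1, reduce_id=2)
second = randomize(blocks, shuffle_id=1, reduce_id=2)
print(first == second, sorted(first) == blocks)
```

Calling the function twice with the same ids returns the same permutation, and the input list is not modified.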



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53033537
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19082/consoleFull) for   PR 2083 at commit [`613c641`](https://github.com/apache/spark/commit/613c6412faded8eee41c6e63d61f6e0f73e0f1e2).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54124826
  
    @nchammas to be clear, the question isn't really about ordering. The issue is that the result of the same RDD in this example changes when it is reevaluated. It's more like having a ResultSet change under you while iterating. @mateiz explained that this is working as intended. I think some straightforward uses of `zipWithIndex` may surprise people then. For example, I add an index to each datum, do some computation, and later go back to look up a datum by index on the very same RDD. I am not likely to get back the same datum -- unless it has been persisted. Something to keep in mind to see if it actually bites people regularly.
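
The `zipWithIndex` surprise described above can be sketched in plain Python. This is a model, not Spark code: `zip_with_index` is a hypothetical helper, and the two arrival orders stand in for two evaluations of the same unpersisted RDD.

```python
from itertools import permutations

def zip_with_index(arrived_blocks):
    """Number elements in the order the blocks happened to arrive."""
    flat = [value for block in arrived_blocks for value in block]
    return dict(enumerate(flat))

blocks = [["a", "b"], ["c"], ["d", "e"]]
arrival_orders = list(permutations(blocks))

# Two evaluations from scratch, each seeing a different arrival order.
first_eval = zip_with_index(arrival_orders[0])    # blocks as listed
second_eval = zip_with_index(arrival_orders[-1])  # blocks reversed

# Same data overall, but index 0 points at a different datum each time.
same_data = set(first_eval.values()) == set(second_eval.values())
index_zero = (first_eval[0], second_eval[0])

# Materializing once (the analogue of persisting) pins the mapping down.
persisted = zip_with_index(arrival_orders[0])
stable = persisted[0] == persisted[0]

print(same_data, index_zero, stable)
```

Both evaluations hold the same five values, yet index 0 resolves to `"a"` in one and `"d"` in the other. Reading from the single materialized mapping avoids the mismatch, which is the role persistence plays in the comment.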



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-53144520
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19095/consoleFull) for   PR 2083 at commit [`60e8274`](https://github.com/apache/spark/commit/60e827480e31f7773278da2e83b81178edc8ebb7).
     * This patch merges cleanly.



[GitHub] spark pull request: [WIP][SPARK-3098]In some cases, the result of ...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2083#issuecomment-54107787
  
    Yup, that's the goal (added the same discussion on https://issues.apache.org/jira/browse/SPARK-3098)

