Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2015/04/26 21:29:40 UTC

[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/5709

    [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark inc-id

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5709
    
----
commit a7136cb8fb542cf32675117a4aab616e33eb5750
Author: Reynold Xin <rx...@databricks.com>
Date:   2015-04-26T19:29:07Z

    [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96767518
  
      [Test build #30969 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30969/consoleFull) for   PR 5709 at commit [`a9fda0d`](https://github.com/apache/spark/commit/a9fda0d5c65b884a4e115bba2401bd89ce4436f6).




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96502922
  
    (That's not always true: somebody could have dropped an index, which turns an index scan into a sequential scan and changes the record ordering.)




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96430170
  
      [Test build #30959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30959/consoleFull) for   PR 5709 at commit [`a7136cb`](https://github.com/apache/spark/commit/a7136cb8fb542cf32675117a4aab616e33eb5750).




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96463353
  
    No, but the ordering of records in a partition can change, so you might have different identifiers for the same record across retries (unless this is only used for already sorted data... is it?).




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5709




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96436586
  
    Could it be confusing to users that the ID associated with each record might be different on stage or task retries? The fact that ordering within a partition is not deterministic has caused people some concern in the past, and I wonder if this could sort of lead to more confusion since you are giving some _sort of_ ordering semantics.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96503022
  
    Oh I see - I guess it doesn't matter then.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5709#discussion_r29116369
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -103,8 +103,28 @@ def countDistinct(col, *cols):
         return Column(jc)
     
     
    +def monotonicallyIncreasingId():
    +    """A column that generates monotonically increasing 64-bit integers.
    +
    +    The generated ID is guaranteed by be monotonically increasing and unique, but not consecutive.
    +    The current implementation puts the partition ID in the upper 31 bits, and the record number
    +    within each partition in the lower 33 bits. The assumption is that the data frame has
    +    less than 1 billion partitions, and each partition has less than 20 billion records.
    --- End diff --
    
    20 b -> 4b ...
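
The bit layout described in the docstring above can be sketched in plain Python. This is only an illustration of the packing scheme, not the Spark implementation: the 31/33 split is taken from the quoted docstring, while the names `make_id` and `split_id` are hypothetical.

```python
# Bit widths as stated in the docstring under review.
PARTITION_BITS = 31  # upper bits: partition ID
RECORD_BITS = 33     # lower bits: record number within the partition


def make_id(partition_id, record_number):
    """Pack a partition ID and a per-partition record number into one integer.

    Hypothetical helper for illustration only.
    """
    if not 0 <= partition_id < (1 << PARTITION_BITS):
        raise ValueError("partition_id does not fit in 31 bits")
    if not 0 <= record_number < (1 << RECORD_BITS):
        raise ValueError("record_number does not fit in 33 bits")
    return (partition_id << RECORD_BITS) | record_number


def split_id(value):
    """Recover (partition_id, record_number) from a packed ID."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)
```

With these widths the record field can hold 2**33 = 8,589,934,592 distinct values and the partition field 2**31 = 2,147,483,648. Because the partition ID sits in the high bits, every ID in partition k is smaller than every ID in partition k + 1, which is why the values are increasing and unique but not consecutive.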
    





[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96494526
  
    Those could change in a shuffle, I guess, but I don't think this creates more confusion. What we care about here is not the record ordering, but that the output of this expression is monotonically increasing. That will always be true.
    
    This is very similar to the row id idea a lot of databases have. SQL tables also don't have ordering, unless they are sorted.
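
The distinction discussed in this thread (the generated values stay monotonically increasing, but the ID attached to a given record depends on the order in which records arrive) can be seen in a small pure-Python simulation. It assumes the 31/33-bit layout from the docstring under review; `assign_ids` and the sample data are hypothetical and do not use Spark.

```python
RECORD_BITS = 33  # lower bits hold the record number within a partition


def assign_ids(partitions):
    """Return (id, record) pairs for a list of partitions (lists of records).

    Hypothetical stand-in for the expression: partition index in the
    upper bits, position within the partition in the lower bits.
    """
    out = []
    for partition_id, records in enumerate(partitions):
        for position, record in enumerate(records):
            out.append(((partition_id << RECORD_BITS) | position, record))
    return out


# Same data, but the first partition's records arrive in a different
# order on the retry (for example after a nondeterministic shuffle).
first_run = assign_ids([["a", "b"], ["c", "d"]])
retry = assign_ids([["b", "a"], ["c", "d"]])

ids_first = [i for i, _ in first_run]
ids_retry = [i for i, _ in retry]

id_of_first = {record: i for i, record in first_run}
id_of_retry = {record: i for i, record in retry}
```

Both runs produce the same strictly increasing sequence of IDs, yet record "a" receives ID 0 in the first run and ID 1 on the retry. The expression's guarantee holds, while the record-to-ID mapping is only as stable as the record ordering.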




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96451839
  
    The partition ID doesn't change between retries, does it?




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96502816
  
    @rxin Yeah, I just mean that if I'm in a database and run the same query twice, I get the same row ID for the same record. Because of nondeterminism in the shuffle, that's not true here.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96958390
  
      [Test build #31116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31116/consoleFull) for   PR 5709 at commit [`7853611`](https://github.com/apache/spark/commit/78536117934223e2d0f7554897f319ef4b680650).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96906283
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96906610
  
      [Test build #31112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31112/consoleFull) for   PR 5709 at commit [`a9fda0d`](https://github.com/apache/spark/commit/a9fda0d5c65b884a4e115bba2401bd89ce4436f6).




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96431732
  
      [Test build #30959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30959/consoleFull) for   PR 5709 at commit [`a7136cb`](https://github.com/apache/spark/commit/a7136cb8fb542cf32675117a4aab616e33eb5750).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96907479
  
      [Test build #31112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31112/consoleFull) for   PR 5709 at commit [`a9fda0d`](https://github.com/apache/spark/commit/a9fda0d5c65b884a4e115bba2401bd89ce4436f6).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class MonotonicallyIncreasingID() extends LeafExpression `
    
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96913243
  
      [Test build #31116 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31116/consoleFull) for   PR 5709 at commit [`7853611`](https://github.com/apache/spark/commit/78536117934223e2d0f7554897f319ef4b680650).




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96906797
  
    LGTM




[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5709#discussion_r29116363
  
    --- Diff: python/pyspark/sql/functions.py ---
    @@ -103,8 +103,28 @@ def countDistinct(col, *cols):
         return Column(jc)
     
     
    +def monotonicallyIncreasingId():
    +    """A column that generates monotonically increasing 64-bit integers.
    +
    +    The generated ID is guaranteed by be monotonically increasing and unique, but not consecutive.
    --- End diff --
    
    guaranteed TO be


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-7135][SQL] DataFrame expression for mon...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5709#issuecomment-96497106
  
    @pwendell You raised a very good point about the ordering of records within RDDs and DataFrames. I think we should document that more clearly in the Javadoc for both.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org