Posted to reviews@spark.apache.org by kanzhang <gi...@git.apache.org> on 2014/05/20 21:27:37 UTC

[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

GitHub user kanzhang opened a pull request:

    https://github.com/apache/spark/pull/841

    [SPARK-1822] SchemaRDD.count() should use optimizer

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kanzhang/spark SPARK-1822

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #841
    
----
commit e67c910a5777300f1dc6d9c4908c0794dcd12863
Author: Kan Zhang <kz...@apache.org>
Date:   2014-05-20T19:24:47Z

    [SPARK-1822] SchemaRDD.count() should use optimizer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44118489
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15183/



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44118488
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43786603
  
    @rxin thanks for the heads up. I'd appreciate help from anyone in burning down my open PRs, the oldest of which is over a month old.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44113762
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/841



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by ash211 <gi...@git.apache.org>.
Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43674365
  
    Thanks for the contribution! I could use it in my own workflows.
    
    Python ints are signed 32-bit numbers, right? We should make that a long
    explicitly, unless Python does the right thing and promotes to a long
    rather than overflowing.
    On May 20, 2014 12:44 PM, "kanzhang" <no...@github.com> wrote:
    
    > @marmbrus <https://github.com/marmbrus> I tried to implement the formula
    > you gave on the mailing list. I'm not sure whether I missed anything, so
    > please take a look. Note that I changed Count() to return Long to match
    > RDD.count(). On the Python side, the original rdd.count() returns an int.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/841#discussion_r12916889
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -274,6 +274,10 @@ class SchemaRDD(
           seed: Long) =
         new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, logicalPlan))
     
    +  override def count(): Long = {
    --- End diff --
    
    Do you mind adding Javadoc for this? Just explain that, unlike RDD's count, SchemaRDD's count actually invokes the optimizer.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44148008
  
    @rxin thanks for the cleanup!



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43675504
  
    @ash211 In Python 2.x, an int is promoted to a long on overflow, so nothing wraps around. The distinction still matters in doctests, where you have to be explicit about whether the result value is 3 or 3L.
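
For readers on a current interpreter, here is a minimal sketch of the same point. It assumes Python 3, where int and long were unified into one arbitrary-precision int type, so the `3L` spelling mentioned above no longer exists. The names `INT32_MAX` and `add_one` are made up for illustration and are not part of the patch.

```python
# Assumes Python 3: int is arbitrary precision, so there is no separate long
# type and no overflow at the 32-bit boundary.
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def add_one(n):
    """Return n + 1; the result simply grows past 32 bits if needed."""
    return n + 1


big = add_one(INT32_MAX)
print(big, type(big).__name__)
```

The value keeps growing instead of wrapping to a negative number, which is the behavior the question above was asking about.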



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44113750
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43726915
  
    He's on vacation this week so it might take a while for him to get back :)



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43673656
  
    @marmbrus I tried to implement the formula you gave on the mailing list. I'm not sure whether I missed anything, so please take a look. Note that I changed Count() to return Long to match RDD.count(). On the Python side, the original rdd.count() returns an int.
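
As a rough illustration of why a 64-bit return type matters for a row count, the sketch below simulates 32-bit and 64-bit signed two's-complement wraparound in Python using the standard `ctypes` module. It is not code from this patch; the helper names and the `row_count` value are hypothetical.

```python
import ctypes


def as_int32(n):
    """Wrap n the way a signed 32-bit two's-complement integer would."""
    return ctypes.c_int32(n).value


def as_int64(n):
    """Wrap n the way a signed 64-bit two's-complement integer would."""
    return ctypes.c_int64(n).value


# Hypothetical row count, larger than the 32-bit maximum of 2147483647.
row_count = 3_000_000_000

print(as_int32(row_count))  # wraps around to a negative number
print(as_int64(row_count))  # fits comfortably in 64 bits
```

A count stored in 32 bits would go negative once a dataset passes roughly 2.1 billion rows, whereas a 64-bit value has ample headroom.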



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by kanzhang <gi...@git.apache.org>.
Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/841#discussion_r12921105
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
    @@ -274,6 +274,10 @@ class SchemaRDD(
           seed: Long) =
         new SchemaRDD(sqlContext, Sample(fraction, withReplacement, seed, logicalPlan))
     
    +  override def count(): Long = {
    --- End diff --
    
    Sure, will do.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44113416
  
    Jenkins, add to whitelist.



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use opti...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-43671863
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-1822] SchemaRDD.count() should use quer...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/841#issuecomment-44119287
  
    Thanks. I've merged this into master & branch-1.0.

