Posted to reviews@spark.apache.org by saucam <gi...@git.apache.org> on 2015/02/25 13:25:00 UTC

[GitHub] spark pull request: SPARK-6006: Optimize count distinct for high c...

GitHub user saucam opened a pull request:

    https://github.com/apache/spark/pull/4764

    SPARK-6006: Optimize count distinct for high cardinality columns

    Currently the plan for count distinct looks like this:
    
    Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
       Exchange SinglePartition
        Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
         !OutputFaker [snAppProtocol#448]
          ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []
    
    
    This can be slow when the column has many distinct values, because every partial set is sent to a single partition to be combined. This PR changes the above plan to:
    
    
    Aggregate false, [], [SUM(_c0#437L) AS totalCount#514L]
     Exchange SinglePartition
      Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
       Exchange (HashPartitioning [snAppProtocol#448], 200)
        Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
         !OutputFaker [snAppProtocol#448]
          ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []
    
    This way, even if there are many distinct values, they are inserted into partial hash sets, so the computation stays distributed and is therefore faster.
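
The idea behind the new plan can be illustrated outside Spark. The following is only a minimal sketch using plain Scala collections in place of partitions; `TwoPhaseCountDistinct`, `singlePartition` and `hashPartitioned` are hypothetical names, inputs are assumed to be non-null, and the real plan groups by the column value rather than by a partition id. It shows why the partial counts can simply be summed: equal values always hash to the same partition, so no value is counted twice.

```scala
object TwoPhaseCountDistinct {
  // Old plan shape: every value ends up in one set on a single partition.
  def singlePartition(values: Seq[String]): Long =
    values.toSet.size.toLong

  // New plan shape: hash-partition the values, count the distinct values
  // inside each partition, then sum the partial counts.
  def hashPartitioned(values: Seq[String], numPartitions: Int): Long = {
    require(numPartitions > 0, "numPartitions must be positive")
    val partitions: Map[Int, Seq[String]] =
      values.groupBy(v => Math.floorMod(v.hashCode, numPartitions))
    // Equal values share a hash code, so the per-partition sets are disjoint.
    val partialCounts: Iterable[Long] =
      partitions.values.map(_.toSet.size.toLong)
    partialCounts.sum // an empty input yields 0
  }
}
```

Both functions return the same count for the same input; only the second one spreads the set-building work across partitions.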
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark optcountdis

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4764.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4764
    
----
commit 3e6d227184451026dbfda9866ae1e114bde002b1
Author: Yash Datta <ya...@guavus.com>
Date:   2015-02-25T12:09:01Z

    SPARK-6006: Optimize count distinct for high cardinality columns

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-92509606
  
    Here is the JIRA: SPARK-4366.  Unless you think you will have something in the next day or two, would you mind closing this JIRA?  I'd like to keep the PR queue to only active issues so that we don't miss things.  Thanks!




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76062868
  
      [Test build #27960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27960/consoleFull) for   PR 4764 at commit [`4125e2e`](https://github.com/apache/spark/commit/4125e2e26f444d781bc55f5b226085a17d47f0fc).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-77510275
  
    please retest





[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89756295
  
    fixed test failures because of class cast exceptions. Please retest.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-92803029
  
    thanks @marmbrus . Let me refactor this then and open another PR later.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-75952342
  
    @marmbrus can you please guide me on how to rewrite this in a better way?





[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89091680
  
      [Test build #29635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29635/consoleFull) for   PR 4764 at commit [`6883b42`](https://github.com/apache/spark/commit/6883b4293c33f8ed6658410661ee30251bf3bcdc).




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89091365
  
    ok to test




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89769240
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29724/
    Test PASSed.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76147388
  
      [Test build #27994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27994/consoleFull) for   PR 4764 at commit [`edee0d2`](https://github.com/apache/spark/commit/edee0d2d79def1eb5e4576f9db0f826e3e905888).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89604768
  
    fixed the test case of zero count when there is no data. rebased with latest master. please retest




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76147005
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76135270
  
    can we test this again please ?





[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76149044
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27994/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-91089650
  
    Thanks for working on this and sorry for the delay in reviewing it.  My high-level feedback is that I think we should optimize handling of distinct aggregation, but there are already plans to do this more holistically instead of as a point solution.  If this is really important to you for some specific production workload, we could consider adding something simple now and removing it later, but otherwise I'd prefer to wait for the full solution.
    
    More specifically, I have some advice on how I would structure this if we were to move forward with this approach.
     - Code style: In general for the whole optimizer we try to avoid the use of `var`s and `while` loops, preferring functional constructs where possible.  `var`s and `while` loops are okay in performance critical code.
     - Placement: Instead of making changes to analysis (only resolution and type coercion should happen here) and planning, I think this should be a single rule inside of the Optimizer.  This is because it starts with a valid logical plan and ends with a valid logical plan, but is rewriting it to be more efficient.
     - SumZero: Where possible, prefer to compose existing constructs.  i.e., I think this could just be a `coalesce(sum(...), 0)` instead of duplicating a significant amount of code.
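
The SumZero point relies on SQL semantics: `SUM` over no rows is NULL, and `COALESCE` substitutes a default. A minimal sketch of that composition, with `Option[Long]` standing in for nullable values; `SumOrZero`, `sqlSum`, `coalesce` and `sumOrZero` are hypothetical helpers written for illustration, not Catalyst expressions.

```scala
object SumOrZero {
  // SQL-style SUM: no input rows (or only NULLs) produce NULL, modeled as None.
  def sqlSum(values: Seq[Option[Long]]): Option[Long] = {
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(nonNull.sum)
  }

  // SQL-style COALESCE: the first non-NULL argument, if there is one.
  def coalesce(args: Option[Long]*): Option[Long] =
    args.collectFirst { case Some(v) => v }

  // The composition suggested in the review: coalesce(sum(...), 0).
  def sumOrZero(values: Seq[Option[Long]]): Long =
    coalesce(sqlSum(values), Some(0L)).getOrElse(0L)
}
```

Composing the two existing operations gives the zero-on-empty behaviour without a dedicated aggregate.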




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76184234
  
    Fixed the null count test failure. Optimization works only in case of single count distinct in select clause




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-91997683
  
    Hi @marmbrus, can you share the other plans for modifying aggregates that you mentioned earlier? Can I help with that? Otherwise I'll modify this one for now as you have suggested.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-91093315
  
    As a very rough sketch (this is totally untested and I'm probably missing cases), I'd hope the solution could look something like the following:
    
    ```scala
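    import org.apache.spark.sql.catalyst.expressions._
    import org.apache.spark.sql.catalyst.plans.logical._
    import org.apache.spark.sql.catalyst.rules.Rule

    object OptimizeSimpleDistincts extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // A global aggregate (no grouping) with exactly one aggregate expression.
        case a @ Aggregate(Nil, Seq(agg), child) =>
          // Replace the distinct aggregate with its plain counterpart.
          val rewritten = agg transform {
            case CountDistinct(Seq(e)) => Count(e)
            case SumDistinct(e) => Sum(e)
          }

          if (rewritten != agg) {
            // De-duplicate the input rows first, then aggregate them.
            Aggregate(Nil, rewritten.asInstanceOf[NamedExpression] :: Nil, Distinct(child))
          } else {
            a // nothing was rewritten, so keep the matched node as it is
          }
      }
    }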
    ```
    
    With tests of course :)  See `FilterPushdownSuite` for an example.
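
To make the shape of such a rewrite concrete without a Spark dependency, here is a toy sketch. The `Plan` and `Agg` types below are hypothetical stand-ins invented for this illustration, not Catalyst classes, and the relation has a single `Long` column. It rewrites a count distinct into a plain count over a de-duplicated child and checks that both plans evaluate to the same result.

```scala
object DistinctRewriteSketch {
  // Toy aggregate functions (hypothetical, not Catalyst expressions).
  sealed trait Agg
  case object Count extends Agg
  case object CountDistinct extends Agg

  // Toy logical plan over a single column of Long values.
  sealed trait Plan
  final case class Relation(values: Seq[Long]) extends Plan
  final case class Distinct(child: Plan) extends Plan
  final case class Aggregate(agg: Agg, child: Plan) extends Plan

  // Rewrite count(distinct x) into count(x) over Distinct(child).
  def optimize(plan: Plan): Plan = plan match {
    case Aggregate(CountDistinct, child) =>
      Aggregate(Count, Distinct(optimize(child)))
    case Aggregate(agg, child) => Aggregate(agg, optimize(child))
    case Distinct(child)       => Distinct(optimize(child))
    case r: Relation           => r
  }

  // Naive evaluator, used only to check that the rewrite preserves results.
  def execute(plan: Plan): Seq[Long] = plan match {
    case Relation(values)                => values
    case Distinct(child)                 => execute(child).distinct
    case Aggregate(Count, child)         => Seq(execute(child).size.toLong)
    case Aggregate(CountDistinct, child) =>
      Seq(execute(child).distinct.size.toLong)
  }
}
```

A test for a real optimizer rule would follow the same pattern: build a plan, apply the rule, and compare the result with the expected plan.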




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89107886
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29635/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89604947
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29709/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-92803267
  
    thanks @marmbrus . Let me refactor this then and open another PR later.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76347215
  
    please retest




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76065343
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27960/
    Test FAILed.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76062074
  
    test this please




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-89107882
  
      [Test build #29635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29635/consoleFull) for   PR 4764 at commit [`6883b42`](https://github.com/apache/spark/commit/6883b4293c33f8ed6658410661ee30251bf3bcdc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76149037
  
      [Test build #27994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27994/consoleFull) for   PR 4764 at commit [`edee0d2`](https://github.com/apache/spark/commit/edee0d2d79def1eb5e4576f9db0f826e3e905888).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by saucam <gi...@git.apache.org>.
Github user saucam closed the pull request at:

    https://github.com/apache/spark/pull/4764




[GitHub] spark pull request: SPARK-6006: Optimize count distinct for high c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-75951661
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4764#issuecomment-76065331
  
      [Test build #27960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27960/consoleFull) for   PR 4764 at commit [`4125e2e`](https://github.com/apache/spark/commit/4125e2e26f444d781bc55f5b226085a17d47f0fc).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.

