You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by larvaboy <gi...@git.apache.org> on 2014/05/12 12:33:36 UTC

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

GitHub user larvaboy opened a pull request:

    https://github.com/apache/spark/pull/737

    Implement ApproximateCountDistinct for SparkSql

    Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.
    
    A simple serializer and test cases are added as well.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/larvaboy/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/737.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #737
    
----
commit 871abec814fa15e9507a98ca1b4718429781efd7
Author: larvaboy <la...@gmail.com>
Date:   2014-05-10T23:20:10Z

    Fix a couple of minor typos.

commit f73651c8dc23fdd83b1bfb35bda135449f84c5c5
Author: larvaboy <la...@gmail.com>
Date:   2014-05-11T23:15:35Z

    Fix a minor typo in the toString method of the Count case class.

commit 25b46046c5e7a772dd25f2bd7ae711c9dabd3959
Author: larvaboy <la...@gmail.com>
Date:   2014-05-12T09:25:59Z

    Add SparkSql serializer for HyperLogLog.

commit 80f1da4a48d3929272a4436aee26531f03eab4aa
Author: larvaboy <la...@gmail.com>
Date:   2014-05-12T09:38:16Z

    Add ApproximateCountDistinct aggregates and functions.
    
    We use stream-lib's HyperLogLog to approximately count the number of
    distinct elements in each partition, and merge the HyperLogLogs to
    compute the final result.
    
    If the expressions can not be successfully broken apart, we fall back to
    the exact CountDistinct.

commit 234a270a5e6766ad41b4fb49a54d42ddb4643264
Author: larvaboy <la...@gmail.com>
Date:   2014-05-12T04:58:54Z

    Add the parser for the approximate count.

commit cf73b921cfa901ffb40c848ca1961378475fea1a
Author: larvaboy <la...@gmail.com>
Date:   2014-05-12T05:05:15Z

    Add a test case for count distinct and approximate count distinct.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12546014
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -269,6 +308,34 @@ case class CountFunction(expr: Expression, base: AggregateExpression) extends Ag
       override def eval(input: Row): Any = count
     }
     
    +case class ApproxCountDistinctPartitionFunction(expr: Expression, base: AggregateExpression)
    +    extends AggregateFunction {
    +  def this() = this(null, null) // Required for serialization.
    +
    +  private val hyperLogLog = new HyperLogLog(ApproxCountDistinct.RelativeSD)
    +
    +  override def update(input: Row): Unit = {
    +    val evaluatedExpr = expr.eval(input)
    +    Option(evaluatedExpr).foreach(hyperLogLog.offer(_))
    +  }
    +
    +  override def eval(input: Row): Any = hyperLogLog
    +}
    +
    +case class ApproxCountDistinctMergeFunction(expr: Expression, base: AggregateExpression)
    +    extends AggregateFunction {
    +  def this() = this(null, null) // Required for serialization.
    +
    +  private val hyperLogLog = new HyperLogLog(ApproxCountDistinct.RelativeSD)
    +
    +  override def update(input: Row): Unit = {
    +    val evaluatedExpr = expr.eval(input)
    +    Option(evaluatedExpr.asInstanceOf[HyperLogLog]).foreach(hyperLogLog.addAll(_))
    --- End diff --
    
    Will this ever be null?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12546829
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    --- End diff --
    
    style feedback: 2 space indenting for "extends" (We only do 4 space indenting for arguments)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42886754
  
    @marmbrus @rxin ah okay guys - sorry for my wrong comment :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12545901
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -269,6 +308,34 @@ case class CountFunction(expr: Expression, base: AggregateExpression) extends Ag
       override def eval(input: Row): Any = count
     }
     
    +case class ApproxCountDistinctPartitionFunction(expr: Expression, base: AggregateExpression)
    +    extends AggregateFunction {
    +  def this() = this(null, null) // Required for serialization.
    +
    +  private val hyperLogLog = new HyperLogLog(ApproxCountDistinct.RelativeSD)
    +
    +  override def update(input: Row): Unit = {
    +    val evaluatedExpr = expr.eval(input)
    +    Option(evaluatedExpr).foreach(hyperLogLog.offer(_))
    --- End diff --
    
    I'm normally all for the Option pattern, but in this case you are probably incurring more object allocations that we want to in the critical path of query execution.  I'd just use an `if` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12546769
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = child.dataType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctPartitionFunction(child, this)
    +}
    +
    +case class ApproxCountDistinctMerge(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctMergeFunction(child, this)
    +}
    +
    +object ApproxCountDistinct {
    +  val RelativeSD = 0.05
    --- End diff --
    
    Having a default here is reasonable, but we should probably expose this to the user as well.  Maybe two versions in the parser?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12574227
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -269,6 +308,34 @@ case class CountFunction(expr: Expression, base: AggregateExpression) extends Ag
       override def eval(input: Row): Any = count
     }
     
    +case class ApproxCountDistinctPartitionFunction(expr: Expression, base: AggregateExpression)
    +    extends AggregateFunction {
    +  def this() = this(null, null) // Required for serialization.
    +
    +  private val hyperLogLog = new HyperLogLog(ApproxCountDistinct.RelativeSD)
    +
    +  override def update(input: Row): Unit = {
    +    val evaluatedExpr = expr.eval(input)
    +    Option(evaluatedExpr).foreach(hyperLogLog.offer(_))
    --- End diff --
    
    This has been changed into a null check.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42870675
  
    Bypassing SerializableHyperLogLog has a few benefits:
    1. Less memory usage because we don't need the wrapper.
    2. Works with Spark SQL's internal serializer.
    3. stream-lib will actually make HyperLogLog serializable next release - so SerializableHyperLogLog will be gone ....


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12574452
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = child.dataType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctPartitionFunction(child, this)
    +}
    +
    +case class ApproxCountDistinctMerge(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctMergeFunction(child, this)
    +}
    +
    +object ApproxCountDistinct {
    +  val RelativeSD = 0.05
    +}
    +
    +case class ApproxCountDistinct(child: Expression)
    +    extends PartialAggregate with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +
    +  override def asPartial: SplitEvaluation = {
    +    val partialCount = Alias(ApproxCountDistinctPartition(child),
    +                             "PartialApproxCountDistinct")()
    --- End diff --
    
    This has been fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42817556
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12546858
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = child.dataType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctPartitionFunction(child, this)
    +}
    +
    +case class ApproxCountDistinctMerge(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctMergeFunction(child, this)
    +}
    +
    +object ApproxCountDistinct {
    +  val RelativeSD = 0.05
    +}
    +
    +case class ApproxCountDistinct(child: Expression)
    +    extends PartialAggregate with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +
    +  override def asPartial: SplitEvaluation = {
    +    val partialCount = Alias(ApproxCountDistinctPartition(child),
    +                             "PartialApproxCountDistinct")()
    --- End diff --
    
    style feedback: just indent this line using 2 spaces instead of aligning them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42934624
  
    All the review issues should have been fixed in the most recent version of the code. Please let me know if I missed anything.
    
    Thanks a lot for the quick feedback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42862270
  
    This patch duplicates some logic that already exists elsewhere in Spark - would you mind updating it to use this class?: 
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableHyperLogLog.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12574415
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    --- End diff --
    
    All the instances are fixed now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/737


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-43040941
  
    Thanks. I merged this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/737#discussion_r12574394
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala ---
    @@ -166,10 +167,48 @@ case class CountDistinct(expressions: Seq[Expression]) extends AggregateExpressi
       override def references = expressions.flatMap(_.references).toSet
       override def nullable = false
       override def dataType = IntegerType
    -  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")}})"
    +  override def toString = s"COUNT(DISTINCT ${expressions.mkString(",")})"
       override def newInstance() = new CountDistinctFunction(expressions, this)
     }
     
    +case class ApproxCountDistinctPartition(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = child.dataType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctPartitionFunction(child, this)
    +}
    +
    +case class ApproxCountDistinctMerge(child: Expression)
    +    extends AggregateExpression with trees.UnaryNode[Expression] {
    +  override def references = child.references
    +  override def nullable = false
    +  override def dataType = IntegerType
    +  override def toString = s"APPROXIMATE COUNT(DISTINCT $child)"
    +  override def newInstance() = new ApproxCountDistinctMergeFunction(child, this)
    +}
    +
    +object ApproxCountDistinct {
    +  val RelativeSD = 0.05
    --- End diff --
    
    Please refer to the most recent version where we have another parser allowing users to pass in the standard deviation.
    
    The first version has the benefit of hiding the implementation details from the user. The standard deviation is not an intuitive parameter for an end user, especially given its side effect to the memory usage.
    
    Please let me know your thoughts on the new version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42869356
  
    @pwendell, I don't think that will work as Spark SQL does its own serialization for shuffles sometimes using Kryo and I don't think that SerializableHyperLogLog works with Kryo.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42993784
  
    LGTM.  Thanks for doing this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Implement ApproximateCountDistinct for SparkSq...

Posted by larvaboy <gi...@git.apache.org>.

Github user larvaboy commented on the pull request:

    https://github.com/apache/spark/pull/737#issuecomment-42996190
  
    Thanks, Michael.
    
    I just re-arranged my change sets a bit to put them together. Let me know if there's anything else needed to merge this to the upstream.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---