
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/25 10:05:18 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

AngersZhuuuu opened a new pull request #34380:
URL: https://github.com/apache/spark/pull/34380


   ### What changes were proposed in this pull request?
   This PR implements the aggregate function `histogram_numeric`, which returns an approximate histogram of a numerical column using a user-specified number of bins. For example, `histogram_numeric(col, 3)` computes the histogram of column `col` split into 3 bins.
   
   Syntax:
   #### Returns an approximate histogram of a numerical column using a user-specified number of bins.
   histogram_numeric(col, nBins)
   
   #### Returns an approximate histogram of column `col` with 3 bins.
   SELECT histogram_numeric(col, 3) FROM table
   
   #### Returns an approximate histogram of column `col` with 5 bins.
   SELECT histogram_numeric(col, 5) FROM table
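
To make the idea of a bounded-bin approximate histogram with partial aggregation concrete, here is a minimal Scala sketch. It is an illustration under stated assumptions, not the code in this PR: `Bin`, `SketchHistogram`, and `trim` are hypothetical names, and the "fuse the two closest bins" rule is a common streaming-histogram technique that may differ from what `NumericHistogram` does internally.

```scala
// Hypothetical sketch, NOT Spark's NumericHistogram.
// A bin is a center x with a height (count) y.
final case class Bin(x: Double, y: Double)

final case class SketchHistogram(nBins: Int, bins: Vector[Bin] = Vector.empty) {
  require(nBins >= 2, "nBins must be at least 2")

  // Add one value as a unit-height bin, then shrink back to at most nBins bins.
  def add(v: Double): SketchHistogram =
    copy(bins = SketchHistogram.trim(bins :+ Bin(v, 1.0), nBins))

  // Combine two partial histograms; this is the partial-aggregation step.
  def merge(other: SketchHistogram): SketchHistogram =
    copy(bins = SketchHistogram.trim(bins ++ other.bins, nBins))
}

object SketchHistogram {
  // Sort by center, then repeatedly fuse the two closest neighbours
  // into a single bin located at their height-weighted mean.
  def trim(input: Vector[Bin], nBins: Int): Vector[Bin] = {
    var bins = input.sortBy(_.x)
    while (bins.length > nBins) {
      val i = (0 until bins.length - 1).minBy(k => bins(k + 1).x - bins(k).x)
      val (a, b) = (bins(i), bins(i + 1))
      val y = a.y + b.y
      val fused = Bin((a.x * a.y + b.x * b.y) / y, y)
      bins = bins.patch(i, Seq(fused), 2)
    }
    bins
  }
}
```

In this sketch, adding 0, 1, 2, 10 with 5 bins merges nothing and leaves four unit-height bins, while 3 bins fuses 0 and 1 into a bin centered at 0.5 with height 2. Because bin widths follow the data, the result has non-uniform bin widths.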
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No change from user side
   
   ### How was this patch tested?
   Added unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951869706


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49090/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736543090



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null

Review comment:
       Do we allow `nBins` to be null?
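
One way to picture the question is as a single validation step: if a null bin count is rejected up front, along with counts below 2, later code can treat the value as a plain `Int`. The following is a minimal sketch under that assumption; `NBinsCheck` and `validateNBins` are hypothetical names, and the `Any` argument stands in for the result of evaluating a constant `nBins` expression.

```scala
// Hypothetical helper, not part of the PR: validates an already-evaluated nBins value.
object NBinsCheck {
  def validateNBins(evaluated: Any): Either[String, Int] = evaluated match {
    // A null literal carries no bin count, so reject it explicitly.
    case null => Left("nBins must not be null")
    // A histogram with fewer than 2 bins is not useful.
    case n: Int if n < 2 => Left(s"nBins must be at least 2, but got $n")
    case n: Int => Right(n)
    // Anything else is not an integer bin count.
    case other => Left(s"nBins must be an integer, but got ${other.getClass.getSimpleName}")
  }
}
```

Returning `Either` keeps the null case visible at the call site instead of carrying a nullable value that later needs `asInstanceOf[Int]`.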





[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736548424



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} needs the nBins provided must be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins value must not be null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType | TimestampNTZType | DayTimeIntervalType(_, _) =>
+          value.asInstanceOf[Long].toDouble
+        case YearMonthIntervalType(_, _) => value.asInstanceOf[Int].toDouble
+        case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      buffer.add(doubleValue)
+    }
+    buffer
+  }
+
+  override def merge(
+      buffer: NumericHistogram,
+      other: NumericHistogram): NumericHistogram = {
+    buffer.merge(other)
+    buffer
+  }
+
+  override def eval(buffer: NumericHistogram): Any = {
+    if (buffer.getUsedBins < 1) {
+      null
+    } else {
+      val result = (0 until buffer.getUsedBins).map { index =>
+        val coord = buffer.getBin(index)
+        InternalRow.apply(coord.x, coord.y)
+      }
+      new GenericArrayData(result)
+    }
+  }
+
+  override def serialize(obj: NumericHistogram): Array[Byte] = {
+    HistogramNumeric.serializer.serialize(obj)
+  }
+
+  override def deserialize(bytes: Array[Byte]): NumericHistogram = {
+    HistogramNumeric.serializer.deserialize(bytes)
+  }
+
+  override def left: Expression = child
+
+  override def right: Expression = nBins
+
+  override protected def withNewChildrenInternal(
+      newLeft: Expression,
+      newRight: Expression): HistogramNumeric = {
+    copy(child = newLeft, nBins = newRight)
+  }
+
+  override def withNewMutableAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(mutableAggBufferOffset = newOffset)
+
+  override def withNewInputAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(inputAggBufferOffset = newOffset)
+
+  override def nullable: Boolean = true
+
+  override def dataType: DataType =
+    ArrayType(new StructType(Array(StructField("x", DoubleType), StructField("y", DoubleType))))
+
+  override def prettyName: String = "histogram_numeric"
+}
+
+object HistogramNumeric {
+  class NumericHistogramSerializer {

Review comment:
       Since it is stateless, can it be an `object` directly?
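
As an illustration of the suggestion, a serializer that holds no fields can be written as a Scala `object`, so no instance needs to be created or stored. This is a sketch only: `SketchSerializer`, `State`, and the byte layout are assumptions made for the example and are not taken from the PR.

```scala
import java.nio.ByteBuffer

// Hypothetical stateless serializer written as an object.
// Assumed layout: [nBins: Int][nUsed: Int] followed by nUsed (x: Double, y: Double) pairs.
object SketchSerializer {
  final case class State(nBins: Int, bins: Vector[(Double, Double)])

  def serialize(s: State): Array[Byte] = {
    val size = 2 * java.lang.Integer.BYTES + s.bins.length * 2 * java.lang.Double.BYTES
    val buf = ByteBuffer.allocate(size)
    buf.putInt(s.nBins)
    buf.putInt(s.bins.length)
    s.bins.foreach { case (x, y) =>
      buf.putDouble(x)
      buf.putDouble(y)
    }
    buf.array()
  }

  def deserialize(bytes: Array[Byte]): State = {
    val buf = ByteBuffer.wrap(bytes)
    val nBins = buf.getInt()
    val nUsed = buf.getInt()
    // Each element reads x first, then y, matching the write order above.
    val bins = Vector.fill(nUsed)((buf.getDouble(), buf.getDouble()))
    State(nBins, bins)
  }
}
```

Callers would use `SketchSerializer.serialize(state)` directly, with no `new` and no shared instance to manage.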





[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950850504








[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951781225


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49088/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950902361


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144585/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736570184



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} needs the nBins provided must be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins value must not be null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType | TimestampNTZType | DayTimeIntervalType(_, _) =>
+          value.asInstanceOf[Long].toDouble
+        case YearMonthIntervalType(_, _) => value.asInstanceOf[Int].toDouble
+        case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      buffer.add(doubleValue)
+    }
+    buffer
+  }
+
+  override def merge(
+      buffer: NumericHistogram,
+      other: NumericHistogram): NumericHistogram = {
+    buffer.merge(other)
+    buffer
+  }
+
+  override def eval(buffer: NumericHistogram): Any = {
+    if (buffer.getUsedBins < 1) {
+      null
+    } else {
+      val result = (0 until buffer.getUsedBins).map { index =>
+        val coord = buffer.getBin(index)
+        InternalRow.apply(coord.x, coord.y)
+      }
+      new GenericArrayData(result)
+    }
+  }
+
+  override def serialize(obj: NumericHistogram): Array[Byte] = {
+    HistogramNumeric.serializer.serialize(obj)
+  }
+
+  override def deserialize(bytes: Array[Byte]): NumericHistogram = {
+    HistogramNumeric.serializer.deserialize(bytes)
+  }
+
+  override def left: Expression = child
+
+  override def right: Expression = nBins
+
+  override protected def withNewChildrenInternal(
+      newLeft: Expression,
+      newRight: Expression): HistogramNumeric = {
+    copy(child = newLeft, nBins = newRight)
+  }
+
+  override def withNewMutableAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(mutableAggBufferOffset = newOffset)
+
+  override def withNewInputAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(inputAggBufferOffset = newOffset)
+
+  override def nullable: Boolean = true
+
+  override def dataType: DataType =
+    ArrayType(new StructType(Array(StructField("x", DoubleType), StructField("y", DoubleType))))
+
+  override def prettyName: String = "histogram_numeric"
+}
+
+object HistogramNumeric {
+  class NumericHistogramSerializer {

Review comment:
       Done

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} requires nBins to be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} requires nBins to be non-null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {

Review comment:
       Done
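
   The `examples` entry in the hunk above returns one struct per input value, which is consistent with four distinct values fitting inside the five requested bins so that nothing has to be merged. A minimal sketch of that update path follows, assuming the insert-then-merge-closest-pair scheme described in the Ben-Haim and Tom-Tov paper cited by the PR. The class name `StreamingHistogramSketch` and the array-based bin representation are hypothetical, and this is not the PR's `NumericHistogram` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the NumericHistogram class from the PR.
final class StreamingHistogramSketch {
    // Each bin is a two-element array: {center x, height y}, kept sorted by center.
    private final List<double[]> bins = new ArrayList<>();
    private final int maxBins;

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 2) {
            throw new IllegalArgumentException("maxBins must be at least 2");
        }
        this.maxBins = maxBins;
    }

    void add(double v) {
        // Find the first bin whose center is >= v.
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < v) {
            i++;
        }
        if (i < bins.size() && bins.get(i)[0] == v) {
            bins.get(i)[1] += 1.0; // existing center: only raise its height
            return;
        }
        bins.add(i, new double[] {v, 1.0});
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    // Replaces the two adjacent bins with the closest centers by their weighted average.
    // Assumption: the first closest pair wins ties.
    private void mergeClosestPair() {
        int best = 0;
        for (int i = 1; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            double bestGap = bins.get(best + 1)[0] - bins.get(best)[0];
            if (gap < bestGap) {
                best = i;
            }
        }
        double[] a = bins.get(best);
        double[] b = bins.remove(best + 1);
        double y = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / y;
        a[1] = y;
    }

    List<double[]> bins() {
        return bins;
    }

    public static void main(String[] args) {
        StreamingHistogramSketch h = new StreamingHistogramSketch(5);
        for (double v : new double[] {0, 1, 2, 10}) {
            h.add(v);
        }
        // Four bins, each with height 1.0, since 4 distinct values <= 5 bins.
        for (double[] bin : h.bins()) {
            System.out.println("x=" + bin[0] + " y=" + bin[1]);
        }
    }
}
```

   With a smaller limit, for example 2 bins and the inputs 0, 1, 10, the sketch merges the two closest centers into a single bin at 0.5 with height 2.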





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951974145


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144618/
   



[GitHub] [spark] AngersZhuuuu commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951491333


   retest this please



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r737139587



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its fields as public for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize a [[NumericHistogram]]
Review comment:
       > instead of adding `setBins`, can we just take `nBins` parameter in the constructor?
   
   You mean add `nbins` `bins` `nusedbins` as parameter in constructor?





[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r737142979



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its fields as public for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize a [[NumericHistogram]]
Review comment:
       nvm, I misread the code.
   
   I think the name is misleading. It should be `addBin`?
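
   The naming concern follows from how the method is written in the quoted file: `setBin` calls `bins.add(b, coord)`, and `ArrayList.add(int, E)` inserts a new element and shifts later ones, whereas `ArrayList.set(int, E)` overwrites in place. A minimal sketch of the difference, using a hypothetical helper class `AddVersusSetDemo` that is not part of the PR:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of the PR.
final class AddVersusSetDemo {
    // add(index, value) inserts and shifts later elements right; the list grows by one.
    static List<Double> insertAt(List<Double> xs, int index, double value) {
        List<Double> copy = new ArrayList<>(xs);
        copy.add(index, value);
        return copy;
    }

    // set(index, value) overwrites the element at index; the size is unchanged.
    static List<Double> overwriteAt(List<Double> xs, int index, double value) {
        List<Double> copy = new ArrayList<>(xs);
        copy.set(index, value);
        return copy;
    }

    public static void main(String[] args) {
        List<Double> xs = Arrays.asList(1.0, 2.0, 3.0);
        System.out.println(insertAt(xs, 1, 9.0));    // prints [1.0, 9.0, 2.0, 3.0]
        System.out.println(overwriteAt(xs, 1, 9.0)); // prints [1.0, 9.0, 3.0]
    }
}
```

   Since the method grows the list rather than replacing an entry, a name based on "add" describes the behavior more closely than one based on "set".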





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951941306


   **[Test build #144618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144618/testReport)** for PR 34380 at commit [`beab716`](https://github.com/apache/spark/commit/beab7169682231e9b2eaf7d5d625ae7f746abef8).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952702317


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49120/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952048290


   **[Test build #144620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144620/testReport)** for PR 34380 at commit [`4761493`](https://github.com/apache/spark/commit/4761493af67ea49e27ba887af4246c202438bc57).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951782985


   **[Test build #144620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144620/testReport)** for PR 34380 at commit [`4761493`](https://github.com/apache/spark/commit/4761493af67ea49e27ba887af4246c202438bc57).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952605355


   **[Test build #144650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144650/testReport)** for PR 34380 at commit [`5ec9afb`](https://github.com/apache/spark/commit/5ec9afba21c2ba2af852330bcedac9f74005c6a5).



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952605355


   **[Test build #144650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144650/testReport)** for PR 34380 at commit [`5ec9afb`](https://github.com/apache/spark/commit/5ec9afba21c2ba2af852330bcedac9f74005c6a5).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952084535


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49093/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952050308


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144620/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951765651


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49088/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736609256



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its fields as public for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   4. In Hive's code, the method [[merge()]] takes a serialized histogram;
+ *      in Spark, this method takes a deserialized histogram.
+ *      The bin-merging code is changed accordingly.
+ */
+public class NumericHistogram {
+    /**
+     * The Coord class defines a histogram bin, which is just an (x,y) pair.
+     */
+    public static class Coord implements Comparable {
+        public double x;
+        public double y;
+
+        public int compareTo(Object other) {
+            return Double.compare(x, ((Coord) other).x);
+        }
+    };
+
+    // Class variables
+    private int nbins;
+    private int nusedbins;
+    private ArrayList<Coord> bins;
+    private Random prng;
+
+    /**
+     * Creates a new histogram object. Note that the allocate() or merge()
+     * method must be called before the histogram can be used.
+     */
+    public NumericHistogram() {
+        nbins = 0;
+        nusedbins = 0;
+        bins = null;
+
+        // init the RNG for breaking ties in histogram merging. A fixed seed is specified here
+        // to aid testing, but can be eliminated to use a time-based seed (which would
+        // make the algorithm non-deterministic).
+        prng = new Random(31183);
+    }
+
+    /**
+     * Resets a histogram object to its initial state. allocate() or merge() must be
+     * called again before use.
+     */
+    public void reset() {
+        bins = null;
+        nbins = nusedbins = 0;
+    }
+
+    /**
+     * Returns the number of bins.
+     */
+    public int getNBins() {
+        return nbins;
+    }
+
+    /**
+     * Returns the number of bins currently being used by the histogram.
+     */
+    public int getUsedBins() {
+        return nusedbins;
+    }
+
+    /**
+     * Set the number of bins currently being used by the histogram.
+     */
+    public void setUsedBins(int nusedBins) {
+        this.nusedbins = nusedBins;
+    }
+
+    /**
+     * Returns true if this histogram object has been initialized by calling merge()
+     * or allocate().
+     */
+    public boolean isReady() {
+        return nbins != 0;
+    }
+
+    /**
+     * Returns a particular histogram bin.
+     */
+    public Coord getBin(int b) {
+        return bins.get(b);
+    }
+
+    /**
+     * Set a particular histogram bin with index.
+     */
+    public void setBin(double x, double y, int b) {
+        Coord coord = new Coord();
+        coord.x = x;
+        coord.y = y;
+        bins.add(b, coord);
+    }
+
+    /**
+     * Sets the number of histogram bins to use for approximating data.
+     *
+     * @param num_bins Number of non-uniform-width histogram bins to use
+     */
+    public void allocate(int num_bins) {
+        nbins = num_bins;
+        bins = new ArrayList<Coord>();
+        nusedbins = 0;
+    }
+
+    /**
+     * Takes a serialized histogram created by the serialize() method and merges

Review comment:
       We need to update the doc here.
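
   For context on what an updated doc would describe: per the class comment in the quoted file, Spark's `merge()` receives a deserialized histogram rather than serialized bytes. A minimal sketch of such a merge follows, based on the general idea in the Ben-Haim and Tom-Tov paper cited in the class comment (combine the bins, sort by center, then fold the closest adjacent pair until the bin limit is met). The class `HistogramMergeSketch`, the `Bin` type, and the first-pair tie-breaking are assumptions for illustration; the quoted `NumericHistogram` keeps a seeded `Random` for tie-breaking, and its actual merge code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the merge() implementation from the PR.
final class HistogramMergeSketch {
    // x is the bin center, y is the bin height (assumed positive).
    static final class Bin {
        final double x;
        final double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Merges two already-deserialized bin lists and trims the result to at most maxBins bins.
    static List<Bin> merge(List<Bin> left, List<Bin> right, int maxBins) {
        List<Bin> bins = new ArrayList<>(left);
        bins.addAll(right);
        bins.sort(Comparator.comparingDouble((Bin b) -> b.x));

        while (bins.size() > maxBins && bins.size() > 1) {
            // Find the adjacent pair with the smallest gap between centers
            // (assumption: the first such pair wins ties).
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin a = bins.get(best);
            Bin b = bins.get(best + 1);
            double y = a.y + b.y;
            // The new center is the height-weighted average of the two centers.
            bins.set(best, new Bin((a.x * a.y + b.x * b.y) / y, y));
            bins.remove(best + 1);
        }
        return bins;
    }

    public static void main(String[] args) {
        List<Bin> left = new ArrayList<>();
        left.add(new Bin(1.0, 1.0));
        left.add(new Bin(2.0, 1.0));
        List<Bin> right = new ArrayList<>();
        right.add(new Bin(10.0, 1.0));
        for (Bin b : merge(left, right, 2)) {
            System.out.println("x=" + b.x + " y=" + b.y);
        }
    }
}
```

   In the `main` example, the bins at 1.0 and 2.0 are the closest pair, so they fold into one bin at 1.5 with height 2.0, and the bin at 10.0 is left unchanged.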





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951992226


   **[Test build #144623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144623/testReport)** for PR 34380 at commit [`d03a64c`](https://github.com/apache/spark/commit/d03a64cf0b4e5e303c35acabab84315b9247b09b).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951981661


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144622/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r737138980



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its fields as public for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize a [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   4. In Hive's code, the method [[merge()]] takes a serialized histogram;
+ *      in Spark, this method takes a deserialized histogram.
+ *      The bin-merging code is changed accordingly.
+ */
+public class NumericHistogram {
+    /**
+     * The Coord class defines a histogram bin, which is just an (x,y) pair.
+     */
+    public static class Coord implements Comparable {
+        public double x;
+        public double y;
+
+        public int compareTo(Object other) {
+            return Double.compare(x, ((Coord) other).x);
+        }
+    };
+
+    // Class variables
+    private int nbins;
+    private int nusedbins;
+    private ArrayList<Coord> bins;
+    private Random prng;
+
+    /**
+     * Creates a new histogram object. Note that the allocate() or merge()
+     * method must be called before the histogram can be used.
+     */
+    public NumericHistogram() {
+        nbins = 0;
+        nusedbins = 0;
+        bins = null;
+
+        // init the RNG for breaking ties in histogram merging. A fixed seed is specified here
+        // to aid testing, but can be eliminated to use a time-based seed (which would
+        // make the algorithm non-deterministic).
+        prng = new Random(31183);
+    }
+
+    /**
+     * Resets a histogram object to its initial state. allocate() or merge() must be
+     * called again before use.
+     */
+    public void reset() {
+        bins = null;
+        nbins = nusedbins = 0;
+    }
+
+    /**
+     * Returns the number of bins.
+     */
+    public int getNBins() {
+        return nbins;
+    }
+
+    /**
+     * Returns the number of bins currently being used by the histogram.
+     */
+    public int getUsedBins() {
+        return nusedbins;
+    }
+
+    /**
+     * Set the number of bins currently being used by the histogram.
+     */
+    public void setUsedBins(int nusedBins) {
+        this.nusedbins = nusedBins;
+    }
+
+    /**
+     * Returns true if this histogram object has been initialized by calling merge()
+     * or allocate().
+     */
+    public boolean isReady() {
+        return nbins != 0;
+    }
+
+    /**
+     * Returns a particular histogram bin.
+     */
+    public Coord getBin(int b) {
+        return bins.get(b);
+    }
+
+    /**
+     * Set a particular histogram bin with index.
+     */
+    public void setBin(double x, double y, int b) {
+        Coord coord = new Coord();
+        coord.x = x;
+        coord.y = y;
+        bins.add(b, coord);
+    }
+
+    /**
+     * Sets the number of histogram bins to use for approximating data.
+     *
+     * @param num_bins Number of non-uniform-width histogram bins to use
+     */
+    public void allocate(int num_bins) {
+        nbins = num_bins;
+        bins = new ArrayList<Coord>();
+        nusedbins = 0;
+    }
+
+    /**
+     * Takes a serialized histogram created by the serialize() method and merges

Review comment:
       Done





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r737152698



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its variables as public for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize [[NumericHistogram]]

Review comment:
       > nvm, I misread the code.
   > 
   > I think the name is misleading. It should be `addBin`?
   
   Yes, updated. I also renamed `getNBins` to `getNumBins` to make the name clearer.
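
   A minimal Java sketch of why the original name was misleading, assuming the method body still calls `bins.add(b, coord)` as in the quoted revision: `ArrayList.add(int, E)` inserts at the index and shifts later elements to the right, whereas `set(int, E)` overwrites in place. The class `AddVersusSetSketch` and its helper names are hypothetical and not part of the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from the PR.
class AddVersusSetSketch {
    // Same call shape as the quoted method: inserts and shifts, so the list grows.
    static List<Double> insertAt(List<Double> source, int index, double value) {
        List<Double> copy = new ArrayList<>(source);
        copy.add(index, value);
        return copy;
    }

    // What a name starting with "set" usually implies: overwrite, size unchanged.
    static List<Double> replaceAt(List<Double> source, int index, double value) {
        List<Double> copy = new ArrayList<>(source);
        copy.set(index, value);
        return copy;
    }

    public static void main(String[] args) {
        List<Double> bins = Arrays.asList(1.0, 2.0, 3.0);
        System.out.println(insertAt(bins, 1, 9.0));  // prints [1.0, 9.0, 2.0, 3.0]
        System.out.println(replaceAt(bins, 1, 9.0)); // prints [1.0, 9.0, 3.0]
    }
}
```

   Because the deserializer appends bins one by one into an empty list, the insert behaviour is what it needs, which is why a name such as `addBin` describes the method more accurately.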





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951869706


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49090/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951110599


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49063/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951660183


   **[Test build #144610 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144610/testReport)** for PR 34380 at commit [`ac8936e`](https://github.com/apache/spark/commit/ac8936e6b3a17f80a4737a838882e3a3d3ec53c2).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736260547



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.{DistributedHistogramSerializer, DistributeHistogram, GenericArrayData}
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection}
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[DistributeHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType since their internal types
+    // are all numeric, and can be easily cast to double for processing.

Review comment:
       Should this also support the new ANSI interval types?
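
       A hedged sketch of why interval values fit the same "cast to double" path as the types already listed. It assumes Spark's internal representation, where a year-month interval is an int count of months and a day-time interval is a long count of microseconds. The class `IntervalToDoubleSketch` and its method names are hypothetical; the sketch uses `java.time` values in place of Spark's internal rows.

```java
import java.time.Duration;
import java.time.Period;

// Hypothetical illustration only; not code from the PR.
class IntervalToDoubleSketch {
    // Year-month interval: total months (the day component of Period is ignored).
    static double yearMonthToDouble(Period p) {
        return (double) p.toTotalMonths();
    }

    // Day-time interval: total microseconds, computed with overflow checks.
    static double dayTimeToDouble(Duration d) {
        long micros = Math.addExact(
            Math.multiplyExact(d.getSeconds(), 1_000_000L),
            d.getNano() / 1_000L);
        return (double) micros;
    }

    public static void main(String[] args) {
        System.out.println(yearMonthToDouble(Period.of(1, 2, 0)));   // prints 14.0
        System.out.println(dayTimeToDouble(Duration.ofMillis(1500))); // prints 1500000.0
    }
}
```

       One caveat of any long-to-double conversion is that magnitudes above 2^53 can no longer be represented exactly, so very large microsecond counts would be rounded.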





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951722931


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49088/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952032798


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49093/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952871411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144650/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736549191



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} needs the nBins provided must be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins value must not be null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType | TimestampNTZType | DayTimeIntervalType(_, _) =>
+          value.asInstanceOf[Long].toDouble
+        case YearMonthIntervalType(_, _) => value.asInstanceOf[Int].toDouble
+        case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      buffer.add(doubleValue)
+    }
+    buffer
+  }
+
+  override def merge(
+      buffer: NumericHistogram,
+      other: NumericHistogram): NumericHistogram = {
+    buffer.merge(other)
+    buffer
+  }
+
+  override def eval(buffer: NumericHistogram): Any = {
+    if (buffer.getUsedBins < 1) {
+      null
+    } else {
+      val result = (0 until buffer.getUsedBins).map { index =>
+        val coord = buffer.getBin(index)
+        InternalRow.apply(coord.x, coord.y)
+      }
+      new GenericArrayData(result)
+    }
+  }
+
+  override def serialize(obj: NumericHistogram): Array[Byte] = {
+    HistogramNumeric.serializer.serialize(obj)
+  }
+
+  override def deserialize(bytes: Array[Byte]): NumericHistogram = {
+    HistogramNumeric.serializer.deserialize(bytes)
+  }
+
+  override def left: Expression = child
+
+  override def right: Expression = nBins
+
+  override protected def withNewChildrenInternal(
+      newLeft: Expression,
+      newRight: Expression): HistogramNumeric = {
+    copy(child = newLeft, nBins = newRight)
+  }
+
+  override def withNewMutableAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(mutableAggBufferOffset = newOffset)
+
+  override def withNewInputAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(inputAggBufferOffset = newOffset)
+
+  override def nullable: Boolean = true
+
+  override def dataType: DataType =
+    ArrayType(new StructType(Array(StructField("x", DoubleType), StructField("y", DoubleType))))

Review comment:
       What about the nullability of the array elements and the struct fields?





[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736542548



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,281 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. Can refer to

Review comment:
       Can you highlight the difference?
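
       For readers following the thread, the heuristic that the quoted Javadoc attributes to Ben-Haim and Tom-Tov can be sketched roughly as follows: insert each value as a new bin of height 1, and once the bin count exceeds the limit, merge the two adjacent bins whose centers are closest into one bin at their height-weighted mean. This is a simplified sketch under stated assumptions, not the class under review: `StreamingHistogramSketch` is a hypothetical name, it uses a linear scan, and it breaks ties by taking the leftmost pair, whereas the quoted class seeds a `Random` for tie-breaking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; not code from the PR or from Hive.
class StreamingHistogramSketch {
    private final int maxBins;
    // Each bin is {x, y}: x is the bin center, y is the bin height (count).
    private final List<double[]> bins = new ArrayList<>();

    StreamingHistogramSketch(int maxBins) {
        this.maxBins = maxBins;
    }

    void add(double v) {
        // Find the first bin whose center is >= v (linear scan for brevity).
        int pos = 0;
        while (pos < bins.size() && bins.get(pos)[0] < v) {
            pos++;
        }
        if (pos < bins.size() && bins.get(pos)[0] == v) {
            bins.get(pos)[1] += 1.0; // exact match: grow the existing bin
            return;
        }
        bins.add(pos, new double[] {v, 1.0});
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    // Merges the adjacent pair with the smallest gap (leftmost pair on ties).
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.MAX_VALUE;
        for (int i = 0; i < bins.size() - 1; i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        double[] a = bins.get(best);
        double[] b = bins.remove(best + 1);
        double total = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / total; // height-weighted center
        a[1] = total;
    }

    List<double[]> bins() {
        return bins;
    }
}
```

       With a limit of 5 bins and the inputs 0, 1, 2, 10 no merge is triggered, so each value keeps its own bin of height 1. With a limit of 3, the sketch merges 0 and 1 into a bin centered at 0.5 with height 2.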





[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950754653


   **[Test build #144585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144585/testReport)** for PR 34380 at commit [`541d9cd`](https://github.com/apache/spark/commit/541d9cd64ecff490d474a5396957442d01729608).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950801606


   **[Test build #144586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144586/testReport)** for PR 34380 at commit [`d4c607e`](https://github.com/apache/spark/commit/d4c607ee3fa3845941897ae07cd39bca92fdaee8).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950902361


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144585/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950952287


   **[Test build #144586 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144586/testReport)** for PR 34380 at commit [`d4c607e`](https://github.com/apache/spark/commit/d4c607ee3fa3845941897ae07cd39bca92fdaee8).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950904171


   **[Test build #144589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144589/testReport)** for PR 34380 at commit [`33b5e98`](https://github.com/apache/spark/commit/33b5e98444e29da2f9e476ff4ed17e43cac21502).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950958756


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144586/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951544130


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49076/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951004142


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49060/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736555885



##########
File path: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
##########
@@ -969,7 +969,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
     "type_cast_1",
     "type_widening",
     "udaf_collect_set",
-    "udaf_histogram_numeric",

Review comment:
       This removal looks like a mistake; I will revert this change.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952032759


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144623/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951848954


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49090/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950829406


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49056/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950835542


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49057/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951492159


   **[Test build #144605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144605/testReport)** for PR 34380 at commit [`20c05de`](https://github.com/apache/spark/commit/20c05de766e15f856217c676b9b367a9e68a137e).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951981602


   **[Test build #144622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144622/testReport)** for PR 34380 at commit [`aa32be2`](https://github.com/apache/spark/commit/aa32be2bf821bf8f7eab861683ad9f81f85580f5).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951975712


   **[Test build #144622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144622/testReport)** for PR 34380 at commit [`aa32be2`](https://github.com/apache/spark/commit/aa32be2bf821bf8f7eab861683ad9f81f85580f5).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951140072


   **[Test build #144590 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144590/testReport)** for PR 34380 at commit [`e2ad833`](https://github.com/apache/spark/commit/e2ad83346164a4256d5b17c2311fa1a291e23c18).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951539908


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49080/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951663716


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144610/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r737152698



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declares [[Coord]] and its variables as public types for
+ *      easy access in the HistogramNumeric class.
+ *   2. Adds method [[getNBins()]] to serialize [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Adds method [[setBin()]] to deserialize [[NumericHistogram]]

Review comment:
       > nvm, I misread the code.
   > 
   > I think the name is misleading. It should be `addBin`?
   
   Yes, updated.
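
   For readers following the Javadoc quoted above, here is a minimal sketch of the kind of streaming histogram it describes: each bin is a centroid with a count, and when the bin limit is exceeded the two closest bins are merged into their weighted mean. This is an illustration only and not the code in this PR. The class name `StreamingHistogramSketch`, the `Bin` type, and the `merge` method are hypothetical, and ties between equal gaps are resolved by taking the first pair, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified sketch; NOT the NumericHistogram class from the PR. */
public class StreamingHistogramSketch {

    /** One bin: x is the centroid, y is the number of points it represents. */
    public static final class Bin {
        public double x;
        public double y;

        Bin(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    private final int nBins;
    private final List<Bin> bins = new ArrayList<>();

    public StreamingHistogramSketch(int nBins) {
        if (nBins < 1) {
            throw new IllegalArgumentException("nBins must be positive");
        }
        this.nBins = nBins;
    }

    public int getNBins() {
        return nBins;
    }

    public List<Bin> getBins() {
        return bins;
    }

    /** Adds one observation, then trims back to at most nBins bins. */
    public void add(double v) {
        insert(v, 1.0);
        trim();
    }

    /** Folds another partial histogram into this one (partial aggregation). */
    public void merge(StreamingHistogramSketch other) {
        for (Bin b : other.bins) {
            insert(b.x, b.y);
        }
        trim();
    }

    /** Inserts a weighted point while keeping the bins sorted by centroid. */
    private void insert(double x, double y) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < x) {
            i++;
        }
        if (i < bins.size() && bins.get(i).x == x) {
            bins.get(i).y += y; // same centroid: only the count grows
        } else {
            bins.add(i, new Bin(x, y));
        }
    }

    /** Repeatedly merges the two closest adjacent bins until nBins remain. */
    private void trim() {
        while (bins.size() > nBins) {
            int best = 0;
            double bestGap = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin a = bins.get(best);
            Bin b = bins.remove(best + 1);
            double total = a.y + b.y;
            a.x = (a.x * a.y + b.x * b.y) / total; // weighted mean of the centroids
            a.y = total;
        }
    }
}
```

   Because each partial histogram is fully described by its bin limit and its list of (x, y) pairs, a serializer only needs an accessor for the bin limit and a way to add bins back one at a time, which is the role the Javadoc assigns to the added methods.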





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952633764


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49118/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950904171


   **[Test build #144589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144589/testReport)** for PR 34380 at commit [`33b5e98`](https://github.com/apache/spark/commit/33b5e98444e29da2f9e476ff4ed17e43cac21502).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950886914


   **[Test build #144585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144585/testReport)** for PR 34380 at commit [`541d9cd`](https://github.com/apache/spark/commit/541d9cd64ecff490d474a5396957442d01729608).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736553963



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,281 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. See

Review comment:
       Done





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951981661


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144622/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952004942


   **[Test build #144623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144623/testReport)** for PR 34380 at commit [`d03a64c`](https://github.com/apache/spark/commit/d03a64cf0b4e5e303c35acabab84315b9247b09b).
    * This patch **fails Java style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951992226


   **[Test build #144623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144623/testReport)** for PR 34380 at commit [`d03a64c`](https://github.com/apache/spark/commit/d03a64cf0b4e5e303c35acabab84315b9247b09b).



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950908116


   **[Test build #144590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144590/testReport)** for PR 34380 at commit [`e2ad833`](https://github.com/apache/spark/commit/e2ad83346164a4256d5b17c2311fa1a291e23c18).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951544115


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49076/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952820647


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144648/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952822597


   **[Test build #144647 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144647/testReport)** for PR 34380 at commit [`8d4504d`](https://github.com/apache/spark/commit/8d4504d33a50dc0e2f66321236371116dd6f22bc).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952869314


   **[Test build #144650 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144650/testReport)** for PR 34380 at commit [`5ec9afb`](https://github.com/apache/spark/commit/5ec9afba21c2ba2af852330bcedac9f74005c6a5).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950958756


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144586/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950946941


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49060/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951539868


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49080/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951539908


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49080/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951782985


   **[Test build #144620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144620/testReport)** for PR 34380 at commit [`4761493`](https://github.com/apache/spark/commit/4761493af67ea49e27ba887af4246c202438bc57).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952087891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49093/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952087891


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49093/
   



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736340592



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.{DistributedHistogramSerializer, DistributeHistogram, GenericArrayData}
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection}
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[DistributeHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType since their internal types
+    // are all numeric, and can be easily cast to double for processing.

Review comment:
       Done





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952580017


   **[Test build #144648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144648/testReport)** for PR 34380 at commit [`c7a8c4b`](https://github.com/apache/spark/commit/c7a8c4b4c21fb113079863e72da4c489ecc05700).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951544130


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49076/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952050308


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144620/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952032759


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144623/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736608535



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,288 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. Can refer to
+ * https://github.com/apache/hive/blob/master/ql/src/
+ * java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
+ *
+ * Differences:
+ *   1. Declaring [[Coord]] and it's variables as public types for
+ *      easy access in the HistogramNumeric class.
+ *   2. Add method [[getNBins()]] for serialize [[NumericHistogram]]
+ *      in [[NumericHistogramSerializer]].
+ *   3. Add method [[setBin()]] for deserialize [[NumericHistogram]]

Review comment:
       Instead of adding `setBins`, can we just take an `nBins` parameter in the constructor?
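
A minimal sketch of what this suggestion could look like, together with the bin-merging heuristic the quoted class comment describes. Everything here is an assumption for illustration: `SketchHistogram`, `getUsedBins`, `getBin`, and `mergeClosestPair` are hypothetical names, not the PR's or Hive's code, and ties between equally close bins are resolved by taking the first pair rather than by whatever rule the real class uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for NumericHistogram (not the PR's class).
final class SketchHistogram {
    // One bin: x is the bin center, y is the bin height (count).
    static final class Coord {
        double x;
        double y;

        Coord(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    // Fixed at construction time, so no setter is needed afterwards.
    private final int nBins;
    // Bins are kept sorted by center x.
    private final List<Coord> bins;

    SketchHistogram(int nBins) {
        if (nBins < 2) {
            throw new IllegalArgumentException("nBins must be at least 2, got " + nBins);
        }
        this.nBins = nBins;
        this.bins = new ArrayList<>(nBins + 1);
    }

    int getNBins() {
        return nBins;
    }

    int getUsedBins() {
        return bins.size();
    }

    Coord getBin(int i) {
        return bins.get(i);
    }

    // Adds one value; if the bin budget is exceeded, merges the two closest bins.
    void add(double v) {
        int pos = 0;
        while (pos < bins.size() && bins.get(pos).x < v) {
            pos++;
        }
        if (pos < bins.size() && bins.get(pos).x == v) {
            bins.get(pos).y += 1.0; // exact hit: just raise the height
            return;
        }
        bins.add(pos, new Coord(v, 1.0));
        if (bins.size() > nBins) {
            mergeClosestPair();
        }
    }

    // Replaces the adjacent pair with the smallest gap by its weighted mean.
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.MAX_VALUE;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Coord left = bins.get(best);
        Coord right = bins.remove(best + 1);
        double total = left.y + right.y;
        left.x = (left.x * left.y + right.x * right.y) / total;
        left.y = total;
    }
}
```

With the bin count passed to the constructor, a deserializer could read `nBins` first and call `new SketchHistogram(nBins)` directly, which keeps the field `final` and avoids a half-initialized object.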





[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951984447


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49092/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952645961


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49118/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952578256


   **[Test build #144647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144647/testReport)** for PR 34380 at commit [`8d4504d`](https://github.com/apache/spark/commit/8d4504d33a50dc0e2f66321236371116dd6f22bc).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951008642


   **[Test build #144589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144589/testReport)** for PR 34380 at commit [`33b5e98`](https://github.com/apache/spark/commit/33b5e98444e29da2f9e476ff4ed17e43cac21502).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `class DistributedHistogramSerializer `



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951016811







[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951975712


   **[Test build #144622 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144622/testReport)** for PR 34380 at commit [`aa32be2`](https://github.com/apache/spark/commit/aa32be2bf821bf8f7eab861683ad9f81f85580f5).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951984416


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49092/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951984447


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49092/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951667696


   **[Test build #144618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144618/testReport)** for PR 34380 at commit [`beab716`](https://github.com/apache/spark/commit/beab7169682231e9b2eaf7d5d625ae7f746abef8).



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736547157



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} needs the nBins provided must be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins value must not be null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {

Review comment:
       How about `value.asInstanceOf[Number].doubleValue`? Then we don't need the pattern match here.
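A minimal Java sketch of the idea behind this suggestion, assuming the evaluated value arrives as a boxed `java.lang.Number` (the quoted diff casts dates and year-month intervals to `Int`, and timestamps and day-time intervals to `Long`). The class name `NumberToDouble`, the helper `toDouble`, and the explicit non-`Number` check are hypothetical additions for illustration, not part of the PR.

```java
public class NumberToDouble {
    /**
     * Converts a boxed numeric value to a Double without inspecting its concrete type.
     * Returns null for null input, mirroring the "ignore empty rows" branch of update().
     */
    public static Double toDouble(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Number) {
            // Integer, Long, Short, Float, Double, ... all share Number.doubleValue().
            return ((Number) value).doubleValue();
        }
        // Assumption: anything that is not a java.lang.Number needs separate handling.
        throw new IllegalArgumentException(
            "Not a java.lang.Number: " + value.getClass().getName());
    }
}
```

The trade-off is that a single cast replaces the per-type match, but any value whose internal representation is not a `java.lang.Number` would fail at runtime rather than being rejected by an explicit case.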





[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736555820



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,281 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. Can refer to

Review comment:
       Where is the highlight? I thought you would add some GitHub comments or update the classdoc.
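For readers unfamiliar with the heuristic the classdoc cites (Ben-Haim and Tom-Tov), the following is a minimal Java sketch of a streaming histogram with a fixed bin budget. It is not the `NumericHistogram` class under review: the names `StreamingHistogramSketch`, `Bin`, `usedBins`, and `bin` are hypothetical, and ties between equally close bins are resolved by taking the first pair, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamingHistogramSketch {
    /** One bin: x is the bin center, y is the (possibly merged) count. */
    public static final class Bin {
        public double x;
        public double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    public StreamingHistogramSketch(int maxBins) {
        if (maxBins < 2) {
            throw new IllegalArgumentException("maxBins must be at least 2");
        }
        this.maxBins = maxBins;
    }

    /** Adds one value: bump a bin with the same center, else insert a new bin and trim. */
    public void add(double v) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) i++;
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1.0;
        } else {
            bins.add(i, new Bin(v, 1.0));
            trim();
        }
    }

    /** Folds another partial histogram into this one, as partial aggregation requires. */
    public void merge(StreamingHistogramSketch other) {
        for (Bin b : other.bins) {
            int i = 0;
            while (i < bins.size() && bins.get(i).x < b.x) i++;
            bins.add(i, new Bin(b.x, b.y));
        }
        trim();
    }

    /** While over budget, replace the two closest adjacent bins by their weighted average. */
    private void trim() {
        while (bins.size() > maxBins) {
            int best = 0;
            double smallest = Double.MAX_VALUE;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < smallest) { smallest = gap; best = i; }
            }
            Bin a = bins.get(best);
            Bin b = bins.remove(best + 1);
            double total = a.y + b.y;
            a.x = (a.x * a.y + b.x * b.y) / total;
            a.y = total;
        }
    }

    public int usedBins() { return bins.size(); }

    public Bin bin(int i) { return bins.get(i); }
}
```

With a budget of 5 bins, adding 0, 1, 2, and 10 leaves four bins of height 1.0, matching the example in the function's documentation. With a budget of 3, the two closest centers (0 and 1) collapse into a single bin at 0.5 with height 2.0, which is why the resulting bin widths are non-uniform.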





[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951667696


   **[Test build #144618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144618/testReport)** for PR 34380 at commit [`beab716`](https://github.com/apache/spark/commit/beab7169682231e9b2eaf7d5d625ae7f746abef8).



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736569177



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.nio.ByteBuffer
+
+import com.google.common.primitives.{Doubles, Ints}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, ImplicitCastInputTypes}
+import org.apache.spark.sql.catalyst.trees.BinaryLike
+import org.apache.spark.sql.catalyst.util.GenericArrayData
+import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.types.{AbstractDataType, ArrayType, DataType, DateType, DayTimeIntervalType, DoubleType, IntegerType, NumericType, StructField, StructType, TimestampNTZType, TimestampType, TypeCollection, YearMonthIntervalType}
+import org.apache.spark.sql.util.NumericHistogram
+
+/**
+ * Computes an approximate histogram of a numerical column using a user-specified number of bins.
+ *
+ * The output is an array of (x,y) pairs as struct objects that represents the histogram's
+ * bin centers and heights.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(expr, nb) - Computes a histogram on numeric 'expr' using nb bins.
+      The return value is an array of (x,y) pairs representing the centers of the
+      histogram's bins. As the value of 'nb' is increased, the histogram approximation
+      gets finer-grained, but may yield artifacts around outliers. In practice, 20-40
+      histogram bins appear to work well, with more bins being required for skewed or
+      smaller datasets. Note that this function creates a histogram with non-uniform
+      bin widths. It offers no guarantees in terms of the mean-squared-error of the
+      histogram, but in practice is comparable to the histograms produced by the R/S-Plus
+      statistical computing packages.
+    """,
+  examples = """
+    Examples:
+      > SELECT _FUNC_(col, 5) FROM VALUES (0), (1), (2), (10) AS tab(col);
+       [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
+  """,
+  group = "agg_funcs",
+  since = "3.3.0")
+case class HistogramNumeric(
+    child: Expression,
+    nBins: Expression,
+    override val mutableAggBufferOffset: Int,
+    override val inputAggBufferOffset: Int)
+  extends TypedImperativeAggregate[NumericHistogram] with ImplicitCastInputTypes
+  with BinaryLike[Expression] {
+
+  def this(child: Expression, nBins: Expression) = {
+    this(child, nBins, 0, 0)
+  }
+
+  private lazy val nb = nBins.eval() match {
+    case null => null
+    case n: Int => n
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    // Support NumericType, DateType, TimestampType and TimestampNTZType, YearMonthIntervalType,
+    // DayTimeIntervalType since their internal types are all numeric,
+    // and can be easily cast to double for processing.
+    Seq(TypeCollection(NumericType, DateType, TimestampType, TimestampNTZType,
+      YearMonthIntervalType, DayTimeIntervalType), IntegerType)
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if (!nBins.foldable) {
+      TypeCheckFailure(s"${this.prettyName} needs the nBins provided must be a constant literal.")
+    } else if (nb == null) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins value must not be null.")
+    } else if (nb.asInstanceOf[Int] < 2) {
+      TypeCheckFailure(s"${this.prettyName} needs nBins to be at least 2, but you supplied $nb.")
+    } else {
+      TypeCheckSuccess
+    }
+  }
+
+  override def createAggregationBuffer(): NumericHistogram = {
+    val buffer = new NumericHistogram()
+    buffer.allocate(nb.asInstanceOf[Int])
+    buffer
+  }
+
+  override def update(buffer: NumericHistogram, inputRow: InternalRow): NumericHistogram = {
+    val value = child.eval(inputRow)
+    // Ignore empty rows, for example: histogram_numeric(null)
+    if (value != null) {
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType | TimestampNTZType | DayTimeIntervalType(_, _) =>
+          value.asInstanceOf[Long].toDouble
+        case YearMonthIntervalType(_, _) => value.asInstanceOf[Int].toDouble
+        case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
+        case other: DataType =>
+          throw QueryExecutionErrors.dataTypeUnexpectedError(other)
+      }
+      buffer.add(doubleValue)
+    }
+    buffer
+  }
+
+  override def merge(
+      buffer: NumericHistogram,
+      other: NumericHistogram): NumericHistogram = {
+    buffer.merge(other)
+    buffer
+  }
+
+  override def eval(buffer: NumericHistogram): Any = {
+    if (buffer.getUsedBins < 1) {
+      null
+    } else {
+      val result = (0 until buffer.getUsedBins).map { index =>
+        val coord = buffer.getBin(index)
+        InternalRow.apply(coord.x, coord.y)
+      }
+      new GenericArrayData(result)
+    }
+  }
+
+  override def serialize(obj: NumericHistogram): Array[Byte] = {
+    HistogramNumeric.serializer.serialize(obj)
+  }
+
+  override def deserialize(bytes: Array[Byte]): NumericHistogram = {
+    HistogramNumeric.serializer.deserialize(bytes)
+  }
+
+  override def left: Expression = child
+
+  override def right: Expression = nBins
+
+  override protected def withNewChildrenInternal(
+      newLeft: Expression,
+      newRight: Expression): HistogramNumeric = {
+    copy(child = newLeft, nBins = newRight)
+  }
+
+  override def withNewMutableAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(mutableAggBufferOffset = newOffset)
+
+  override def withNewInputAggBufferOffset(newOffset: Int): HistogramNumeric =
+    copy(inputAggBufferOffset = newOffset)
+
+  override def nullable: Boolean = true
+
+  override def dataType: DataType =
+    ArrayType(new StructType(Array(StructField("x", DoubleType), StructField("y", DoubleType))))

Review comment:
       > how about the nullability of array element and struct field?
   
   They are all nullable, so I use the default values.
   ```
   scala> spark.sql("SELECT histogram_numeric(col, 3) FROM VALUES (null), (null), (null), (null), (1) AS tab(col)").schema
   res0: org.apache.spark.sql.types.StructType = StructType(StructField(histogram_numeric( col, 3),ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true),true))
   ```





[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951522889


   **[Test build #144610 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144610/testReport)** for PR 34380 at commit [`ac8936e`](https://github.com/apache/spark/commit/ac8936e6b3a17f80a4737a838882e3a3d3ec53c2).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952678769


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49120/
   



[GitHub] [spark] cloud-fan commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736550665



##########
File path: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
##########
@@ -969,7 +969,6 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
     "type_cast_1",
     "type_widening",
     "udaf_collect_set",
-    "udaf_histogram_numeric",

Review comment:
       Can we comment it out with a reason?





[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952818869


   **[Test build #144648 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144648/testReport)** for PR 34380 at commit [`c7a8c4b`](https://github.com/apache/spark/commit/c7a8c4b4c21fb113079863e72da4c489ecc05700).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952820647


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144648/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952871411


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144650/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951520552


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49076/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951974145


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144618/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951819234


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49090/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951600372


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144605/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952645961


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49118/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951522889


   **[Test build #144610 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144610/testReport)** for PR 34380 at commit [`ac8936e`](https://github.com/apache/spark/commit/ac8936e6b3a17f80a4737a838882e3a3d3ec53c2).



[GitHub] [spark] cloud-fan closed pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #34380:
URL: https://github.com/apache/spark/pull/34380


   



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950789015


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49056/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950801606


   **[Test build #144586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144586/testReport)** for PR 34380 at commit [`d4c607e`](https://github.com/apache/spark/commit/d4c607ee3fa3845941897ae07cd39bca92fdaee8).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950908116


   **[Test build #144590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144590/testReport)** for PR 34380 at commit [`e2ad833`](https://github.com/apache/spark/commit/e2ad83346164a4256d5b17c2311fa1a291e23c18).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951016811







[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952578256


   **[Test build #144647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144647/testReport)** for PR 34380 at commit [`8d4504d`](https://github.com/apache/spark/commit/8d4504d33a50dc0e2f66321236371116dd6f22bc).



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952607864


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49118/
   



[GitHub] [spark] SparkQA commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952638858


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49120/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952702317


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49120/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952580017


   **[Test build #144648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144648/testReport)** for PR 34380 at commit [`c7a8c4b`](https://github.com/apache/spark/commit/c7a8c4b4c21fb113079863e72da4c489ecc05700).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952857376


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144647/
   



[GitHub] [spark] cloud-fan commented on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952840631


   thanks, merging to master!



[GitHub] [spark] AmplabJenkins removed a comment on pull request #34380: [SPARK-16280][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-952857376


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144647/
   



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951663716


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144610/
   



[GitHub] [spark] AngersZhuuuu commented on pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951640414


   ping @cloud-fan 



[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34380: [SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AngersZhuuuu commented on a change in pull request #34380:
URL: https://github.com/apache/spark/pull/34380#discussion_r736593865



##########
File path: sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java
##########
@@ -0,0 +1,281 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.util;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Random;
+
+
+/**
+ * A generic, re-usable histogram class that supports partial aggregations.
+ * The algorithm is a heuristic adapted from the following paper:
+ * Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
+ * J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
+ * guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
+ * of histogram bins.
+ *
+ * Adapted from Hive's NumericHistogram. Can refer to

Review comment:
       How about current?
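
For readers without the cited paper at hand, the heuristic named in the Javadoc above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `NumericHistogram` code in this PR: the class name `SketchHistogram` and all of its methods are hypothetical, and it simply merges the two closest bins whenever the bin budget is exceeded, which is the general idea of the Ben-Haim and Tom-Tov approach rather than a port of the Hive implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a streaming histogram; names are illustrative only. */
final class SketchHistogram {
    /** One bin: a centroid x and the number of values y merged into it. */
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>();

    SketchHistogram(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    /** Adds one value as a unit-weight bin, then shrinks back to maxBins. */
    void add(double v) {
        bins.add(new Bin(v, 1.0));
        trim();
    }

    /** Partial aggregation: absorb a copy of another histogram's bins, then shrink. */
    void merge(SketchHistogram other) {
        for (Bin b : other.bins) {
            bins.add(new Bin(b.x, b.y));
        }
        trim();
    }

    /** Repeatedly merges the two adjacent bins with the smallest centroid gap. */
    private void trim() {
        bins.sort(Comparator.comparingDouble((Bin b) -> b.x));
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).x - bins.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin left = bins.get(best);
            Bin right = bins.remove(best + 1);
            // The weighted average keeps the merged centroid between its two
            // sources, so the list stays sorted without re-sorting.
            left.x = (left.x * left.y + right.x * right.y) / (left.y + right.y);
            left.y += right.y;
        }
    }

    int size() { return bins.size(); }

    double centroid(int i) { return bins.get(i).x; }

    double count(int i) { return bins.get(i).y; }

    double totalCount() {
        double total = 0.0;
        for (Bin b : bins) {
            total += b.y;
        }
        return total;
    }
}
```

Because `merge` only needs the other side's bins, each partition can build its own histogram and the results can be combined afterwards, which is what makes this kind of structure suitable for partial aggregation.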





[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951073017


   **[Test build #144593 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144593/testReport)** for PR 34380 at commit [`20c05de`](https://github.com/apache/spark/commit/20c05de766e15f856217c676b9b367a9e68a137e).



[GitHub] [spark] SparkQA commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951599673


   **[Test build #144605 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144605/testReport)** for PR 34380 at commit [`20c05de`](https://github.com/apache/spark/commit/20c05de766e15f856217c676b9b367a9e68a137e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951492159


   **[Test build #144605 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144605/testReport)** for PR 34380 at commit [`20c05de`](https://github.com/apache/spark/commit/20c05de766e15f856217c676b9b367a9e68a137e).



[GitHub] [spark] AmplabJenkins commented on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-951600372


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144605/
   



[GitHub] [spark] SparkQA removed a comment on pull request #34380: [WIP][SPARK-37082][SQL] Implements histogram_numeric aggregation function which supports partial aggregation.

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34380:
URL: https://github.com/apache/spark/pull/34380#issuecomment-950754653


   **[Test build #144585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144585/testReport)** for PR 34380 at commit [`541d9cd`](https://github.com/apache/spark/commit/541d9cd64ecff490d474a5396957442d01729608).

