Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/12 08:16:23 UTC

[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19479

    [SPARK-17074] [SQL] Generate equi-height histogram in column statistics

    ## What changes were proposed in this pull request?
    
    An equi-height histogram is effective for cardinality estimation, and it is more accurate than basic column stats (min, max, ndv, etc.), especially for skewed distributions, so we should support it.
    
    In an equi-height histogram, all buckets (intervals) have the same height, i.e. roughly the same number of rows.
    In this PR, we use a two-step method to generate an equi-height histogram:
    1. use `ApproximatePercentile` to get percentiles `p(1/n), p(2/n) ... p((n-1)/n)`;
    2. use min, max, and the percentiles to construct the bucket ranges, e.g. `[min, p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), max]`, and then use `ApproxCountDistinctForIntervals` to count the ndv in each bucket. Each bucket is of the form `(lowerBound, higherBound, ndv)`.
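    
    The sketch below illustrates step 2 in isolation, assuming the percentile endpoints from step 1 and the per-interval ndvs are already computed; the names `Bucket` and `buildEquiHeightBuckets` are illustrative only and are not the classes added by this PR.
    
    ```scala
    // Build equi-height buckets from precomputed percentile endpoints
    // (min, p(1/n), ..., p((n-1)/n), max) and the approximate ndv per interval.
    case class Bucket(lowerBound: Double, higherBound: Double, ndv: Long)
    
    def buildEquiHeightBuckets(endpoints: Array[Double], ndvs: Array[Long]): Array[Bucket] = {
      require(endpoints.length == ndvs.length + 1, "need one more endpoint than intervals")
      ndvs.zipWithIndex.map { case (ndv, i) =>
        Bucket(endpoints(i), endpoints(i + 1), ndv)
      }
    }
    
    // Example: 4 buckets over values in [1, 100] using precomputed quartiles.
    val buckets = buildEquiHeightBuckets(
      endpoints = Array(1.0, 25.0, 50.0, 75.0, 100.0),
      ndvs = Array(20L, 23L, 19L, 25L))
    ```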
    
    ## How was this patch tested?
    
    Added new test cases and modified some existing test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark generate_histogram

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19479.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19479
    
----
commit 54c678cd4903e0a8036fca57ed31712402f6d71e
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-11T06:28:00Z

    generate equi-height histogram

commit 31a852affc7f359dae01e6a893cffec4caf1235f
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-12T06:35:16Z

    add/modify tests

----


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149058921
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,118 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.length == ndvs.length + 1)
    +      val endpoints = percentiles.get.map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * This is because we will have to deal with computation between different types of histograms in
    + * some cases, e.g. for join columns.
    + */
    +trait Histogram
    +
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Array[EquiHeightBucket])
    +  extends Histogram {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherEHH: EquiHeightHistogram =>
    +      height == otherEHH.height && ehBuckets.sameElements(otherEHH.ehBuckets)
    +    case _ => false
    +  }
    +
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  // A flag to indicate the type of histogram
    +  val EQUI_HEIGHT_HISTOGRAM_TYPE: Byte = 1
    +
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore (as
    +   * string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = histogram match {
    +    case h: EquiHeightHistogram =>
    +      // type + numBuckets + height + numBuckets * (lo + hi + ndv)
    --- End diff --
    
    What's the typical size of `numBuckets`? If it can be large, we may need to consider compression.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147668338
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -177,13 +180,12 @@ object ColumnStat extends Logging {
           Some(ColumnStat(
             distinctCount = BigInt(map(KEY_DISTINCT_COUNT).toLong),
             // Note that flatMap(Option.apply) turns Option(null) into None.
    -        min = map.get(KEY_MIN_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    -        max = map.get(KEY_MAX_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    +        min = map.get(KEY_MIN_VALUE).map(fromString(_, field.name, field.dataType)),
    +        max = map.get(KEY_MAX_VALUE).map(fromString(_, field.name, field.dataType)),
    --- End diff --
    
    why remove `flatMap(Option.apply)`? `fromString` may return null.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149678221
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,98 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.length == ndvs.length + 1)
    +      val endpoints = percentiles.get.map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * In the context of Spark SQL statistics, we may use "histogram" to denote "equi-height histogram"
    + * for simplicity.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, buckets: Array[EquiHeightBucket]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherEHH: EquiHeightHistogram =>
    +      height == otherEHH.height && buckets.sameElements(otherEHH.buckets)
    +    case _ => false
       }
     
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore
    +   * (string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, a histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: EquiHeightHistogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.buckets.length)
    +    var i = 0
    +    while (i < histogram.buckets.length) {
    +      val bucket = histogram.buckets(i)
    +      out.writeDouble(bucket.lo)
    --- End diff --
    
    Yes, after some tests, with this layout the compression ratio increased from about 25% (buckets with different ndvs) to about 50% (all buckets having the same ndv), though the code is less elegant...
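    
    For reference, here is a minimal sketch of what a column-wise layout could look like (my own illustration, assuming that is what "with this layout" refers to; `Bucket` and `serializeColumnWise` are hypothetical names, not the PR's classes):
    
    ```scala
    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import net.jpountz.lz4.LZ4BlockOutputStream
    
    case class Bucket(lo: Double, hi: Double, ndv: Long)
    
    // Writing the bucket fields column by column (all lo, then all hi, then all
    // ndv) keeps similar values adjacent, which tends to compress better than
    // writing (lo, hi, ndv) row by row.
    def serializeColumnWise(height: Double, buckets: Array[Bucket]): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
      out.writeDouble(height)
      out.writeInt(buckets.length)
      buckets.foreach(b => out.writeDouble(b.lo))
      buckets.foreach(b => out.writeDouble(b.hi))
      buckets.foreach(b => out.writeLong(b.ndv))
      out.close()
      bos.toByteArray
    }
    ```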


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147946445
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        if (k == ColumnStat.KEY_HISTOGRAM) {
    --- End diff --
    
    don't special-case histogram, we should do it for all stats.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83477/testReport)** for PR 19479 at commit [`7f77516`](https://github.com/apache/spark/commit/7f775165e3ecbc1f8949b02d92d09afa70624d44).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class EquiHeightHistogram(height: Double, ehBuckets: Array[EquiHeightBucket])`


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147672869
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
    @@ -89,19 +93,159 @@ case class AnalyzeColumnCommand(
         // The first element in the result will be the overall row count, the following elements
         // will be structs containing all column stats.
         // The layout of each struct follows the layout of the ColumnStats.
    -    val ndvMaxErr = sparkSession.sessionState.conf.ndvMaxError
         val expressions = Count(Literal(1)).toAggregateExpression() +:
    -      attributesToAnalyze.map(ColumnStat.statExprs(_, ndvMaxErr))
    +      attributesToAnalyze.map(statExprs(_, sparkSession.sessionState.conf))
     
         val namedExpressions = expressions.map(e => Alias(e, e.toString)())
         val statsRow = new QueryExecution(sparkSession, Aggregate(Nil, namedExpressions, relation))
           .executedPlan.executeTake(1).head
     
         val rowCount = statsRow.getLong(0)
    -    val columnStats = attributesToAnalyze.zipWithIndex.map { case (attr, i) =>
    -      // according to `ColumnStat.statExprs`, the stats struct always have 6 fields.
    -      (attr.name, ColumnStat.rowToColumnStat(statsRow.getStruct(i + 1, 6), attr))
    -    }.toMap
    -    (rowCount, columnStats)
    +    val colStats = rowToColumnStats(sparkSession, relation, attributesToAnalyze, statsRow, rowCount)
    +    (rowCount, colStats)
    +  }
    +
    +  /**
    +   * Constructs an expression to compute column statistics for a given column.
    +   *
    +   * The expression should create a single struct column with the following schema:
    +   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long,
    +   * percentiles: Array[T]
    +   *
    +   * Together with [[rowToColumnStats]], this function is used to create [[ColumnStat]] and
    +   * as a result should stay in sync with it.
    +   */
    +  private def statExprs(col: Attribute, conf: SQLConf): CreateNamedStruct = {
    +    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    +      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    +    })
    +    val one = Literal(1, LongType)
    +
    +    // the approximate ndv (num distinct value) should never be larger than the number of rows
    +    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    +    val ndv = Least(Seq(HyperLogLogPlusPlus(col, conf.ndvMaxError), numNonNulls))
    +    val numNulls = Subtract(Count(one), numNonNulls)
    +    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +    val nullArray = Literal(null, ArrayType(DoubleType))
    +
    +    def fixedLenTypeExprs(castType: DataType) = {
    +      // For fixed width types, avg size should be the same as max size.
    +      Seq(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    +        defaultSize)
    +    }
    +
    +    def fixedLenTypeStruct(castType: DataType, genHistogram: Boolean) = {
    +      val percentileExpr = if (genHistogram) {
    +        // To generate equi-height histogram, we need to:
    +        // 1. get percentiles p(1/n), p(2/n) ... p((n-1)/n),
    +        // 2. use min, max, and percentiles as range values of buckets, e.g. [min, p(1/n)],
    +        // [p(1/n), p(2/n)] ... [p((n-1)/n), max], and then count ndv in each bucket.
    +        // Step 2 will be performed in `rowToColumnStats`.
    --- End diff --
    
    This is hacky; can we explicitly calculate the percentiles at the beginning?
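    
    For illustration, one way to compute the percentiles explicitly in a separate first pass is `DataFrameStatFunctions.approxQuantile`; the sketch below only shows that idea (the helper name and parameters are made up) and does not claim to be how the PR restructured the command.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // First pass: compute (numBuckets - 1) percentiles plus min/max for a column,
    // which can then be passed to the stats aggregation as fixed bucket endpoints.
    def computeEndpoints(
        spark: SparkSession,
        tableName: String,
        colName: String,
        numBuckets: Int,
        relativeError: Double): Array[Double] = {
      val df = spark.table(tableName)
      val probabilities = (1 until numBuckets).map(_.toDouble / numBuckets).toArray
      val percentiles = df.stat.approxQuantile(colName, probabilities, relativeError)
      val Array(min, max) = df.stat.approxQuantile(colName, Array(0.0, 1.0), relativeError)
      (min +: percentiles) :+ max
    }
    ```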


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r145906524
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * Because we will have to deal with computation between different types of histograms in some
    --- End diff --
    
    Thanks, I'll improve the doc


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r148460568
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
    @@ -89,19 +93,158 @@ case class AnalyzeColumnCommand(
         // The first element in the result will be the overall row count, the following elements
         // will be structs containing all column stats.
         // The layout of each struct follows the layout of the ColumnStats.
    -    val ndvMaxErr = sparkSession.sessionState.conf.ndvMaxError
         val expressions = Count(Literal(1)).toAggregateExpression() +:
    -      attributesToAnalyze.map(ColumnStat.statExprs(_, ndvMaxErr))
    +      attributesToAnalyze.map(statExprs(_, sparkSession.sessionState.conf))
    --- End diff --
    
    yea that'll make the logic clearer, thanks!


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83599/
    Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    retest this please


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    LGTM overall. We should also consider how to show the histogram in ANALYZE COLUMN, for debugging purposes.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r148724807
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +327,64 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.length == ndvs.length + 1)
    +      val endpoints = percentiles.get.map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
       }
     
     }
    +
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * This is because we will have to deal with computation between different types of histograms in
    + * some cases, e.g. for join columns.
    + */
    +trait Histogram
    +
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram {
    --- End diff --
    
    Please declare `ehBuckets: Array[EquiHeightBucket]` instead of `ehBuckets: Seq[EquiHeightBucket]`. We need to access buckets directly and randomly, and for random access a Scala Array performs better because it supports constant-time indexed element access.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83623/testReport)** for PR 19479 at commit [`a96169e`](https://github.com/apache/spark/commit/a96169eac41db1ba2db9d9211d0c301012c4c409).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Histogram(height: Double, bins: Array[HistogramBin]) `
      * `case class HistogramBin(lo: Double, hi: Double, ndv: Long)`


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83614/
    Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82801 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82801/testReport)** for PR 19479 at commit [`83424e4`](https://github.com/apache/spark/commit/83424e4215bf1d15cb558463c5e90f92e2146138).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    retest this please


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r148232731
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +219,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * This is because we will have to deal with computation between different types of histograms in
    + * some cases, e.g. for join columns.
    + */
    +trait Histogram
     
    -    col.dataType match {
    -      case dt: IntegralType => fixedLenTypeStruct(dt)
    -      case _: DecimalType => fixedLenTypeStruct(col.dataType)
    -      case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt)
    -      case BooleanType => fixedLenTypeStruct(col.dataType)
    -      case DateType => fixedLenTypeStruct(col.dataType)
    -      case TimestampType => fixedLenTypeStruct(col.dataType)
    -      case BinaryType | StringType =>
    -        // For string and binary type, we don't store min/max.
    -        val nullLit = Literal(null, col.dataType)
    -        struct(
    -          ndv, nullLit, nullLit, numNulls,
    -          // Set avg/max size to default size if all the values are null or there is no value.
    -          Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
    -          Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize)))
    -      case _ =>
    -        throw new AnalysisException("Analyzing column statistics is not supported for column " +
    -            s"${col.name} of data type: ${col.dataType}.")
    -    }
    -  }
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram {
    --- End diff --
    
    can we convert histogram to multiple kv entries? e.g.
    ```
    histogram.height=1
    histogram.bucket1=(x,x,x)
    histogram.bucket2=(y,y,y)
    ...
    ```


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by ala <gi...@git.apache.org>.
Github user ala commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    cc @bogdanrdc 


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83284/testReport)** for PR 19479 at commit [`a0010b0`](https://github.com/apache/spark/commit/a0010b03cc792b03356f18499f02afd85acfd37a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149851930
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1024,21 +1024,36 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           stats: CatalogStatistics,
           schema: StructType): Map[String, String] = {
     
    -    var statsProperties: Map[String, String] =
    -      Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
    +    val statsProperties = new mutable.HashMap[String, String]()
    +    statsProperties += STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString()
         if (stats.rowCount.isDefined) {
           statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
         }
     
    +    // In Hive metastore, the length of value in table properties cannot be larger than 4000.
    +    // We need to split the key-value pair into multiple key-value properties if the length of
    +    // value exceeds this threshold.
    +    val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
    --- End diff --
    
    Do we still need this hack? I don't think the histogram string can hit this limitation; creating too many buckets is nonsense.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83626/testReport)** for PR 19479 at commit [`72c46f8`](https://github.com/apache/spark/commit/72c46f844967039ec2009de6cd93b9733ab1e8b8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147886882
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -177,13 +180,12 @@ object ColumnStat extends Logging {
           Some(ColumnStat(
             distinctCount = BigInt(map(KEY_DISTINCT_COUNT).toLong),
             // Note that flatMap(Option.apply) turns Option(null) into None.
    -        min = map.get(KEY_MIN_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    -        max = map.get(KEY_MAX_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    +        min = map.get(KEY_MIN_VALUE).map(fromString(_, field.name, field.dataType)),
    +        max = map.get(KEY_MAX_VALUE).map(fromString(_, field.name, field.dataType)),
    --- End diff --
    
    Yea, but I tend to revert the change, because keeping `flatMap(Option.apply)` is more robust.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83741/testReport)** for PR 19479 at commit [`7b6d211`](https://github.com/apache/spark/commit/7b6d211f4f082415cc182de9e34f7420d65e94eb).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82931/testReport)** for PR 19479 at commit [`6fe9985`](https://github.com/apache/spark/commit/6fe9985872c93b5dfa9972300ba3f59e97834d4c).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    LGTM


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149849510
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,122 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    --- End diff --
    
    Is it safe to cast a decimal to double and use it as a bucket boundary?
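    
    As a quick, standalone illustration of the concern (my own example, not from the PR), a decimal with more significant digits than a `Double` can hold loses precision when cast, which could shift a bucket boundary slightly:
    
    ```scala
    // 19 significant digits: more than a Double (~15-17 digits) can represent.
    val d = BigDecimal("12345678901234567.89")
    val asDouble = d.toDouble            // the fractional part and even the last
                                         // integer digits are no longer exact
    val roundTrip = BigDecimal(asDouble)
    println(d == roundTrip)              // false: the boundary value has drifted
    ```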


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83645 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83645/testReport)** for PR 19479 at commit [`8af3868`](https://github.com/apache/spark/commit/8af38687d638ae2d94d9f76955b182df02404cce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r145906444
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * Because we will have to deal with computation between different types of histograms in some
    + * cases, e.g. for join columns.
    + */
    +trait Histogram
     
    -    col.dataType match {
    -      case dt: IntegralType => fixedLenTypeStruct(dt)
    -      case _: DecimalType => fixedLenTypeStruct(col.dataType)
    -      case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt)
    -      case BooleanType => fixedLenTypeStruct(col.dataType)
    -      case DateType => fixedLenTypeStruct(col.dataType)
    -      case TimestampType => fixedLenTypeStruct(col.dataType)
    -      case BinaryType | StringType =>
    -        // For string and binary type, we don't store min/max.
    -        val nullLit = Literal(null, col.dataType)
    -        struct(
    -          ndv, nullLit, nullLit, numNulls,
    -          // Set avg/max size to default size if all the values are null or there is no value.
    -          Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
    -          Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize)))
    -      case _ =>
    -        throw new AnalysisException("Analyzing column statistics is not supported for column " +
    -            s"${col.name} of data type: ${col.dataType}.")
    -    }
    -  }
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram {
     
    -  /** Convert a struct for column stats (defined in statExprs) into [[ColumnStat]]. */
    -  def rowToColumnStat(row: InternalRow, attr: Attribute): ColumnStat = {
    -    ColumnStat(
    -      distinctCount = BigInt(row.getLong(0)),
    -      // for string/binary min/max, get should return null
    -      min = Option(row.get(1, attr.dataType)),
    -      max = Option(row.get(2, attr.dataType)),
    -      nullCount = BigInt(row.getLong(3)),
    -      avgLen = row.getLong(4),
    -      maxLen = row.getLong(5)
    -    )
    +  override def toString: String = {
    +    def bucketString(bucket: EquiHeightBucket): String = {
    +      val sb = new StringBuilder
    +      sb.append("Bucket(")
    +      sb.append(bucket.lo)
    +      sb.append(", ")
    +      sb.append(bucket.hi)
    +      sb.append(", ")
    +      sb.append(bucket.ndv)
    +      sb.append(")")
    +      sb.toString()
    +    }
    +    height + ", " + ehBuckets.map(bucketString).mkString(", ")
       }
    -
     }
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    --- End diff --
    
    Using Long simplifies computation. I think Long is enough for ndv; maybe we can change the BigInt type in `ColumnStat` to Long.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149504772
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -220,29 +239,46 @@ object ColumnStat extends Logging {
        * Constructs an expression to compute column statistics for a given column.
        *
        * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    +   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long,
    +   * distinctCountsForIntervals: Array[Long]
        *
        * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
        * as a result should stay in sync with it.
        */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    +  def statExprs(
    +      col: Attribute,
    +      conf: SQLConf,
    +      colPercentiles: AttributeMap[Array[Any]]): CreateNamedStruct = {
         def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
           expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
         })
         val one = Literal(1, LongType)
     
         // the approximate ndv (num distinct value) should never be larger than the number of rows
         val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    +    val ndv = Least(Seq(HyperLogLogPlusPlus(col, conf.ndvMaxError), numNonNulls))
         val numNulls = Subtract(Count(one), numNonNulls)
         val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +    val nullArray = Literal(null, ArrayType(LongType))
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    +    def fixedLenTypeExprs(castType: DataType) = {
           // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    +      Seq(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
             defaultSize)
         }
     
    +    def fixedLenTypeStruct(dataType: DataType) = {
    +      val genHistogram =
    +        ColumnStat.supportsHistogram(dataType) && colPercentiles.contains(col)
    +      val intervalNdvsExpr = if (genHistogram) {
    +        ApproxCountDistinctForIntervals(col,
    +          CreateArray(colPercentiles(col).map(Literal(_))), conf.ndvMaxError)
    --- End diff --
    
    since we need a catalyst array here, why not let the first job return `ArrayData` directly? i.e., do not call `toArray` in https://github.com/apache/spark/pull/19479/files#diff-027d6bd7c8cf4f64f99acc058389d859R145 , but just collect the `ArrayData`.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150134850
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1034,11 +1034,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        val statKey = columnStatKeyPropName(colName, k)
    +        val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
    +        if (v.length > threshold) {
    +          throw new AnalysisException(s"Cannot persist '$statKey' into hive metastore as " +
    --- End diff --
    
    Hive's exception is not friendly to Spark users. A Spark user may not know what went wrong with their operation:
    ```
    org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) 
    org.datanucleus.exceptions.NucleusDataStoreException: Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) VALUES (?,?,?) 
    ...
    Caused by: java.sql.SQLDataException: A truncation error was encountered trying to shrink VARCHAR 'TFo0QmxvY2smeREAANBdAAALz3IBM0AUAAEAQgPoP/ALAAQUACNAJBAAEy4I&' to length 4000.
    ...
    ```
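    
    A rough, back-of-envelope size estimate (my own arithmetic, not from the thread) shows why a histogram value can exceed a 4000-character limit when stored uncompressed: each bucket serializes to two doubles plus one long, and the bytes are then encoded as a string (the truncated value above starts with the base64 encoding of "LZ4Block"). LZ4 compression is ignored here, so actual sizes should be smaller.
    
    ```scala
    // Uncompressed estimate: header (height: 8 bytes, numBuckets: 4 bytes)
    // plus numBuckets * (lo: 8 + hi: 8 + ndv: 8) bytes, then a ~4/3 base64 blowup.
    def estimateSerializedChars(numBuckets: Int): Int = {
      val rawBytes = 8 + 4 + numBuckets * (8 + 8 + 8)
      math.ceil(rawBytes * 4.0 / 3.0).toInt
    }
    
    // e.g. estimateSerializedChars(254) == 8144, well above 4000 before compression.
    ```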


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83477/testReport)** for PR 19479 at commit [`7f77516`](https://github.com/apache/spark/commit/7f775165e3ecbc1f8949b02d92d09afa70624d44).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r145840719
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * Because we will have to deal with computation between different types of histograms in some
    + * cases, e.g. for join columns.
    + */
    +trait Histogram
     
    -    col.dataType match {
    -      case dt: IntegralType => fixedLenTypeStruct(dt)
    -      case _: DecimalType => fixedLenTypeStruct(col.dataType)
    -      case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt)
    -      case BooleanType => fixedLenTypeStruct(col.dataType)
    -      case DateType => fixedLenTypeStruct(col.dataType)
    -      case TimestampType => fixedLenTypeStruct(col.dataType)
    -      case BinaryType | StringType =>
    -        // For string and binary type, we don't store min/max.
    -        val nullLit = Literal(null, col.dataType)
    -        struct(
    -          ndv, nullLit, nullLit, numNulls,
    -          // Set avg/max size to default size if all the values are null or there is no value.
    -          Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
    -          Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize)))
    -      case _ =>
    -        throw new AnalysisException("Analyzing column statistics is not supported for column " +
    -            s"${col.name} of data type: ${col.dataType}.")
    -    }
    -  }
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram {
     
    -  /** Convert a struct for column stats (defined in statExprs) into [[ColumnStat]]. */
    -  def rowToColumnStat(row: InternalRow, attr: Attribute): ColumnStat = {
    -    ColumnStat(
    -      distinctCount = BigInt(row.getLong(0)),
    -      // for string/binary min/max, get should return null
    -      min = Option(row.get(1, attr.dataType)),
    -      max = Option(row.get(2, attr.dataType)),
    -      nullCount = BigInt(row.getLong(3)),
    -      avgLen = row.getLong(4),
    -      maxLen = row.getLong(5)
    -    )
    +  override def toString: String = {
    +    def bucketString(bucket: EquiHeightBucket): String = {
    +      val sb = new StringBuilder
    +      sb.append("Bucket(")
    +      sb.append(bucket.lo)
    +      sb.append(", ")
    +      sb.append(bucket.hi)
    +      sb.append(", ")
    +      sb.append(bucket.ndv)
    +      sb.append(")")
    +      sb.toString()
    +    }
    +    height + ", " + ehBuckets.map(bucketString).mkString(", ")
       }
    -
     }
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    --- End diff --
    
    The field distinctCount in the ColumnStat class represents the number of distinct values in a table and has type BigInt. Here ndv represents the number of distinct values in a histogram bucket and has type Long. Should we make them the same type?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149860437
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,122 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    --- End diff --
    
    It's for estimation, so I think the accuracy loss is acceptable. Using the double type makes the estimation logic a lot simpler.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83477/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147886758
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -155,6 +156,8 @@ object ColumnStat extends Logging {
       private val KEY_NULL_COUNT = "nullCount"
       private val KEY_AVG_LEN = "avgLen"
       private val KEY_MAX_LEN = "maxLen"
    +  val KEY_HISTOGRAM = "histogram"
    +  val KEY_HISTOGRAM_SEPARATOR = "-"
    --- End diff --
    
    They are used in `HiveExternalCatalog` for the stats/properties conversion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83547/testReport)** for PR 19479 at commit [`fa338dd`](https://github.com/apache/spark/commit/fa338ddcb655f6e421b1be35fdd8dcd5cd866df0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82669/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149864452
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,122 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class Histogram(height: Double, buckets: Array[HistogramBucket]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && buckets.sameElements(otherHgm.buckets)
    +    case _ => false
       }
     
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class HistogramBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore
    +   * (string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, a histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.buckets.length)
    +    // Write data with same type together for compression.
    +    var i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeDouble(histogram.buckets(i).lo)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeDouble(histogram.buckets(i).hi)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeLong(histogram.buckets(i).ndv)
    +      i += 1
    +    }
    +    out.writeInt(-1)
    --- End diff --
    
    It denotes the end of the stream. `SparkPlan.getByteArrayRdd`, `Percentile.serialize`, etc. also write -1 at the end.
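
    For illustration, a minimal sketch of that convention on the reading side (generic length-prefixed chunks, not the histogram format itself; here the reader does not know the element count in advance, so it loops until it sees the -1 marker):
    ```scala
    import java.io.DataInputStream

    // Keep reading length-prefixed chunks until the -1 sentinel shows up.
    def readChunks(in: DataInputStream): Seq[Array[Byte]] = {
      val chunks = Seq.newBuilder[Array[Byte]]
      var length = in.readInt()
      while (length != -1) {
        val bytes = new Array[Byte](length)
        in.readFully(bytes)
        chunks += bytes
        length = in.readInt()
      }
      chunks.result()
    }
    ```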


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by ron8hu <gi...@git.apache.org>.
Github user ron8hu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r145828713
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * Because we will have to deal with computation between different types of histograms in some
    --- End diff --
    
    This is because we will have to ........


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Ah, sorry for the typo; yes, DESC COLUMN.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82669/testReport)** for PR 19479 at commit [`31a852a`](https://github.com/apache/spark/commit/31a852affc7f359dae01e6a893cffec4caf1235f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150391234
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +313,127 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val bins = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBin(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, bins)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of bins.
    + * Each bin has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bin
    --- End diff --
    
    nit: one blank line before this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149344559
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,118 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.length == ndvs.length + 1)
    +      val endpoints = percentiles.get.map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * This is because we will have to deal with computation between different types of histograms in
    + * some cases, e.g. for join columns.
    + */
    +trait Histogram
    +
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Array[EquiHeightBucket])
    +  extends Histogram {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherEHH: EquiHeightHistogram =>
    +      height == otherEHH.height && ehBuckets.sameElements(otherEHH.ehBuckets)
    +    case _ => false
    +  }
    +
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  // A flag to indicate the type of histogram
    +  val EQUI_HEIGHT_HISTOGRAM_TYPE: Byte = 1
    +
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore (as
    +   * string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = histogram match {
    +    case h: EquiHeightHistogram =>
    +      // type + numBuckets + height + numBuckets * (lo + hi + ndv)
    --- End diff --
    
    Thanks for the advice. The default number of buckets is 254. Tests showed that after compression, the serialized length is reduced by more than 50%.
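
    As a rough way to sanity-check the size with the default 254 buckets (the bucket values below are made up; this just exercises the `EquiHeightHistogram`, `EquiHeightBucket`, and `HistogramSerializer` classes as defined in this revision of the PR):
    ```scala
    // Build a 254-bucket histogram with arbitrary bounds and print the length of
    // the serialized (LZ4-compressed, Base64-encoded) string stored in the metastore.
    val buckets = Array.tabulate(254)(i => EquiHeightBucket(i.toDouble, (i + 1).toDouble, 100L))
    val histogram = EquiHeightHistogram(1000.0, buckets)
    println(s"serialized length: ${HistogramSerializer.serialize(histogram).length}")
    ```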


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83626/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83547/testReport)** for PR 19479 at commit [`fa338dd`](https://github.com/apache/spark/commit/fa338ddcb655f6e421b1be35fdd8dcd5cd866df0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r148234308
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
    @@ -89,19 +93,158 @@ case class AnalyzeColumnCommand(
         // The first element in the result will be the overall row count, the following elements
         // will be structs containing all column stats.
         // The layout of each struct follows the layout of the ColumnStats.
    -    val ndvMaxErr = sparkSession.sessionState.conf.ndvMaxError
         val expressions = Count(Literal(1)).toAggregateExpression() +:
    -      attributesToAnalyze.map(ColumnStat.statExprs(_, ndvMaxErr))
    +      attributesToAnalyze.map(statExprs(_, sparkSession.sessionState.conf))
    --- End diff --
    
    My feeling is that we should run a job before calling `statExprs`, because `statExprs` needs some extra information about the buckets. I think that's better than hiding this job deep in `rowToColumnStats`.
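
    A hedged sketch of that flow inside `AnalyzeColumnCommand` (the `computePercentiles` helper is a name assumed here for illustration; the other names come from the diff):
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{AttributeMap, Literal}
    import org.apache.spark.sql.catalyst.expressions.aggregate.Count
    import org.apache.spark.sql.catalyst.util.ArrayData

    // Step 1: run a separate aggregation job first (e.g. ApproximatePercentile)
    // to get the percentiles per column that the histogram buckets need.
    val colPercentiles: AttributeMap[ArrayData] =
      computePercentiles(attributesToAnalyze, sparkSession, relation)

    // Step 2: the main stats job receives those percentiles explicitly, so
    // statExprs can add ApproxCountDistinctForIntervals where applicable.
    val expressions = Count(Literal(1)).toAggregateExpression() +:
      attributesToAnalyze.map(statExprs(_, sparkSession.sessionState.conf, colPercentiles))
    ```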


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83623/testReport)** for PR 19479 at commit [`a96169e`](https://github.com/apache/spark/commit/a96169eac41db1ba2db9d9211d0c301012c4c409).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83726 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83726/testReport)** for PR 19479 at commit [`976acbe`](https://github.com/apache/spark/commit/976acbe501227735ad54725eaa8b9dad77b6b618).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150192871
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1034,11 +1034,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        val statKey = columnStatKeyPropName(colName, k)
    +        val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
    +        if (v.length > threshold) {
    +          throw new AnalysisException(s"Cannot persist '$statKey' into hive metastore as " +
    --- End diff --
    
    Can we catch it with a try/catch in `HiveExternalCatalog.withClient`? That would also help if users set a very long table property.
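
    A hedged sketch of what catching it centrally could look like (the exception matching and the message text are assumptions, not the actual change):
    ```scala
    import scala.util.control.NonFatal
    import org.apache.spark.sql.AnalysisException

    // Wrap every metastore call in one place so that an oversized table property
    // (statistics or user-set) surfaces as a single readable AnalysisException
    // instead of a nested datanucleus/JDBC stack trace.
    private def withClient[T](body: => T): T = {
      try {
        body
      } catch {
        case NonFatal(e) =>
          throw new AnalysisException(
            "Hive metastore operation failed, possibly because a table property " +
              "(such as serialized column statistics) exceeds the metastore's length limit: " +
              e.getMessage)
      }
    }
    ```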


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82669/testReport)** for PR 19479 at commit [`31a852a`](https://github.com/apache/spark/commit/31a852affc7f359dae01e6a893cffec4caf1235f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83623/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83284/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149850233
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,122 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class Histogram(height: Double, buckets: Array[HistogramBucket]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && buckets.sameElements(otherHgm.buckets)
    +    case _ => false
       }
     
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class HistogramBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore
    +   * (string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, a histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.buckets.length)
    +    // Write data with same type together for compression.
    +    var i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeDouble(histogram.buckets(i).lo)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeDouble(histogram.buckets(i).hi)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.buckets.length) {
    +      out.writeLong(histogram.buckets(i).ndv)
    +      i += 1
    +    }
    +    out.writeInt(-1)
    --- End diff --
    
    why a `-1` at the end?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83545 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83545/testReport)** for PR 19479 at commit [`804b375`](https://github.com/apache/spark/commit/804b37565b2f5d61edd492d415475a59afec41f5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83741/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149848909
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -220,29 +239,46 @@ object ColumnStat extends Logging {
        * Constructs an expression to compute column statistics for a given column.
        *
        * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    +   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long,
    +   * distinctCountsForIntervals: Array[Long]
        *
        * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
        * as a result should stay in sync with it.
        */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    +  def statExprs(
    +      col: Attribute,
    +      conf: SQLConf,
    +      colPercentiles: AttributeMap[ArrayData]): CreateNamedStruct = {
         def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
           expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
         })
         val one = Literal(1, LongType)
     
         // the approximate ndv (num distinct value) should never be larger than the number of rows
         val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    +    val ndv = Least(Seq(HyperLogLogPlusPlus(col, conf.ndvMaxError), numNonNulls))
         val numNulls = Subtract(Count(one), numNonNulls)
         val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +    val nullArray = Literal(null, ArrayType(LongType))
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    +    def fixedLenTypeExprs(castType: DataType) = {
    --- End diff --
    
    Looks like we can inline this method.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150391475
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +313,127 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val bins = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBin(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, bins)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of bins.
    + * Each bin has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bin
    + * @param bins equi-height histogram bins
    + */
    +case class Histogram(height: Double, bins: Array[HistogramBin]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && bins.sameElements(otherHgm.bins)
    +    case _ => false
    +  }
    +
    +  override def hashCode(): Int = {
    +    val temp = java.lang.Double.doubleToLongBits(height)
    +    var result = (temp ^ (temp >>> 32)).toInt
    +    result = 31 * result + java.util.Arrays.hashCode(bins.asInstanceOf[Array[AnyRef]])
    +    result
    +  }
    +}
    +
    +/**
    + * A bin in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bin
    + * @param hi higher bound of the value range in this bin
    + * @param ndv approximate number of distinct values in this bin
    + */
    +case class HistogramBin(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore
    +   * (string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, a histogram usually cannot fit in a single,
    +   * self-described property. And for count-min-sketch, it's essentially unnatural to make it
    +   * a readable string.
    +   */
    +  final def serialize(histogram: Histogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.bins.length)
    +    // Write data with same type together for compression.
    +    var i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeDouble(histogram.bins(i).lo)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeDouble(histogram.bins(i).hi)
    +      i += 1
    +    }
    +    i = 0
    +    while (i < histogram.bins.length) {
    +      out.writeLong(histogram.bins(i).ndv)
    +      i += 1
    +    }
    +    out.writeInt(-1)
    +    out.flush()
    +    out.close()
    +
    +    org.apache.commons.codec.binary.Base64.encodeBase64String(bos.toByteArray)
    --- End diff --
    
    cc @rxin , the histogram is not very human-readable anyway, and it wastes a lot of space if we use a plain string to represent it, which may easily hit the Hive metastore limitation (4K value length for a table property). Here we pick a stable encoding (bin1_low, bin2_low, ... bin1_high, bin2_high, ... bin1_ndv, bin2_ndv, ...) and turn the binary into a Base64 string to save space.

    As long as we don't change the histogram implementation, this approach won't have backward compatibility issues. If we do want to change the implementation, we can treat it as a new statistic.
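
    To make the stable encoding concrete, a minimal sketch of the matching read path (it mirrors the write order in the diff above and relies on the stored bin count rather than the trailing -1 marker; this is illustrative, not necessarily the PR's final `deserialize`):
    ```scala
    import java.io.{ByteArrayInputStream, DataInputStream}
    import net.jpountz.lz4.LZ4BlockInputStream
    import org.apache.commons.codec.binary.Base64

    // Read back: height, bin count, then all lows, all highs, all ndvs,
    // in exactly the order they were written.
    def deserializeSketch(s: String): (Double, Array[Double], Array[Double], Array[Long]) = {
      val in = new DataInputStream(
        new LZ4BlockInputStream(new ByteArrayInputStream(Base64.decodeBase64(s))))
      try {
        val height = in.readDouble()
        val numBins = in.readInt()
        val los = Array.fill(numBins)(in.readDouble())
        val his = Array.fill(numBins)(in.readDouble())
        val ndvs = Array.fill(numBins)(in.readLong())
        (height, los, his, ndvs)
      } finally {
        in.close()
      }
    }
    ```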


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149503350
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,98 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.length == ndvs.length + 1)
    +      val endpoints = percentiles.get.map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * In the context of Spark SQL statistics, we may use "histogram" to denote "equi-height histogram"
    + * for simplicity.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, buckets: Array[EquiHeightBucket]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherEHH: EquiHeightHistogram =>
    +      height == otherEHH.height && buckets.sameElements(otherEHH.buckets)
    +    case _ => false
       }
     
    +  override def hashCode(): Int = super.hashCode()
    +}
    +
    +/**
    + * A bucket in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bucket
    + * @param hi higher bound of the value range in this bucket
    + * @param ndv approximate number of distinct values in this bucket
    + */
    +case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)
    +
    +object HistogramSerializer {
    +  /**
    +   * Serializes a given histogram to a string. For advanced statistics like histograms, sketches,
    +   * etc, we don't provide readability for their serialized formats in metastore
    +   * (string-to-string table properties). This is because it's hard or unnatural for these
    +   * statistics to be human readable. For example, a histogram is probably split into multiple
    +   * key-value properties, instead of a single, self-described property. And for
    +   * count-min-sketch, it's essentially unnatural to make it a readable string.
    +   */
    +  final def serialize(histogram: EquiHeightHistogram): String = {
    +    val bos = new ByteArrayOutputStream()
    +    val out = new DataOutputStream(new LZ4BlockOutputStream(bos))
    +    out.writeDouble(histogram.height)
    +    out.writeInt(histogram.buckets.length)
    +    var i = 0
    +    while (i < histogram.buckets.length) {
    +      val bucket = histogram.buckets(i)
    +      out.writeDouble(bucket.lo)
    --- End diff --
    
    nit: shall we write all `bucket.lo` values first, then `bucket.hi`, and finally `bucket.ndv`? Putting values of the same type together might be better for compression, but I might be wrong. Can you give it a try? Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83545/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83735 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83735/testReport)** for PR 19479 at commit [`7b6d211`](https://github.com/apache/spark/commit/7b6d211f4f082415cc182de9e34f7420d65e94eb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83545 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83545/testReport)** for PR 19479 at commit [`804b375`](https://github.com/apache/spark/commit/804b37565b2f5d61edd492d415475a59afec41f5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class EquiHeightHistogram(height: Double, buckets: Array[EquiHeightBucket]) `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19479


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149341693
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    --- End diff --
    
    OK. Removed the trait `Histogram`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82931/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83284/testReport)** for PR 19479 at commit [`a0010b0`](https://github.com/apache/spark/commit/a0010b03cc792b03356f18499f02afd85acfd37a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149803550
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,123 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        EquiHeightBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val ehHistogram = EquiHeightHistogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(ehHistogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * In the context of Spark SQL statistics, we may use "histogram" to denote "equi-height histogram"
    + * for simplicity.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, buckets: Array[EquiHeightBucket]) {
    --- End diff --
    
    How about just calling it `Histogram` and noting in the documentation that it's equi-height?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83547/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83614 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83614/testReport)** for PR 19479 at commit [`72840a9`](https://github.com/apache/spark/commit/72840a93e0890919827d33415d1ab4e0e0ccc1f4).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147669325
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -177,13 +180,12 @@ object ColumnStat extends Logging {
           Some(ColumnStat(
             distinctCount = BigInt(map(KEY_DISTINCT_COUNT).toLong),
             // Note that flatMap(Option.apply) turns Option(null) into None.
    -        min = map.get(KEY_MIN_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    -        max = map.get(KEY_MAX_VALUE)
    -          .map(fromExternalString(_, field.name, field.dataType)).flatMap(Option.apply),
    +        min = map.get(KEY_MIN_VALUE).map(fromString(_, field.name, field.dataType)),
    +        max = map.get(KEY_MAX_VALUE).map(fromString(_, field.name, field.dataType)),
    --- End diff --
    
    Oh, then `KEY_MAX_VALUE` will not exist in the map...


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83614 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83614/testReport)** for PR 19479 at commit [`72840a9`](https://github.com/apache/spark/commit/72840a93e0890919827d33415d1ab4e0e0ccc1f4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Histogram(height: Double, buckets: Array[HistogramBucket]) `
      * `case class HistogramBucket(lo: Double, hi: Double, ndv: Long)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83741/testReport)** for PR 19479 at commit [`7b6d211`](https://github.com/apache/spark/commit/7b6d211f4f082415cc182de9e34f7420d65e94eb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    > We should also consider how to show the histogram in ANALYZE COLUMN, for debug purpose.
    
    Do you mean showing the histogram in the DESC COLUMN command?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82801 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82801/testReport)** for PR 19479 at commit [`83424e4`](https://github.com/apache/spark/commit/83424e4215bf1d15cb558463c5e90f92e2146138).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147946241
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        if (k == ColumnStat.KEY_HISTOGRAM) {
    +          // In Hive metastore, the length of value in table properties cannot be larger than 4000,
    +          // so we need to split histogram into multiple key-value properties if it's too long.
    +          val maxValueLen = 4000
    --- End diff --
    
    use `SCHEMA_STRING_LENGTH_THRESHOLD` instead of a hardcoded value; please follow `tableMetaToTableProps`.
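
    To illustrate, a minimal, self-contained sketch of the splitting I mean (the helper name and the `.numParts`/`.part.N` key suffixes are assumptions, not existing code; in the real patch the threshold should come from `SCHEMA_STRING_LENGTH_THRESHOLD`, as `tableMetaToTableProps` does):

        // Sketch only: split one oversized property value into numbered parts,
        // mirroring how tableMetaToTableProps splits a large schema string.
        def splitLargeValue(key: String, value: String, threshold: Int): Seq[(String, String)] = {
          if (value.length <= threshold) {
            // small values stay as a single key-value pair
            Seq(key -> value)
          } else {
            val parts = value.grouped(threshold).toSeq
            (s"$key.numParts" -> parts.length.toString) +:
              parts.zipWithIndex.map { case (part, i) => s"$key.part.$i" -> part }
          }
        }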


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83599/testReport)** for PR 19479 at commit [`d9dee61`](https://github.com/apache/spark/commit/d9dee6113a4720b0e9eb713baf9c7394bcdd3eb6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147667888
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -155,6 +156,8 @@ object ColumnStat extends Logging {
       private val KEY_NULL_COUNT = "nullCount"
       private val KEY_AVG_LEN = "avgLen"
       private val KEY_MAX_LEN = "maxLen"
    +  val KEY_HISTOGRAM = "histogram"
    +  val KEY_HISTOGRAM_SEPARATOR = "-"
    --- End diff --
    
    private val?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #82931 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82931/testReport)** for PR 19479 at commit [`6fe9985`](https://github.com/apache/spark/commit/6fe9985872c93b5dfa9972300ba3f59e97834d4c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83285/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83369/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83726/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r148451915
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +219,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    + * This is because we will have to deal with computation between different types of histograms in
    + * some cases, e.g. for join columns.
    + */
    +trait Histogram
     
    -    col.dataType match {
    -      case dt: IntegralType => fixedLenTypeStruct(dt)
    -      case _: DecimalType => fixedLenTypeStruct(col.dataType)
    -      case dt @ (DoubleType | FloatType) => fixedLenTypeStruct(dt)
    -      case BooleanType => fixedLenTypeStruct(col.dataType)
    -      case DateType => fixedLenTypeStruct(col.dataType)
    -      case TimestampType => fixedLenTypeStruct(col.dataType)
    -      case BinaryType | StringType =>
    -        // For string and binary type, we don't store min/max.
    -        val nullLit = Literal(null, col.dataType)
    -        struct(
    -          ndv, nullLit, nullLit, numNulls,
    -          // Set avg/max size to default size if all the values are null or there is no value.
    -          Coalesce(Seq(Ceil(Average(Length(col))), defaultSize)),
    -          Coalesce(Seq(Cast(Max(Length(col)), LongType), defaultSize)))
    -      case _ =>
    -        throw new AnalysisException("Analyzing column statistics is not supported for column " +
    -            s"${col.name} of data type: ${col.dataType}.")
    -    }
    -  }
    +/**
    + * Equi-height histogram represents column value distribution by a sequence of buckets. Each bucket
    + * has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param ehBuckets equi-height histogram buckets
    + */
    +case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram {
    --- End diff --
    
    A histogram is generated per column. If we compute histograms for tens of columns, the number of parameters will grow into the thousands or even tens of thousands, which can hurt metastore access performance.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149852228
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
    @@ -963,98 +964,178 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
         assert(stats.size == data.head.productArity - 1)
         val df = data.toDF(stats.keys.toSeq :+ "carray" : _*)
     
    +    val expectedSerializedColStats = Map(
    +      "spark.sql.statistics.colStats.cbinary.avgLen" -> "3",
    +      "spark.sql.statistics.colStats.cbinary.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cbinary.maxLen" -> "3",
    +      "spark.sql.statistics.colStats.cbinary.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cbinary.version" -> "1",
    +      "spark.sql.statistics.colStats.cbool.avgLen" -> "1",
    +      "spark.sql.statistics.colStats.cbool.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cbool.max" -> "true",
    +      "spark.sql.statistics.colStats.cbool.maxLen" -> "1",
    +      "spark.sql.statistics.colStats.cbool.min" -> "false",
    +      "spark.sql.statistics.colStats.cbool.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cbool.version" -> "1",
    +      "spark.sql.statistics.colStats.cbyte.avgLen" -> "1",
    +      "spark.sql.statistics.colStats.cbyte.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cbyte.max" -> "2",
    +      "spark.sql.statistics.colStats.cbyte.maxLen" -> "1",
    +      "spark.sql.statistics.colStats.cbyte.min" -> "1",
    +      "spark.sql.statistics.colStats.cbyte.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cbyte.version" -> "1",
    +      "spark.sql.statistics.colStats.cdate.avgLen" -> "4",
    +      "spark.sql.statistics.colStats.cdate.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cdate.max" -> "2016-05-09",
    +      "spark.sql.statistics.colStats.cdate.maxLen" -> "4",
    +      "spark.sql.statistics.colStats.cdate.min" -> "2016-05-08",
    +      "spark.sql.statistics.colStats.cdate.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cdate.version" -> "1",
    +      "spark.sql.statistics.colStats.cdecimal.avgLen" -> "16",
    +      "spark.sql.statistics.colStats.cdecimal.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cdecimal.max" -> "8.000000000000000000",
    +      "spark.sql.statistics.colStats.cdecimal.maxLen" -> "16",
    +      "spark.sql.statistics.colStats.cdecimal.min" -> "1.000000000000000000",
    +      "spark.sql.statistics.colStats.cdecimal.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cdecimal.version" -> "1",
    +      "spark.sql.statistics.colStats.cdouble.avgLen" -> "8",
    +      "spark.sql.statistics.colStats.cdouble.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cdouble.max" -> "6.0",
    +      "spark.sql.statistics.colStats.cdouble.maxLen" -> "8",
    +      "spark.sql.statistics.colStats.cdouble.min" -> "1.0",
    +      "spark.sql.statistics.colStats.cdouble.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cdouble.version" -> "1",
    +      "spark.sql.statistics.colStats.cfloat.avgLen" -> "4",
    +      "spark.sql.statistics.colStats.cfloat.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cfloat.max" -> "7.0",
    +      "spark.sql.statistics.colStats.cfloat.maxLen" -> "4",
    +      "spark.sql.statistics.colStats.cfloat.min" -> "1.0",
    +      "spark.sql.statistics.colStats.cfloat.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cfloat.version" -> "1",
    +      "spark.sql.statistics.colStats.cint.avgLen" -> "4",
    +      "spark.sql.statistics.colStats.cint.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cint.max" -> "4",
    +      "spark.sql.statistics.colStats.cint.maxLen" -> "4",
    +      "spark.sql.statistics.colStats.cint.min" -> "1",
    +      "spark.sql.statistics.colStats.cint.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cint.version" -> "1",
    +      "spark.sql.statistics.colStats.clong.avgLen" -> "8",
    +      "spark.sql.statistics.colStats.clong.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.clong.max" -> "5",
    +      "spark.sql.statistics.colStats.clong.maxLen" -> "8",
    +      "spark.sql.statistics.colStats.clong.min" -> "1",
    +      "spark.sql.statistics.colStats.clong.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.clong.version" -> "1",
    +      "spark.sql.statistics.colStats.cshort.avgLen" -> "2",
    +      "spark.sql.statistics.colStats.cshort.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cshort.max" -> "3",
    +      "spark.sql.statistics.colStats.cshort.maxLen" -> "2",
    +      "spark.sql.statistics.colStats.cshort.min" -> "1",
    +      "spark.sql.statistics.colStats.cshort.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cshort.version" -> "1",
    +      "spark.sql.statistics.colStats.cstring.avgLen" -> "3",
    +      "spark.sql.statistics.colStats.cstring.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.cstring.maxLen" -> "3",
    +      "spark.sql.statistics.colStats.cstring.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.cstring.version" -> "1",
    +      "spark.sql.statistics.colStats.ctimestamp.avgLen" -> "8",
    +      "spark.sql.statistics.colStats.ctimestamp.distinctCount" -> "2",
    +      "spark.sql.statistics.colStats.ctimestamp.max" -> "2016-05-09 00:00:02.0",
    +      "spark.sql.statistics.colStats.ctimestamp.maxLen" -> "8",
    +      "spark.sql.statistics.colStats.ctimestamp.min" -> "2016-05-08 00:00:01.0",
    +      "spark.sql.statistics.colStats.ctimestamp.nullCount" -> "1",
    +      "spark.sql.statistics.colStats.ctimestamp.version" -> "1"
    +    )
    +
    +    val expectedSerializedHistograms = Map(
    +      "spark.sql.statistics.colStats.cbyte.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cbyte").histogram.get),
    +      "spark.sql.statistics.colStats.cshort.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cshort").histogram.get),
    +      "spark.sql.statistics.colStats.cint.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cint").histogram.get),
    +      "spark.sql.statistics.colStats.clong.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("clong").histogram.get),
    +      "spark.sql.statistics.colStats.cdouble.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cdouble").histogram.get),
    +      "spark.sql.statistics.colStats.cfloat.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cfloat").histogram.get),
    +      "spark.sql.statistics.colStats.cdecimal.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cdecimal").histogram.get),
    +      "spark.sql.statistics.colStats.cdate.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("cdate").histogram.get),
    +      "spark.sql.statistics.colStats.ctimestamp.histogram" ->
    +        HistogramSerializer.serialize(statsWithHgms("ctimestamp").histogram.get)
    +    )
    +
    +    def checkColStatsProps(expected: Map[String, String]): Unit = {
    +      sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS FOR COLUMNS " + stats.keys.mkString(", "))
    +      val table = hiveClient.getTable("default", tableName)
    +      val props = table.properties.filterKeys(_.startsWith("spark.sql.statistics.colStats"))
    +      assert(props == expected)
    +    }
    +
         withTable(tableName) {
           df.write.saveAsTable(tableName)
     
    -      // Collect statistics
    -      sql(s"analyze table $tableName compute STATISTICS FOR COLUMNS " + stats.keys.mkString(", "))
    +      // Collect and validate statistics
    +      checkColStatsProps(expectedSerializedColStats)
     
    -      // Validate statistics
    -      val table = hiveClient.getTable("default", tableName)
    +      withSQLConf(
    +        SQLConf.HISTOGRAM_ENABLED.key -> "true", SQLConf.HISTOGRAM_BUCKETS_NUM.key -> "2") {
     
    -      val props = table.properties.filterKeys(_.startsWith("spark.sql.statistics.colStats"))
    -      assert(props == Map(
    -        "spark.sql.statistics.colStats.cbinary.avgLen" -> "3",
    -        "spark.sql.statistics.colStats.cbinary.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cbinary.maxLen" -> "3",
    -        "spark.sql.statistics.colStats.cbinary.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cbinary.version" -> "1",
    -        "spark.sql.statistics.colStats.cbool.avgLen" -> "1",
    -        "spark.sql.statistics.colStats.cbool.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cbool.max" -> "true",
    -        "spark.sql.statistics.colStats.cbool.maxLen" -> "1",
    -        "spark.sql.statistics.colStats.cbool.min" -> "false",
    -        "spark.sql.statistics.colStats.cbool.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cbool.version" -> "1",
    -        "spark.sql.statistics.colStats.cbyte.avgLen" -> "1",
    -        "spark.sql.statistics.colStats.cbyte.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cbyte.max" -> "2",
    -        "spark.sql.statistics.colStats.cbyte.maxLen" -> "1",
    -        "spark.sql.statistics.colStats.cbyte.min" -> "1",
    -        "spark.sql.statistics.colStats.cbyte.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cbyte.version" -> "1",
    -        "spark.sql.statistics.colStats.cdate.avgLen" -> "4",
    -        "spark.sql.statistics.colStats.cdate.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cdate.max" -> "2016-05-09",
    -        "spark.sql.statistics.colStats.cdate.maxLen" -> "4",
    -        "spark.sql.statistics.colStats.cdate.min" -> "2016-05-08",
    -        "spark.sql.statistics.colStats.cdate.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cdate.version" -> "1",
    -        "spark.sql.statistics.colStats.cdecimal.avgLen" -> "16",
    -        "spark.sql.statistics.colStats.cdecimal.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cdecimal.max" -> "8.000000000000000000",
    -        "spark.sql.statistics.colStats.cdecimal.maxLen" -> "16",
    -        "spark.sql.statistics.colStats.cdecimal.min" -> "1.000000000000000000",
    -        "spark.sql.statistics.colStats.cdecimal.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cdecimal.version" -> "1",
    -        "spark.sql.statistics.colStats.cdouble.avgLen" -> "8",
    -        "spark.sql.statistics.colStats.cdouble.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cdouble.max" -> "6.0",
    -        "spark.sql.statistics.colStats.cdouble.maxLen" -> "8",
    -        "spark.sql.statistics.colStats.cdouble.min" -> "1.0",
    -        "spark.sql.statistics.colStats.cdouble.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cdouble.version" -> "1",
    -        "spark.sql.statistics.colStats.cfloat.avgLen" -> "4",
    -        "spark.sql.statistics.colStats.cfloat.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cfloat.max" -> "7.0",
    -        "spark.sql.statistics.colStats.cfloat.maxLen" -> "4",
    -        "spark.sql.statistics.colStats.cfloat.min" -> "1.0",
    -        "spark.sql.statistics.colStats.cfloat.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cfloat.version" -> "1",
    -        "spark.sql.statistics.colStats.cint.avgLen" -> "4",
    -        "spark.sql.statistics.colStats.cint.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cint.max" -> "4",
    -        "spark.sql.statistics.colStats.cint.maxLen" -> "4",
    -        "spark.sql.statistics.colStats.cint.min" -> "1",
    -        "spark.sql.statistics.colStats.cint.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cint.version" -> "1",
    -        "spark.sql.statistics.colStats.clong.avgLen" -> "8",
    -        "spark.sql.statistics.colStats.clong.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.clong.max" -> "5",
    -        "spark.sql.statistics.colStats.clong.maxLen" -> "8",
    -        "spark.sql.statistics.colStats.clong.min" -> "1",
    -        "spark.sql.statistics.colStats.clong.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.clong.version" -> "1",
    -        "spark.sql.statistics.colStats.cshort.avgLen" -> "2",
    -        "spark.sql.statistics.colStats.cshort.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cshort.max" -> "3",
    -        "spark.sql.statistics.colStats.cshort.maxLen" -> "2",
    -        "spark.sql.statistics.colStats.cshort.min" -> "1",
    -        "spark.sql.statistics.colStats.cshort.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cshort.version" -> "1",
    -        "spark.sql.statistics.colStats.cstring.avgLen" -> "3",
    -        "spark.sql.statistics.colStats.cstring.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.cstring.maxLen" -> "3",
    -        "spark.sql.statistics.colStats.cstring.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.cstring.version" -> "1",
    -        "spark.sql.statistics.colStats.ctimestamp.avgLen" -> "8",
    -        "spark.sql.statistics.colStats.ctimestamp.distinctCount" -> "2",
    -        "spark.sql.statistics.colStats.ctimestamp.max" -> "2016-05-09 00:00:02.0",
    -        "spark.sql.statistics.colStats.ctimestamp.maxLen" -> "8",
    -        "spark.sql.statistics.colStats.ctimestamp.min" -> "2016-05-08 00:00:01.0",
    -        "spark.sql.statistics.colStats.ctimestamp.nullCount" -> "1",
    -        "spark.sql.statistics.colStats.ctimestamp.version" -> "1"
    -      ))
    +        checkColStatsProps(expectedSerializedColStats ++ expectedSerializedHistograms)
    +      }
    +    }
    +  }
    +
    +  test("serde/deser of histograms exceeding 4000 length") {
    +    import testImplicits._
    +
    +    def checkBucketsOrder(buckets: Array[HistogramBucket]): Unit = {
    +      for (i <- buckets.indices) {
    +        val b = buckets(i)
    +        assert(b.lo <= b.hi)
    +        if (i > 0) {
    +          val pre = buckets(i - 1)
    +          assert(pre.hi <= b.lo)
    +        }
    +      }
    +    }
    +
    +    val startTimestamp = DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf("2016-05-08 00:00:01"))
    +    val df = (1 to 5000)
    +      .map(i => (i, DateTimeUtils.toJavaTimestamp(startTimestamp + i)))
    +      .toDF("cint", "ctimestamp")
    +    val tableName = "long_histogram_test"
    +
    +    withTable(tableName) {
    +      df.write.saveAsTable(tableName)
    +
    +      withSQLConf(
    +        SQLConf.HISTOGRAM_ENABLED.key -> "true", SQLConf.HISTOGRAM_BUCKETS_NUM.key -> "1000") {
    --- End diff --
    
    If users set 1000 buckets, I think it's more reasonable to just fail... My point is, this corner case isn't worth the extra complexity.
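
    If we go that way, a rough sketch of the guard I have in mind (the helper name, exception type, and message are placeholders, not actual code; the real patch would presumably use `SCHEMA_STRING_LENGTH_THRESHOLD` and an `AnalysisException`):

        // Sketch only: fail fast when a serialized histogram would not fit in a
        // single metastore property, instead of adding splitting logic for it.
        def checkHistogramLength(colName: String, serialized: String, threshold: Int): Unit = {
          if (serialized.length > threshold) {
            throw new IllegalArgumentException(
              s"Serialized histogram for column $colName is longer than $threshold characters; " +
                "please use fewer histogram buckets.")
          }
        }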


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147887335
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    --- End diff --
    
    It's not a high priority; here I just want to say it's doable, but it would complicate the estimation logic.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149058649
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    --- End diff --
    
    I think we are not going to do that. Let's remove the abstraction and make the code simpler. cc @rxin @juliuszsompolski


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    cc @sameeragarwal @cloud-fan @gatorsmile @ron8hu 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147947097
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1032,7 +1032,21 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        if (k == ColumnStat.KEY_HISTOGRAM) {
    --- End diff --
    
    Just make sure that if the limit is not hit, we still generate a single, simple key-value property.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Also cc @juliuszsompolski @ala @hvanhovell 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83726/testReport)** for PR 19479 at commit [`976acbe`](https://github.com/apache/spark/commit/976acbe501227735ad54725eaa8b9dad77b6b618).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150391245
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +313,127 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val bins = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBin(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, bins)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of bins.
    + * Each bin has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bin
    + * @param bins equi-height histogram bins
    + */
    +case class Histogram(height: Double, bins: Array[HistogramBin]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && bins.sameElements(otherHgm.bins)
    +    case _ => false
    +  }
    +
    +  override def hashCode(): Int = {
    +    val temp = java.lang.Double.doubleToLongBits(height)
    +    var result = (temp ^ (temp >>> 32)).toInt
    +    result = 31 * result + java.util.Arrays.hashCode(bins.asInstanceOf[Array[AnyRef]])
    +    result
    +  }
    +}
    +
    +/**
    + * A bin in an equi-height histogram. We use double type for lower/higher bound for simplicity.
    + * @param lo lower bound of the value range in this bin
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147669959
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -216,65 +218,61 @@ object ColumnStat extends Logging {
         }
       }
     
    -  /**
    -   * Constructs an expression to compute column statistics for a given column.
    -   *
    -   * The expression should create a single struct column with the following schema:
    -   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long
    -   *
    -   * Together with [[rowToColumnStat]], this function is used to create [[ColumnStat]] and
    -   * as a result should stay in sync with it.
    -   */
    -  def statExprs(col: Attribute, relativeSD: Double): CreateNamedStruct = {
    -    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    -      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    -    })
    -    val one = Literal(1, LongType)
    +  private def convertToHistogram(s: String): EquiHeightHistogram = {
    +    val idx = s.indexOf(",")
    +    if (idx <= 0) {
    +      throw new AnalysisException("Failed to parse histogram.")
    +    }
    +    val height = s.substring(0, idx).toDouble
    +    val pattern = "Bucket\\(([^,]+), ([^,]+), ([^\\)]+)\\)".r
    +    val buckets = pattern.findAllMatchIn(s).map { m =>
    +      EquiHeightBucket(m.group(1).toDouble, m.group(2).toDouble, m.group(3).toLong)
    +    }.toSeq
    +    EquiHeightHistogram(height, buckets)
    +  }
     
    -    // the approximate ndv (num distinct value) should never be larger than the number of rows
    -    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    -    val ndv = Least(Seq(HyperLogLogPlusPlus(col, relativeSD), numNonNulls))
    -    val numNulls = Subtract(Count(one), numNonNulls)
    -    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +}
     
    -    def fixedLenTypeStruct(castType: DataType) = {
    -      // For fixed width types, avg size should be the same as max size.
    -      struct(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    -        defaultSize)
    -    }
    +/**
    + * There are a few types of histograms in state-of-the-art estimation methods. E.g. equi-width
    + * histogram, equi-height histogram, frequency histogram (value-frequency pairs) and hybrid
    + * histogram, etc.
    + * Currently in Spark, we support equi-height histogram since it is good at handling skew
    + * distribution, and also provides reasonable accuracy in other cases.
    + * We can add other histograms in the future, which will make estimation logic more complicated.
    --- End diff --
    
    are we really gonna do that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r149850111
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
    @@ -275,6 +317,122 @@ object ColumnStat extends Logging {
           avgLen = row.getLong(4),
           maxLen = row.getLong(5)
         )
    +    if (row.isNullAt(6)) {
    +      cs
    +    } else {
    +      val ndvs = row.getArray(6).toLongArray()
    +      assert(percentiles.get.numElements() == ndvs.length + 1)
    +      val endpoints = percentiles.get.toArray[Any](attr.dataType).map(_.toString.toDouble)
    +      // Construct equi-height histogram
    +      val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
    +        HistogramBucket(endpoints(i), endpoints(i + 1), ndv)
    +      }
    +      val nonNullRows = rowCount - cs.nullCount
    +      val histogram = Histogram(nonNullRows.toDouble / ndvs.length, buckets)
    +      cs.copy(histogram = Some(histogram))
    +    }
    +  }
    +
    +}
    +
    +/**
    + * This class is an implementation of equi-height histogram.
    + * Equi-height histogram represents the distribution of a column's values by a sequence of buckets.
    + * Each bucket has a value range and contains approximately the same number of rows.
    + * @param height number of rows in each bucket
    + * @param buckets equi-height histogram buckets
    + */
    +case class Histogram(height: Double, buckets: Array[HistogramBucket]) {
    +
    +  // Only for histogram equality test.
    +  override def equals(other: Any): Boolean = other match {
    +    case otherHgm: Histogram =>
    +      height == otherHgm.height && buckets.sameElements(otherHgm.buckets)
    +    case _ => false
       }
     
    +  override def hashCode(): Int = super.hashCode()
    --- End diff --
    
    please implement it correctly.
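
The point being raised is a general Scala pitfall: a case class with an `Array` field inherits the array's reference-based `hashCode`, so two structurally equal histograms can hash differently even though the overridden `equals` says they are equal. Below is a minimal sketch of one way to keep the two consistent, using `MurmurHash3.arrayHash`; it is illustrative only and not necessarily the fix that was merged.

```scala
import scala.util.hashing.MurmurHash3

case class HistogramBucket(lo: Double, hi: Double, ndv: Long)

case class Histogram(height: Double, buckets: Array[HistogramBucket]) {
  // Structural equality: compare array contents rather than the array reference.
  override def equals(other: Any): Boolean = other match {
    case that: Histogram => height == that.height && buckets.sameElements(that.buckets)
    case _ => false
  }

  // Hash the same fields that equals compares, so equal histograms hash equally.
  override def hashCode(): Int =
    31 * java.lang.Double.hashCode(height) + MurmurHash3.arrayHash(buckets)
}

object HistogramHashCheck {
  def main(args: Array[String]): Unit = {
    val a = Histogram(10.0, Array(HistogramBucket(1.0, 5.0, 4L), HistogramBucket(5.0, 9.0, 3L)))
    val b = Histogram(10.0, Array(HistogramBucket(1.0, 5.0, 4L), HistogramBucket(5.0, 9.0, 3L)))
    assert(a == b && a.hashCode == b.hashCode)
  }
}
```

Hashing the boxed array with `java.util.Arrays.hashCode` would work equally well; the key design point is that `hashCode` must use only the fields that `equals` also uses.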


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83645/testReport)** for PR 19479 at commit [`8af3868`](https://github.com/apache/spark/commit/8af38687d638ae2d94d9f76955b182df02404cce).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83735/
    Test FAILed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82801/
    Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83626/testReport)** for PR 19479 at commit [`72c46f8`](https://github.com/apache/spark/commit/72c46f844967039ec2009de6cd93b9733ab1e8b8).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    @cloud-fan Yes, showing the histogram in the DESC COLUMN command is in the plan. I just created a JIRA for it.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83645/
    Test PASSed.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r147887853
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
    @@ -89,19 +93,159 @@ case class AnalyzeColumnCommand(
         // The first element in the result will be the overall row count, the following elements
         // will be structs containing all column stats.
         // The layout of each struct follows the layout of the ColumnStats.
    -    val ndvMaxErr = sparkSession.sessionState.conf.ndvMaxError
         val expressions = Count(Literal(1)).toAggregateExpression() +:
    -      attributesToAnalyze.map(ColumnStat.statExprs(_, ndvMaxErr))
    +      attributesToAnalyze.map(statExprs(_, sparkSession.sessionState.conf))
     
         val namedExpressions = expressions.map(e => Alias(e, e.toString)())
         val statsRow = new QueryExecution(sparkSession, Aggregate(Nil, namedExpressions, relation))
           .executedPlan.executeTake(1).head
     
         val rowCount = statsRow.getLong(0)
    -    val columnStats = attributesToAnalyze.zipWithIndex.map { case (attr, i) =>
    -      // according to `ColumnStat.statExprs`, the stats struct always have 6 fields.
    -      (attr.name, ColumnStat.rowToColumnStat(statsRow.getStruct(i + 1, 6), attr))
    -    }.toMap
    -    (rowCount, columnStats)
    +    val colStats = rowToColumnStats(sparkSession, relation, attributesToAnalyze, statsRow, rowCount)
    +    (rowCount, colStats)
    +  }
    +
    +  /**
    +   * Constructs an expression to compute column statistics for a given column.
    +   *
    +   * The expression should create a single struct column with the following schema:
    +   * distinctCount: Long, min: T, max: T, nullCount: Long, avgLen: Long, maxLen: Long,
    +   * percentiles: Array[T]
    +   *
    +   * Together with [[rowToColumnStats]], this function is used to create [[ColumnStat]] and
    +   * as a result should stay in sync with it.
    +   */
    +  private def statExprs(col: Attribute, conf: SQLConf): CreateNamedStruct = {
    +    def struct(exprs: Expression*): CreateNamedStruct = CreateStruct(exprs.map { expr =>
    +      expr.transformUp { case af: AggregateFunction => af.toAggregateExpression() }
    +    })
    +    val one = Literal(1, LongType)
    +
    +    // the approximate ndv (num distinct value) should never be larger than the number of rows
    +    val numNonNulls = if (col.nullable) Count(col) else Count(one)
    +    val ndv = Least(Seq(HyperLogLogPlusPlus(col, conf.ndvMaxError), numNonNulls))
    +    val numNulls = Subtract(Count(one), numNonNulls)
    +    val defaultSize = Literal(col.dataType.defaultSize, LongType)
    +    val nullArray = Literal(null, ArrayType(DoubleType))
    +
    +    def fixedLenTypeExprs(castType: DataType) = {
    +      // For fixed width types, avg size should be the same as max size.
    +      Seq(ndv, Cast(Min(col), castType), Cast(Max(col), castType), numNulls, defaultSize,
    +        defaultSize)
    +    }
    +
    +    def fixedLenTypeStruct(castType: DataType, genHistogram: Boolean) = {
    +      val percentileExpr = if (genHistogram) {
    +        // To generate equi-height histogram, we need to:
    +        // 1. get percentiles p(1/n), p(2/n) ... p((n-1)/n),
    +        // 2. use min, max, and percentiles as range values of buckets, e.g. [min, p(1/n)],
    +        // [p(1/n), p(2/n)] ... [p((n-1)/n), max], and then count ndv in each bucket.
    +        // Step 2 will be performed in `rowToColumnStats`.
    --- End diff --
    
    Do you mean calculating percentiles for min/max in step 1? Currently the other percentiles are already calculated in step 1.
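
To make the two-step flow discussed above concrete: a small standalone sketch of step 2, where the endpoints (min, the n-1 percentiles from step 1, max) and the per-bucket ndv counts are zipped into equi-height buckets. The helper name, the sample endpoints, and the ndv values are made up for illustration; only the zipping logic mirrors the quoted diff.

```scala
case class HistogramBucket(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, buckets: Seq[HistogramBucket])

object HistogramAssemblySketch {
  // endpoints = min +: percentiles(1/n .. (n-1)/n) :+ max,
  // so there is exactly one more endpoint than there are buckets.
  def buildHistogram(endpoints: Array[Double], ndvs: Array[Long], nonNullRows: Long): Histogram = {
    require(endpoints.length == ndvs.length + 1, "need one more endpoint than buckets")
    val buckets = ndvs.zipWithIndex.map { case (ndv, i) =>
      HistogramBucket(endpoints(i), endpoints(i + 1), ndv)
    }.toSeq
    // Every bucket holds roughly the same number of non-null rows by construction.
    Histogram(nonNullRows.toDouble / ndvs.length, buckets)
  }

  def main(args: Array[String]): Unit = {
    // 4 buckets over [1, 100]; the interior endpoints stand in for p(1/4), p(2/4), p(3/4).
    val endpoints = Array(1.0, 25.0, 50.0, 75.0, 100.0)
    val ndvs = Array(20L, 18L, 22L, 19L)
    println(buildHistogram(endpoints, ndvs, nonNullRows = 400L))
  }
}
```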


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83735 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83735/testReport)** for PR 19479 at commit [`7b6d211`](https://github.com/apache/spark/commit/7b6d211f4f082415cc182de9e34f7420d65e94eb).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83343/
    Test FAILed.


---



[GitHub] spark pull request #19479: [SPARK-17074] [SQL] Generate equi-height histogra...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19479#discussion_r150011624
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -1034,11 +1034,18 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
           schema.fields.map(f => (f.name, f.dataType)).toMap
         stats.colStats.foreach { case (colName, colStat) =>
           colStat.toMap(colName, colNameTypeMap(colName)).foreach { case (k, v) =>
    -        statsProperties += (columnStatKeyPropName(colName, k) -> v)
    +        val statKey = columnStatKeyPropName(colName, k)
    +        val threshold = conf.get(SCHEMA_STRING_LENGTH_THRESHOLD)
    +        if (v.length > threshold) {
    +          throw new AnalysisException(s"Cannot persist '$statKey' into hive metastore as " +
    --- End diff --
    
    What if we don't do it? Will Hive give us an exception?
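
For context on the guard in the quoted diff: a small standalone sketch of the validate-before-persist pattern, i.e. rejecting a serialized statistic whose string form exceeds the configured length limit before it ever reaches the metastore. The threshold value, exception type, and property key below are placeholders; the PR reads the limit from `SCHEMA_STRING_LENGTH_THRESHOLD`, and what an unguarded oversized value does depends on the underlying metastore.

```scala
import scala.collection.mutable

object StatsPropertyGuardSketch {
  // Placeholder for conf.get(SCHEMA_STRING_LENGTH_THRESHOLD) in the quoted diff.
  val lengthThreshold: Int = 4000

  def putStatProperty(props: mutable.Map[String, String], key: String, value: String): Unit = {
    if (value.length > lengthThreshold) {
      // Fail fast with an actionable message instead of relying on the metastore
      // to reject, truncate, or silently corrupt an oversized property value.
      throw new IllegalArgumentException(
        s"Cannot persist '$key': serialized statistic has ${value.length} characters, " +
          s"exceeding the $lengthThreshold-character limit.")
    }
    props += (key -> value)
  }

  def main(args: Array[String]): Unit = {
    val props = mutable.Map.empty[String, String]
    // Hypothetical property key, used only to exercise the guard.
    putStatProperty(props, "colStats.c1.histogram", "x" * 100)
    println(props.keys.mkString(", "))
  }
}
```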


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    **[Test build #83599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83599/testReport)** for PR 19479 at commit [`d9dee61`](https://github.com/apache/spark/commit/d9dee6113a4720b0e9eb713baf9c7394bcdd3eb6).


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19479: [SPARK-17074] [SQL] Generate equi-height histogram in co...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19479
  
    Merged build finished. Test PASSed.


---
