Posted to issues@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2022/08/05 17:47:00 UTC

[jira] [Created] (ARROW-17325) AQE should use available column statistics from completed query stages

Andy Grove created ARROW-17325:
----------------------------------

             Summary: AQE should use available column statistics from completed query stages
                 Key: ARROW-17325
                 URL: https://issues.apache.org/jira/browse/ARROW-17325
             Project: Apache Arrow
          Issue Type: Improvement
          Components: SQL
            Reporter: Andy Grove


In QueryStageExec.computeStats we copy partial statistics from materialized query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls ShuffleExchangeLike#runtimeStatistics or BroadcastExchangeLike#runtimeStatistics.

 

Only dataSize and numOutputRows are copied into the new Statistics object:

 {code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would also like to copy over the column statistics stored in Statistics.attributeStats so that they can be fed back into the logical plan optimization phase.

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do not currently provide such column statistics, but other custom implementations can.
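
A minimal sketch of what the change might look like, written against simplified stand-in types rather than Spark's real classes so that it runs on its own. The ColumnStat, Statistics and QueryStage definitions below are assumptions for illustration only: in Spark the column statistics are keyed by Attribute in an AttributeMap (a plain Map[String, ColumnStat] is used here instead), and the runtime statistics come from getRuntimeStatistics rather than a constructor argument.

```scala
// Simplified stand-ins for Spark's catalyst types; NOT the real classes.
final case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    nullCount: Option[BigInt] = None)

final case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    // Assumption: keyed by column name here; Spark keys this by Attribute.
    attributeStats: Map[String, ColumnStat] = Map.empty,
    isRuntime: Boolean = false)

// Hypothetical query stage holding only what computeStats needs.
final case class QueryStage(isMaterialized: Boolean, runtimeStats: Statistics) {
  def computeStats(): Option[Statistics] =
    if (isMaterialized) {
      val dataSize = runtimeStats.sizeInBytes.max(0)
      val numOutputRows = runtimeStats.rowCount.map(_.max(0))
      // Proposed change: also carry over whatever column statistics the
      // exchange implementation reported, instead of dropping them.
      Some(Statistics(dataSize, numOutputRows, runtimeStats.attributeStats, isRuntime = true))
    } else {
      None
    }
}
```

With this shape, an exchange implementation that reports no column statistics still yields an empty map, so the behaviour for the built-in Spark exchanges would be unchanged.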



--
This message was sent by Atlassian Jira
(v8.20.10#820010)