Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/12/14 00:27:58 UTC
[jira] [Created] (SPARK-18853) Project is way too aggressive in estimating statistics
Reynold Xin created SPARK-18853:
-----------------------------------
Summary: Project is way too aggressive in estimating statistics
Key: SPARK-18853
URL: https://issues.apache.org/jira/browse/SPARK-18853
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Reynold Xin
We currently define statistics in UnaryNode:
{code}
override def statistics: Statistics = {
  // There should be some overhead in the Row object, so the size should not be
  // zero when there are no columns; this helps prevent a divide-by-zero error.
  val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Assume there will be the same number of rows as the child has.
  var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
  if (sizeInBytes == 0) {
    // sizeInBytes can't be zero, or the sizeInBytes of a BinaryNode (the
    // product of its children's sizes) would also be zero.
    sizeInBytes = 1
  }
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
{code}
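To see the arithmetic concretely, here is a standalone sketch (not Spark code) of the estimate above, using hypothetical default sizes: a child row holding an int (4 bytes) plus an array field estimated at 100 * 4 = 400 bytes, with 8 bytes of row overhead.

```scala
// Standalone sketch of the UnaryNode size estimate: scale the child's total
// size by the ratio of output row width to child row width, flooring at 1.
object SizeEstimateSketch {
  def estimate(childSizeInBytes: BigInt, childRowSize: Long, outputRowSize: Long): BigInt = {
    val scaled = childSizeInBytes * outputRowSize / childRowSize
    if (scaled == 0) BigInt(1) else scaled
  }

  def main(args: Array[String]): Unit = {
    val childRowSize = 4 + 100 * 4 + 8 // int + assumed 100-element int array + overhead
    val outputRowSize = 4 + 8          // the projection keeps only the int
    // Under this ratio a 1 GB child shrinks to roughly 30 MB.
    println(SizeEstimateSketch.estimate(BigInt(1) << 30, childRowSize, outputRowSize))
  }
}
```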
This statistics computation has a few issues:
1. It can aggressively underestimate the size for Project. We currently assume each array/map has 100 elements, which is usually an overestimate; if the user projects a single scalar field out of a row containing a deeply nested field, that ratio leads to a huge underestimate of the output size. A saner default is probably 2 elements.
2. Propagating statistics this way is not a property of UnaryNode in general; it should be a property of Project specifically.
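Issue 1 can be illustrated with a hypothetical schema (again not Spark code): the default size for an array multiplies the element size by an assumed element count, today 100. Projecting only an 8-byte long out of a row that also holds an array of arrays of longs then shrinks the estimate dramatically, whereas an assumed count of 2 is far less aggressive.

```scala
// Hypothetical illustration of the 100-elements-per-array assumption versus
// the suggested default of 2, for a 1 GB child relation.
object NestedDefaultSizeSketch {
  // Array size estimate: assumed element count times per-element size.
  def arraySize(elementSize: Long, assumedElements: Long): Long =
    assumedElements * elementSize

  // Estimated Project output size when keeping only an 8-byte long from a
  // row that also contains an array of arrays of longs (8 bytes row overhead).
  def projectedSize(assumedElements: Long): BigInt = {
    val nested = arraySize(arraySize(8, assumedElements), assumedElements)
    val childRowSize = 8 + nested + 8
    val outputRowSize = 8 + 8
    (BigInt(1) << 30) * outputRowSize / childRowSize
  }

  def main(args: Array[String]): Unit = {
    println(NestedDefaultSizeSketch.projectedSize(100)) // roughly 200 KB
    println(NestedDefaultSizeSketch.projectedSize(2))   // roughly 340 MB
  }
}
```

With 100 assumed elements the nested array dominates the child row width, so the 1 GB child is estimated down to a few hundred kilobytes; with 2 assumed elements the estimate stays in the hundreds of megabytes.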
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org