Posted to issues@spark.apache.org by "Andrew Ray (JIRA)" <ji...@apache.org> on 2017/01/13 15:41:26 UTC

[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file

    [ https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821914#comment-15821914 ] 

Andrew Ray commented on SPARK-19116:
------------------------------------

The 2318 number is the total size of the parquet files written to disk.
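
For reference, a quick way to sanity-check that figure (a minimal sketch; it assumes the path and the session variable from the example below, and exact byte counts will vary by environment):

{code}
import os

path = 'c:/scratch/hundred_integers.parquet'

# Total on-disk size of everything under the output directory.  Spark's
# file listing typically skips hidden/metadata files (e.g. _SUCCESS,
# .crc), so the two totals may not match exactly.
on_disk = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(path)
    for name in names
)
print(on_disk)

# The statistic the optimizer reports for the same relation (2318 in the example).
df = session.read.parquet(path)
print(df._jdf.logicalPlan().statistics().sizeInBytes())
{code}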

> LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
> -----------------------------------------------------------------
>
>                 Key: SPARK-19116
>                 URL: https://issues.apache.org/jira/browse/SPARK-19116
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.1, 2.0.2
>         Environment: Python 3.5.x
> Windows 10
>            Reporter: Shea Parkes
>
> We're having some moderately severe issues with broadcast join inference, and I've been chasing them through the join heuristics in the Catalyst engine.  I've made it as far as I can, and I've hit upon something that does not make sense to me.
> I thought that loading from parquet would produce a relation plan that would just use the sum of the default sizeInBytes for each column times the number of rows.  But this trivial example shows that I am not correct:
> {code}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> session = SparkSession.builder.getOrCreate()  # 'session' used throughout below
> df_range = session.range(100).select(F.col('id').cast('integer'))
> df_range.write.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet')
> df_parquet.explain(True)
> # Expected sizeInBytes
> integer_default_sizeinbytes = 4
> print(df_parquet.count() * integer_default_sizeinbytes)  # = 400
> # Inferred sizeInBytes
> print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes())  # = 2318
> # For posterity (Didn't really expect this to match anything above)
> print(df_range._jdf.logicalPlan().statistics().sizeInBytes())  # = 600
> {code}
> And here are the results of explain(True) on df_parquet:
> {code}
> == Parsed Logical Plan ==
> Relation[id#794] parquet
> == Analyzed Logical Plan ==
> id: int
> Relation[id#794] parquet
> == Optimized Logical Plan ==
> Relation[id#794] parquet
> == Physical Plan ==
> *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
> {code}
> So basically, I don't understand how the size of the parquet file is being estimated.  I don't expect it to be extremely accurate, but empirically it's so inaccurate that we're having to adjust spark.sql.autoBroadcastJoinThreshold way too much.  (It's not always too high like the example above; it's often way too low.)
> Without deeper understanding, I'm considering a result of 2318 instead of 400 to be a bug.  My apologies if I'm missing something obvious.
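> For reference, the kinds of workarounds we're left with in the meantime (a rough sketch; small_df and large_df are placeholders, not the frames from the example above): either adjust the threshold per session, or force the broadcast with an explicit hint so the estimate doesn't matter.
> {code}
> import pyspark.sql.functions as F
> 
> # Force a broadcast join regardless of what sizeInBytes says.
> joined = large_df.join(F.broadcast(small_df), on='id')
> 
> # Or tune the threshold for the session (value is in bytes; -1 disables
> # automatic broadcast joins entirely).
> session.conf.set('spark.sql.autoBroadcastJoinThreshold', 50 * 1024 * 1024)
> {code}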


