Posted to issues@spark.apache.org by "Babulal (JIRA)" <ji...@apache.org> on 2019/07/16 16:05:00 UTC
[jira] [Updated] (SPARK-28413) sizeInByte is Not updated for parquet datasource on Next Insert.
[ https://issues.apache.org/jira/browse/SPARK-28413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Babulal updated SPARK-28413:
----------------------------
Description:
SPARK-21237 (https://issues.apache.org/jira/browse/SPARK-21237) fixed the statistics update when data is appended with write.mode("append"). However, when a parquet table of the same type is created using SQL and data is inserted with INSERT INTO, the reported stats are incorrect (not updated).
*+Correct Stats Example (SPARK-21237)+*
scala> spark.range(100).write.saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------
|plan
+------------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
scala> spark.range(100).write.mode("append").saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
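The change from 784.0 B to 1568.0 B after the append can be checked programmatically rather than by eye. The following is a minimal sketch in plain Scala, assuming the plan text has the shape shown above. `PlanStats` and `sizeInBytes` are hypothetical names, not Spark APIs, and the 1024-based unit multipliers are an assumption about how Spark renders sizes.

```scala
object PlanStats {
  // Matches e.g. "sizeInBytes=784.0 B" or "sizeInBytes=1.5 KB".
  // Group 1 is the number, group 2 the optional unit prefix (empty for plain bytes).
  private val SizePattern =
    """sizeInBytes=([0-9]+(?:\.[0-9]+)?)\s*([KMGTPE]?)i?B""".r

  // Assumption: each unit step is a factor of 1024.
  private val Multipliers: Map[String, Double] =
    Seq("", "K", "M", "G", "T", "P", "E").zipWithIndex.map {
      case (prefix, i) => prefix -> math.pow(1024.0, i.toDouble)
    }.toMap

  // Returns the first sizeInBytes value found in the plan text, in bytes,
  // or None if the text contains no such entry.
  def sizeInBytes(plan: String): Option[Double] =
    SizePattern.findFirstMatchIn(plan).map { m =>
      m.group(1).toDouble * Multipliers(m.group(2))
    }
}
```

Applied to the two "Optimized Logical Plan" lines above, it yields Some(784.0) and Some(1568.0), a ratio of exactly 2, which is consistent with writing 100 rows twice.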
+*Incorrect Stats Example*+
scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
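To check this without reading the plan text, the size estimate the optimizer uses can be read directly in spark-shell. The sketch below assumes Spark 2.3 or later, a session where `spark` is the SparkSession, and a freshly created tab2 as in the example above. `plannedSize` is a hypothetical helper. Whether `REFRESH TABLE` or `ANALYZE TABLE` restores the correct size for this issue is an assumption and has not been verified here.

```scala
// Run inside spark-shell; `spark` is the session provided by the shell.

// Size estimate (in bytes) that the optimizer currently holds for a table.
def plannedSize(table: String): BigInt =
  spark.table(table).queryExecution.optimizedPlan.stats.sizeInBytes

val before = plannedSize("tab2")

spark.sql("insert into tab2 select 1")
val afterInsert = plannedSize("tab2") // the report observes no change at this point

// Possible workarounds (assumptions, not confirmed for this issue):
spark.sql("refresh table tab2")                            // invalidate cached table metadata
spark.sql("analyze table tab2 compute statistics noscan")  // recompute the size stored in the catalog
val afterRefresh = plannedSize("tab2")

println(s"before=$before afterInsert=$afterInsert afterRefresh=$afterRefresh")
```

If afterInsert equals before while afterRefresh is larger, that would point at stale cached or catalog statistics rather than at the data files themselves.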
Both tables are of the same type:
scala> spark.sql("desc formatted tab1").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name                    |data_type                                                     |
+----------------------------+--------------------------------------------------------------+
|id                          |bigint                                                        |
|                            |                                                              |
|# Detailed Table Information|                                                              |
|Database                    |default                                                       |
|Table                       |tab1                                                          |
|Owner                       |Administrator                                                 |
|Created Time                |Tue Jul 16 21:08:35 IST 2019                                  |
|Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |
|Created By                  |Spark 2.3.2                                                   |
|Type                        |MANAGED                                                       |
|Provider                    |parquet                                                       |
|Table Properties            |[transient_lastDdlTime=1563291579]                            |
|Statistics                  |1568 bytes                                                    |
|Location                    |file:/x/2                                                     |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
scala> spark.sql("desc formatted tab2").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name                    |data_type                                                     |
+----------------------------+--------------------------------------------------------------+
|id                          |bigint                                                        |
|                            |                                                              |
|# Detailed Table Information|                                                              |
|Database                    |default                                                       |
|Table                       |tab2                                                          |
|Owner                       |Administrator                                                 |
|Created Time                |Tue Jul 16 21:10:24 IST 2019                                  |
|Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |
|Created By                  |Spark 2.3.2                                                   |
|Type                        |MANAGED                                                       |
|Provider                    |parquet                                                       |
|Table Properties            |[transient_lastDdlTime=1563291624]                            |
|Location                    |file:/x/1                                                     |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
was:
SPARK-21237 (https://issues.apache.org/jira/browse/SPARK-21237) fixed the statistics update when data is appended with write.mode("append"). However, when a parquet table of the same type is created using SQL and data is inserted with INSERT INTO, the reported stats are incorrect (not updated).
*+Correct Stats Example (SPARK-21237)+*
scala> spark.range(100).write.saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+------------------------------------------------------------------------
|plan
+------------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#10L] parquet, Statistics(*sizeInBytes=784.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab1[id#10L] Batched: false, Format: Parquet,
scala> spark.range(100).write.mode("append").saveAsTable("tab1")
scala> spark.sql("explain cost select * from tab1").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#23L] parquet, Statistics(*sizeInBytes=1568.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab1[id#23L] Batched: false, Format: Parquet,
+*Incorrect Stats Example*+
scala> spark.sql("create table tab2(id bigint) using parquet")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
scala> spark.sql("insert into tab2 select 1")
res9: org.apache.spark.sql.DataFrame = []
scala> spark.sql("explain cost select * from tab2").show(false)
+----------------------------------------------------------------------
|plan
+----------------------------------------------------------------------
|== Optimized Logical Plan ==
Relation[id#30L] parquet, Statistics(*sizeInBytes=374.0 B*, hints=none)
== Physical Plan ==
FileScan parquet default.tab2[id#30L] Batched: false, Format: Parquet,
Both tables are of the same type:
scala> spark.sql("desc formatted tab1").show(2000,false)
+----------------------------+--------------------------------------------------------------+
|col_name |data_type |
+----------------------------+--------------------------------------------------------------+
|id |bigint |
| | |
|# Detailed Table Information| |
|Database |default |
|Table |tab1 |
|Owner |Administrator |
|Created Time |Tue Jul 16 21:08:35 IST 2019 |
|Last Access |Thu Jan 01 05:30:00 IST 1970 |
|Created By |Spark 2.3.2 |
|Type |MANAGED |
|Provider |parquet |
|Table Properties |[transient_lastDdlTime=1563291579] |
|Statistics |1568 bytes |
|Location |file:/x/2 |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|
scala> spark.sql("desc formatted tab2").show(2000,false)
+----------------------------+--------------------------------------------------------------
|col_name |data_type
+----------------------------+--------------------------------------------------------------
|id |bigint
| |
|# Detailed Table Information|
|Database |default
|Table |tab2
|Owner |Administrator
|Created Time |Tue Jul 16 21:10:24 IST 2019
|Last Access |Thu Jan 01 05:30:00 IST 1970
|Created By |Spark 2.3.2
|Type |MANAGED
|Provider |parquet
|Table Properties |[transient_lastDdlTime=1563291624]
|Location |file:/x/1
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> sizeInByte is Not updated for parquet datasource on Next Insert.
> ----------------------------------------------------------------
>
> Key: SPARK-28413
> URL: https://issues.apache.org/jira/browse/SPARK-28413
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2, 2.4.1
> Reporter: Babulal
> Priority: Minor
>