
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/04 04:52:57 UTC

[GitHub] [spark] wangyum opened a new pull request #24523: [SPARK-27631][SQL] Avoid repeating calculate table statistics when AUTO_SIZE_UPDATE_ENABLED is enabled

URL: https://github.com/apache/spark/pull/24523
 
 
   ## What changes were proposed in this pull request?
   
   How to reproduce this issue:
   ```shell
   build/sbt clean package -Phive -Phadoop-3.2
   export SPARK_PREPEND_CLASSES=true
   bin/spark-shell --conf spark.sql.statistics.size.autoUpdate.enabled=true --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true 
   ```
   ```scala
   sc.setLogLevel("INFO")
   spark.sql("create table t1(id int) using hive")
   spark.sql("insert into t1 values(1)")
   
   ...
   19/05/03 21:38:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 468 ms on localhost (executor driver) (1/1)
   ...
   19/05/03 21:38:53 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=alter_table: db=default tbl=t1 newtbl=t1
   19/05/03 21:38:53 INFO log: Updating table stats fast for t1
   19/05/03 21:38:53 INFO log: Updated size of table t1 to 2
   ...
   19/05/03 21:38:53 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=alter_table: db=default tbl=t1 newtbl=t1
   19/05/03 21:38:53 INFO log: Updating table stats fast for t1
   19/05/03 21:38:53 INFO log: Updated size of table t1 to 2
   ```
   The log shows that the size update (`Updated size of table t1 to 2`) ran twice for a single insert.
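   
   To count the updates instead of reading the log by eye, something like the following may help. It is only a sketch: the temporary file stands in for a saved copy of the `spark-shell` output (an assumption, the original session was not captured to a file), and the sample lines are copied from the log above.
   ```shell
   # Hypothetical stand-in for a saved spark-shell log; lines copied from the output above.
   log_file="$(mktemp)"
   printf '%s\n' \
     '19/05/03 21:38:53 INFO log: Updating table stats fast for t1' \
     '19/05/03 21:38:53 INFO log: Updated size of table t1 to 2' \
     '19/05/03 21:38:53 INFO log: Updating table stats fast for t1' \
     '19/05/03 21:38:53 INFO log: Updated size of table t1 to 2' \
     > "$log_file"
   
   # Count the lines that report a size update for table t1.
   count="$(grep -c 'Updated size of table t1 to' "$log_file")"
   echo "size updates for t1: $count"
   rm -f "$log_file"
   ```
   A count greater than 1 for a single `insert` indicates the repeated statistics calculation described here.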
   
   This PR sets `hasFollowingStatsTask` based on `AUTO_SIZE_UPDATE_ENABLED` to avoid calculating the table statistics repeatedly.
   
   ## How was this patch tested?
   
   Manual tests:
   ```shell
   build/sbt clean package -Phive -Phadoop-3.2
   export SPARK_PREPEND_CLASSES=true
   bin/spark-shell --conf spark.sql.statistics.size.autoUpdate.enabled=true --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true 
   ```
   ```scala
   sc.setLogLevel("INFO")
   spark.sql("create table t1(id int) using hive")
   spark.sql("insert into t1 values(1)")
   ```
   With this patch, the log shows `Updated size of table t1 to 2` only once.
   
   
