Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/07 09:52:10 UTC

[GitHub] [spark] wangyum opened a new pull request #24003: [SPARK-19678][FOLLOW-UP][SQL] Add behavior change test when table statistics are incorrect

URL: https://github.com/apache/spark/pull/24003
 
 
   ## What changes were proposed in this pull request?
   
   Since Spark 2.2.0 ([SPARK-19678](https://issues.apache.org/jira/browse/SPARK-19678)), the query below is planned as a `sort merge join` instead of a `broadcast join` when the small table's statistics are incorrect:
   ```sql
   -- small external table with incorrect statistics
   CREATE EXTERNAL TABLE t1(c1 int)
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   WITH SERDEPROPERTIES (
     'serialization.format' = '1'
   )
   STORED AS
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION 'file:///tmp/t1'
   TBLPROPERTIES (
   'rawDataSize'='-1', 'numFiles'='0', 'totalSize'='0', 'COLUMN_STATS_ACCURATE'='false', 'numRows'='-1'
   );
   
   -- big table
   CREATE TABLE t2 (c1 int)
   LOCATION 'file:///tmp/t2'
   TBLPROPERTIES (
   'rawDataSize'='23437737', 'numFiles'='12222', 'totalSize'='333442230', 'COLUMN_STATS_ACCURATE'='false', 'numRows'='443442223'
   );
   
   EXPLAIN SELECT t1.c1 FROM t1 INNER JOIN t2 ON t1.c1 = t2.c1;
   ```
   This PR adds a test case for this behavior change.
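
   For readers affected by this change, the statements below are a rough sketch of how one might inspect the plan inputs and steer the join back to a broadcast. They are not part of this patch. They assume the `t1` and `t2` tables defined above already exist and that recomputing statistics for `t1` is acceptable; whether each step changes the plan depends on the Spark version and configuration.
   ```sql
   -- Show the table metadata, including any statistics recorded in the metastore
   DESCRIBE EXTENDED t1;

   -- Print the current broadcast threshold (in bytes) used by the planner
   SET spark.sql.autoBroadcastJoinThreshold;

   -- Option 1 (assumption: a scan of t1 is affordable): recompute the table statistics
   ANALYZE TABLE t1 COMPUTE STATISTICS;

   -- Option 2: ask for a broadcast of t1 explicitly with a join hint
   EXPLAIN SELECT /*+ BROADCAST(t1) */ t1.c1 FROM t1 INNER JOIN t2 ON t1.c1 = t2.c1;
   ```
   Running the original `EXPLAIN` again after either option shows which join strategy the planner selects.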
   
   ## How was this patch tested?
   
   unit tests
   
