Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/10/16 07:35:39 UTC
[GitHub] spark pull request #22743: [WIP][SPARK-25740][SQL] Set some configuration ne...
GitHub user wangyum opened a pull request:
https://github.com/apache/spark/pull/22743
[WIP][SPARK-25740][SQL] Set some configuration need invalidateStatsCache
## What changes were proposed in this pull request?
How to reproduce:
```sql
-- Session 1 (spark-sql): create and populate the tables
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
explain select * from t1, t2 where t1.a = t2.a;
exit;
```
```sql
-- Session 2 (new spark-sql session): enable the fallback before the first query
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- The plan uses BroadcastHashJoin
```
```sql
-- Session 3 (new spark-sql session): run the query first, then enable the fallback
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- Still SortMergeJoin; it should be BroadcastHashJoin
```
We need `LogicalPlanStats.invalidateStatsCache` to clear the cached stats when a `SET spark.sql.statistics.fallBackToHdfs` command is executed, but it seems the only thing we can do is `invalidateAllCachedTables`.
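To see why the size estimate decides the plan, here is a minimal, Spark-free sketch. It assumes a simplified rule: a join side is broadcast when its estimated size is at most `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), and a table without stats is estimated at `spark.sql.defaultSizeInBytes` (`Long.MaxValue` by default) unless the HDFS fallback is enabled. `Conf`, `estimateSize`, `chooseJoin`, and the 1024-byte table size are hypothetical names and values for illustration, not Spark APIs.

```scala
object JoinChoiceSketch {
  // Simplified stand-ins for the relevant session settings (not Spark's SQLConf).
  final case class Conf(
      fallBackToHdfs: Boolean = false,
      defaultSizeInBytes: BigInt = BigInt(Long.MaxValue),
      autoBroadcastJoinThreshold: BigInt = BigInt(10L * 1024 * 1024))

  // Size estimate for a table that has no stats in the metastore.
  // `sizeOnDisk` is a hypothetical value standing in for a file-system lookup.
  def estimateSize(conf: Conf, sizeOnDisk: BigInt): BigInt =
    if (conf.fallBackToHdfs) sizeOnDisk else conf.defaultSizeInBytes

  // Simplified rule: broadcast when the estimate fits under the threshold.
  def chooseJoin(conf: Conf, estimatedSize: BigInt): String =
    if (estimatedSize <= conf.autoBroadcastJoinThreshold) "BroadcastHashJoin"
    else "SortMergeJoin"

  def main(args: Array[String]): Unit = {
    val sizeOnDisk = BigInt(1024) // hypothetical size of a tiny Parquet table
    for (fallback <- Seq(false, true)) {
      val conf = Conf(fallBackToHdfs = fallback)
      val join = chooseJoin(conf, estimateSize(conf, sizeOnDisk))
      println(s"fallBackToHdfs=$fallback -> $join")
    }
  }
}
```

Under these assumptions the same tiny table yields SortMergeJoin with the fallback off and BroadcastHashJoin with it on, so a stale estimate is enough to keep the planner on the slower join.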
## How was this patch tested?
manual tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-25740
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22743.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22743
----
commit cf43e225c9da4f1274c7c82b568a89b3369e3515
Author: Yuming Wang <yu...@...>
Date: 2018-10-16T07:27:03Z
Set some configuration need invalidateStatsCache
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test FAILed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22743
Yes, you are right. If a data source table's stats are empty, `DetermineTableStats` doesn't set stats for it, so this is only a problem for Hive tables.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97515/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97515/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97515/
Test FAILed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97522/
Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22743
Data source tables are not cached in [tableRelationCache](https://github.com/apache/spark/blob/01c3dfab158d40653f8ce5d96f57220297545d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L134).
For Hive tables, this only occurs when the table's stats are empty and `spark.sql.hive.convertMetastoreParquet` is enabled (the default). When we then read the table, Spark will [convertToLogicalRelation](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L116) and [cache it](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L207).
Empty stats occur in at least two situations:
1. The table is created as a Hive table with `spark.sql.hive.convertMetastoreParquet` enabled (the default) and `spark.sql.statistics.size.autoUpdate.enabled` disabled (the default), and data is then inserted.
2. The table is managed by Hive and stats were never gathered.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22743
Why is it only a problem for Hive tables?
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4079/
Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22743
> Data source tables are not cached in tableRelationCache.
I don't think so. Spark caches data source tables in `FindDataSourceTable`.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4026/
Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22743
This happens when a table's `LogicalRelation` has already been cached. Changing `spark.sql.statistics.fallBackToHdfs` or `spark.sql.defaultSizeInBytes` afterwards has no effect on the stats, because Spark keeps using the stats already cached in the `LogicalRelation`. Here is an example:
```scala
import org.apache.spark.sql.catalyst.QualifiedTableName
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.execution.datasources.LogicalRelation
spark.sql("CREATE TABLE t1 (c1 bigint) STORED AS PARQUET")
spark.sql("INSERT INTO TABLE t1 VALUES (1)")
spark.sql("REFRESH TABLE t1")
val catalog = spark.sessionState.catalog
val qualifiedTableName = QualifiedTableName(catalog.getCurrentDatabase, "t1")
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res4: BigInt = 9223372036854775807
spark.sql("set spark.sql.statistics.fallBackToHdfs=true")
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res7: BigInt = 9223372036854775807
// The size should now be computed from the file system, but it is still 9223372036854775807
spark.sql("REFRESH TABLE t1")
spark.sql("SELECT * from t1").collect()
val cachedRelation = catalog.getCachedTable(qualifiedTableName)
cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
// res10: BigInt = 708
// After refreshing the table, the size is correct.
```
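The behaviour can be mimicked without Spark by a toy model of a relation cache, offered here only as a sketch of the mechanism described above. It assumes stats are computed once, on a cache miss, from whatever the config says at that moment. `Relation`, `lookup`, `refresh`, and the mutable `fallBackToHdfs` flag are hypothetical stand-ins, not Spark classes or methods.

```scala
import scala.collection.mutable

object StaleStatsSketch {
  // Hypothetical stand-in for a resolved relation that carries its own stats.
  final case class Relation(table: String, sizeInBytes: BigInt)

  // Mutable flag standing in for spark.sql.statistics.fallBackToHdfs.
  var fallBackToHdfs: Boolean = false

  // Stand-in for the relation cache, keyed by table name.
  private val relationCache = mutable.Map.empty[String, Relation]

  // Stats are computed only on a cache miss, using the config at that moment.
  def lookup(table: String, sizeOnDisk: BigInt): Relation =
    relationCache.getOrElseUpdate(table, {
      val size = if (fallBackToHdfs) sizeOnDisk else BigInt(Long.MaxValue)
      Relation(table, size)
    })

  // Stand-in for REFRESH TABLE: drop the cached entry so it is rebuilt.
  def refresh(table: String): Unit = {
    relationCache.remove(table)
    ()
  }

  def main(args: Array[String]): Unit = {
    val onDisk = BigInt(708) // the file size reported in the example above
    println(lookup("t1", onDisk).sizeInBytes) // default size, now cached
    fallBackToHdfs = true
    println(lookup("t1", onDisk).sizeInBytes) // unchanged: served from the cache
    refresh("t1")
    println(lookup("t1", onDisk).sizeInBytes) // recomputed after the refresh
  }
}
```

In this model the config change alone never reaches the cached entry; only an explicit invalidation does, which is the gap this pull request is trying to close.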
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4075/
Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97442/
Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97442/testReport)** for PR 22743 at commit [`cf43e22`](https://github.com/apache/spark/commit/cf43e225c9da4f1274c7c82b568a89b3369e3515).
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22743
retest this please
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:
https://github.com/apache/spark/pull/22743
cc @cloud-fan
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/22743
can you explain more about how this happens?
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4074/
Test PASSed.
---
[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97442/testReport)** for PR 22743 at commit [`cf43e22`](https://github.com/apache/spark/commit/cf43e225c9da4f1274c7c82b568a89b3369e3515).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97517/
Test FAILed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test FAILed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22743
Merged build finished. Test PASSed.
---
[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/22743
**[Test build #97522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---