Posted to reviews@spark.apache.org by wangyum <gi...@git.apache.org> on 2018/10/16 07:35:39 UTC

[GitHub] spark pull request #22743: [WIP][SPARK-25740][SQL] Set some configuration ne...

GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/22743

    [WIP][SPARK-25740][SQL] Set some configuration need invalidateStatsCache

    ## What changes were proposed in this pull request?
    How to reproduce:
    ```sql
    -- new spark-sql session
    create table t1 (a int) stored as parquet;
    create table t2 (a int) stored as parquet;
    insert into table t1 values (1);
    insert into table t2 values (1);
    explain select * from t1, t2 where t1.a = t2.a;
    exit;
    ```
    ```sql
    -- new spark-sql session
    set spark.sql.statistics.fallBackToHdfs=true;
    explain select * from t1, t2 where t1.a = t2.a;
    -- The plan uses BroadcastHashJoin
    ```
    ```sql
    -- new spark-sql session
    explain select * from t1, t2 where t1.a = t2.a;
    -- The plan uses SortMergeJoin
    set spark.sql.statistics.fallBackToHdfs=true;
    explain select * from t1, t2 where t1.a = t2.a;
    -- Still SortMergeJoin; it should be BroadcastHashJoin
    ```
    We need `LogicalPlanStats.invalidateStatsCache` to clear the cached stats when a `SET spark.sql.statistics.fallBackToHdfs` command is executed, but it seems the only thing we can do is `invalidateAllCachedTables`.
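    
    The staleness described above can be illustrated with a small toy model. This is only a sketch, not Spark code: `ToyConf` and `ToyCatalog` are hypothetical names, and the size of 708 bytes is an arbitrary stand-in for a size read from the file system. It shows a value that is computed once under one setting, survives a configuration change, and is only recomputed after the cache entry is dropped.
    
    ```scala
    import scala.collection.mutable

    // Toy model only: ToyConf and ToyCatalog are hypothetical, not Spark classes.
    final class ToyConf { var fallBackToHdfs: Boolean = false }

    final class ToyCatalog(conf: ToyConf, sizeOnDisk: Long) {
      private val cache = mutable.Map.empty[String, BigInt]

      // The size is computed once, using the configuration in effect at that moment.
      def sizeInBytes(table: String): BigInt =
        cache.getOrElseUpdate(
          table,
          if (conf.fallBackToHdfs) BigInt(sizeOnDisk) else BigInt(Long.MaxValue))

      // Dropping the entry forces the next lookup to recompute with the current configuration.
      def invalidate(table: String): Unit = { cache.remove(table); () }
    }

    object StaleStatsDemo {
      def main(args: Array[String]): Unit = {
        val conf = new ToyConf
        val catalog = new ToyCatalog(conf, sizeOnDisk = 708L)

        val before = catalog.sizeInBytes("t1") // cached while fallBackToHdfs is false
        conf.fallBackToHdfs = true
        val stale = catalog.sizeInBytes("t1")  // still served from the cache
        catalog.invalidate("t1")
        val fresh = catalog.sizeInBytes("t1")  // recomputed under the new setting

        println(s"before=$before stale=$stale fresh=$fresh")
      }
    }
    ```
    
    In this model, invalidating a single entry is enough; whether Spark can do the same without invalidating all cached tables is exactly the open question of this PR.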
    ## How was this patch tested?
    
    manual tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25740

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22743.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22743
    
----
commit cf43e225c9da4f1274c7c82b568a89b3369e3515
Author: Yuming Wang <yu...@...>
Date:   2018-10-16T07:27:03Z

    Set some configuration need invalidateStatsCache

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Yes, you are right. If a datasource table's stats are empty, `DetermineTableStats` doesn't set stats for it, so this is only a problem for Hive tables.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97515 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97515/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97515/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97515/
    Test FAILed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97522/
    Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Datasource tables are not cached in [tableRelationCache](https://github.com/apache/spark/blob/01c3dfab158d40653f8ce5d96f57220297545d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L134).
    For Hive tables, this only occurs when the table's stats are empty and `spark.sql.hive.convertMetastoreParquet` is enabled (the default). Then, when we read the table, Spark will [convertToLogicalRelation](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L116) and [cache it](https://github.com/apache/spark/blob/a2f502cf53b6b00af7cb80b6f38e64cf46367595/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L207).
    
    Empty stats occur in at least 2 situations:
    1. The table is created as a Hive table, `spark.sql.hive.convertMetastoreParquet` is enabled (the default), `spark.sql.statistics.size.autoUpdate.enabled` is disabled (the default), and data is then inserted.
    2. The table is managed by Hive and stats were never gathered.
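    
    What happens to a table with empty stats can be sketched as a three-way decision. The helper below is hypothetical (`SizeFallback.estimateSize` is not a Spark API) and only models the behaviour discussed in this thread: recorded stats win, otherwise the size is read from the file system when `spark.sql.statistics.fallBackToHdfs` is on, otherwise `spark.sql.defaultSizeInBytes` is used.
    
    ```scala
    // Hypothetical helper, not a Spark API: models the decision discussed in this thread.
    object SizeFallback {
      def estimateSize(
          catalogStats: Option[BigInt], // size recorded in the metastore, if any
          fallBackToHdfs: Boolean,      // models spark.sql.statistics.fallBackToHdfs
          sizeOnDisk: => Long,          // by-name: evaluated only when the fallback is used
          defaultSizeInBytes: Long      // models spark.sql.defaultSizeInBytes
      ): BigInt = catalogStats match {
        case Some(size)             => size
        case None if fallBackToHdfs => BigInt(sizeOnDisk)
        case None                   => BigInt(defaultSizeInBytes)
      }

      def main(args: Array[String]): Unit = {
        // Empty stats, fallback off: the default size is used.
        println(estimateSize(None, fallBackToHdfs = false,
          sizeOnDisk = 708L, defaultSizeInBytes = Long.MaxValue))
        // Empty stats, fallback on: the on-disk size is used.
        println(estimateSize(None, fallBackToHdfs = true,
          sizeOnDisk = 708L, defaultSizeInBytes = Long.MaxValue))
        // Recorded stats take precedence over both settings.
        println(estimateSize(Some(BigInt(1024)), fallBackToHdfs = true,
          sizeOnDisk = 708L, defaultSizeInBytes = Long.MaxValue))
      }
    }
    ```
    
    Both situations listed above land in the `None` branches, which is why the two configuration values matter only for tables without recorded stats.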


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Why is it only a problem for Hive tables?


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4079/
    Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    > Datasource table will not cache in tableRelationCache.
    
    I don't think so. Spark caches data source tables in `FindDataSourceTable`.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4026/
    Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    This happens when a table's `LogicalRelation` has already been cached: changing `spark.sql.statistics.fallBackToHdfs` or `spark.sql.defaultSizeInBytes` afterwards has no effect on the stats, because Spark always uses the stats already cached in the `LogicalRelation`. Here is an example:
    
    ```scala
    import org.apache.spark.sql.catalyst.QualifiedTableName
    import org.apache.spark.sql.catalyst.catalog.SessionCatalog
    import org.apache.spark.sql.execution.datasources.LogicalRelation
    
    spark.sql("CREATE TABLE t1 (c1 bigint) STORED AS PARQUET")
    spark.sql("INSERT INTO TABLE t1 VALUES (1)")
    spark.sql("REFRESH TABLE t1")
    
    val catalog = spark.sessionState.catalog
    val qualifiedTableName = QualifiedTableName(catalog.getCurrentDatabase, "t1")
    
    spark.sql("SELECT * from t1").collect()
    val cachedRelation = catalog.getCachedTable(qualifiedTableName)
    cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
    // res4: BigInt = 9223372036854775807
    
    spark.sql("set spark.sql.statistics.fallBackToHdfs=true")
    spark.sql("SELECT * from t1").collect()
    val cachedRelation = catalog.getCachedTable(qualifiedTableName)
    cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
    // res7: BigInt = 9223372036854775807
    // It should be computed from the file system, but it is still 9223372036854775807
    
    spark.sql("REFRESH TABLE t1")
    spark.sql("SELECT * from t1").collect()
    val cachedRelation = catalog.getCachedTable(qualifiedTableName)
    cachedRelation.asInstanceOf[LogicalRelation].catalogTable.get.stats.get.sizeInBytes
    // res10: BigInt = 708
    // After refreshing the table, the value is correct.
    ```
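    
    The two sizes printed above also suggest why the join strategy in the PR description differs. The sketch below is a deliberate simplification, not Spark's planner: `JoinChoice` is a hypothetical name, and it only compares an estimated size against a threshold, assuming the documented 10 MB default of `spark.sql.autoBroadcastJoinThreshold`. Real join selection also considers the join type, hints and other settings.
    
    ```scala
    // Simplified illustration; JoinChoice is hypothetical and not part of Spark.
    object JoinChoice {
      // 10 MB, the documented default of spark.sql.autoBroadcastJoinThreshold.
      val DefaultBroadcastThreshold: Long = 10L * 1024 * 1024

      // A side is treated as broadcastable when its estimated size is within the threshold.
      def canBroadcast(sizeInBytes: BigInt, threshold: Long = DefaultBroadcastThreshold): Boolean =
        sizeInBytes >= 0 && sizeInBytes <= BigInt(threshold)

      def chooseJoin(left: BigInt, right: BigInt): String =
        if (canBroadcast(left) || canBroadcast(right)) "BroadcastHashJoin" else "SortMergeJoin"

      def main(args: Array[String]): Unit = {
        // Stale default-sized stats on both sides.
        println(chooseJoin(BigInt(Long.MaxValue), BigInt(Long.MaxValue)))
        // File-system sized stats on both sides.
        println(chooseJoin(BigInt(708), BigInt(708)))
      }
    }
    ```
    
    With the stale value of 9223372036854775807 neither side fits under the threshold, while a size on the order of 708 bytes does.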


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4075/
    Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97442/
    Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97442/testReport)** for PR 22743 at commit [`cf43e22`](https://github.com/apache/spark/commit/cf43e225c9da4f1274c7c82b568a89b3369e3515).


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    retest this please


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    cc @cloud-fan 


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Can you explain more about how this happens?


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97517 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97517/testReport)** for PR 22743 at commit [`c32a2a9`](https://github.com/apache/spark/commit/c32a2a976718fcd1d7c92bb2310e463b7edff478).


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4074/
    Test PASSed.


---



[GitHub] spark issue #22743: [WIP][SPARK-25740][SQL] Set some configuration need inva...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97442/testReport)** for PR 22743 at commit [`cf43e22`](https://github.com/apache/spark/commit/cf43e225c9da4f1274c7c82b568a89b3369e3515).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97517/
    Test FAILed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #22743: [SPARK-25740][SQL] Refactor DetermineTableStats to inval...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22743
  
    **[Test build #97522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97522/testReport)** for PR 22743 at commit [`206743c`](https://github.com/apache/spark/commit/206743cef96e536783a315785739af16f845f5c1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
