You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by jinxing64 <gi...@git.apache.org> on 2017/10/23 13:05:06 UTC

[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from HDFS in ...

GitHub user jinxing64 opened a pull request:

    https://github.com/apache/spark/pull/19560

    [SPARK-22334][SQL] Check table size from HDFS in case the size in metastore is wrong.

    ## What changes were proposed in this pull request?
    
    Currently we use table properties('totalSize') to judge if to use broadcast join. Table properties are from metastore. However they can be wrong. Hive sometimes fails to update table properties after producing data successfully(e,g, NameNode timeout from https://github.com/apache/hive/blob/branch-1.2/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L180). If 'totalSize' in table properties is much smaller than its real size on HDFS, Spark can launch broadcast join by mistake and suffer OOM.
    Could we add a defense config and check the size from HDFS when 'totalSize' is below `spark.sql.autoBroadcastJoinThreshold`
    
    ## How was this patch tested?
    
    Tests added

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinxing64/spark SPARK-22334

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19560.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19560
    
----
commit 98975adf129c2359156e97e338f0e3e4f623372b
Author: jinxing <ji...@126.com>
Date:   2017-10-23T13:02:42Z

    Check table size from HDFS in case the size in metastore is wrong.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #88775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    > Users always do not know there's error in stats.
    
    Isn't there any exceptions or error messages when updating table/stats fails? I suppose the system is able to know it through logging or protect it by redo mechanism?
    
    > Help to avoid OOM caused by broadcast join. It's for stability.
    
    This config can't avoid OOM anyway becuase file size is different from memory usage when broadcast join.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    @wzhfy
    Thanks for comment;
    I know your point.
    In my cluster, namenode is under heavy pressure. Errors in stats happen with big chance. Users always do not know there's error in stats. That's why I propose this config. Users can chose to turn on this config by default. Yes, it will hit performance. But it's only when Join and the `totalSize` from metastore is below `spark.sql.autoBroadcastJoinThreshold`, which I think is acceptable.
    Like I mentioned in description, this is for defense. Help to avoid OOM caused by broadcast join. It's for stability.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from HDFS in ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146319070
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -187,6 +187,15 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val VERIFY_STATS_FROM_HDFS_WHEN_BROADCASTJOIN =
    +    buildConf("spark.sql.statistics.verifyStatsFromHdfsWhenBroadcastJoin")
    +    .doc("If table size in metastore is below spark.sql.autoBroadcastJoinThreshold, check the " +
    +      "size on Hdfs and set table size to be the bigger one. This is for defense and help avoid" +
    +      " OOM caused by broadcast join. It's useful when metastore failed to update the stats of" +
    --- End diff --
    
    Hi, @jinxing64 .
    Is this helpful for highly compressed Parquet tables, too?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88775/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    I wonder when this config should be used. If user knows there's some error in stats, why not just analyze the table (specify "noscan" if only size is needed)? This can fix the problem instead of verifying the stats every time analyze a query. If user doesn't know, then he is also not sure when to open it, because stats verification can hit performance.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from HDFS in ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146351183
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -187,6 +187,15 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val VERIFY_STATS_FROM_HDFS_WHEN_BROADCASTJOIN =
    +    buildConf("spark.sql.statistics.verifyStatsFromHdfsWhenBroadcastJoin")
    +    .doc("If table size in metastore is below spark.sql.autoBroadcastJoinThreshold, check the " +
    +      "size on Hdfs and set table size to be the bigger one. This is for defense and help avoid" +
    --- End diff --
    
    It might not be from HDFS.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #83008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    @wangyum 
    Make sense.
    You can also try approach in this pr. 
    If there are many(tens of thousands of) ETLs in the warehouse, we cannot afford to give that many hints or fix all the inaccurate table properties in metastore.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    My main concern is, we'd better not to put burden on Spark to deal with metastore failures, because Spark doesn't have control on metastores. The system using Spark and metastore should be responsible for consistency.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from HDFS in case th...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #83008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83008/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91092/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from HDFS in case th...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #82986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82986/testReport)** for PR 19560 at commit [`98975ad`](https://github.com/apache/spark/commit/98975adf129c2359156e97e338f0e3e4f623372b).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #83002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83002/testReport)** for PR 19560 at commit [`bf59c27`](https://github.com/apache/spark/commit/bf59c27d0a8a01dc0572cf148f40b6337799f241).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #91092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    I can see the value and also the potential extra overhead (more expensive for object stores), although this does not resolve the root cause. 
    
    Before we providing adaptive runtime optimization and incremental online stats collection, this might be a workaround solution for avoiding some OOM cases. Let us leave this PR open and see the feedbacks from the others


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #91092 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83008/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146448519
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -187,6 +187,15 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val VERIFY_STATS_FROM_FILESYSTEM_WHEN_BROADCASTJOIN =
    --- End diff --
    
    This config name implies it only does verification when broadcast join. However, seems that it verifies the statistics no matter doing broadcast join or not. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    >My main concern is, we'd better not to put burden on Spark to deal with metastore failures
    
    I think this make sense. I was also thinking about this when proposing this pr. I do agree with you on some level. But in the product env, reasons of failing to update the stats seems various and we find it hard to build a strong redo mechanism. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83002/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    @gatorsmile @dongjoon-hyun 
    
    Thanks a lot for looking into this.
    This pr aims to avoid OOM if metastore fails to update table properties after the data is already produced. With the config in this pr enabled, we check the size on filesystem only when `totalSize` is below `spark.sql.autoBroadcastJoinThreshold`, so I think the cost can be acceptable.
    
    Yes, the storage can be other filesystems. I refined the name. Please take a look again when you have time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from HDFS in case th...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #82986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82986/testReport)** for PR 19560 at commit [`98975ad`](https://github.com/apache/spark/commit/98975adf129c2359156e97e338f0e3e4f623372b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from HDFS in case th...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82986/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from HDFS in case th...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #83002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83002/testReport)** for PR 19560 at commit [`bf59c27`](https://github.com/apache/spark/commit/bf59c27d0a8a01dc0572cf148f40b6337799f241).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    I also hint this issues:
    ```sql
    select * from A join B on a.key = b.key
    ```
    table A is small but table B is big and table B's stats are incorrect. so It will Broadcast table B.
    
    I try to use Broadcast hint to solve this issues:
    ```sql
    select /*+ MAPJOIN(A) */ * from A join B on a.key = b.key
    ```
    But it doesn't work. I create a pr to fix it: https://github.com/apache/spark/pull/19714


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    @viirya
    Thanks a lot for comments.
    1. In current change, I verify the stats from file system only when the relation is under join.
    2. I added a warning when the size from file system is smaller than from metastore.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146448976
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -120,22 +120,41 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
             if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
           val table = relation.tableMeta
           val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
    -        try {
    -          val hadoopConf = session.sessionState.newHadoopConf()
    -          val tablePath = new Path(table.location)
    -          val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
    -          fs.getContentSummary(tablePath).getLength
    -        } catch {
    -          case e: IOException =>
    -            logWarning("Failed to get table size from hdfs.", e)
    -            session.sessionState.conf.defaultSizeInBytes
    -        }
    +        getSizeFromHdfs(table.location)
           } else {
             session.sessionState.conf.defaultSizeInBytes
           }
     
           val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
           relation.copy(tableMeta = withStats)
    +
    +    case relation: HiveTableRelation
    +        if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.nonEmpty &&
    +          session.sessionState.conf.verifyStatsFromFileSystemWhenBroadcastJoin &&
    +          relation.tableMeta.stats.get.sizeInBytes <
    +            session.sessionState.conf.autoBroadcastJoinThreshold =>
    +      val table = relation.tableMeta
    +      val sizeInBytes = getSizeFromHdfs(table.location)
    --- End diff --
    
    If the metadata statistics are wrong, getting the size from files every time seems a burden. Can we show some message to users and suggest them to update table statistics?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19560
  
    **[Test build #88775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88775/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from filesyst...

Posted by jinxing64 <gi...@git.apache.org>.
Github user jinxing64 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146449741
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
    @@ -120,22 +120,41 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
             if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.isEmpty =>
           val table = relation.tableMeta
           val sizeInBytes = if (session.sessionState.conf.fallBackToHdfsForStatsEnabled) {
    -        try {
    -          val hadoopConf = session.sessionState.newHadoopConf()
    -          val tablePath = new Path(table.location)
    -          val fs: FileSystem = tablePath.getFileSystem(hadoopConf)
    -          fs.getContentSummary(tablePath).getLength
    -        } catch {
    -          case e: IOException =>
    -            logWarning("Failed to get table size from hdfs.", e)
    -            session.sessionState.conf.defaultSizeInBytes
    -        }
    +        getSizeFromHdfs(table.location)
           } else {
             session.sessionState.conf.defaultSizeInBytes
           }
     
           val withStats = table.copy(stats = Some(CatalogStatistics(sizeInBytes = BigInt(sizeInBytes))))
           relation.copy(tableMeta = withStats)
    +
    +    case relation: HiveTableRelation
    +        if DDLUtils.isHiveTable(relation.tableMeta) && relation.tableMeta.stats.nonEmpty &&
    +          session.sessionState.conf.verifyStatsFromFileSystemWhenBroadcastJoin &&
    +          relation.tableMeta.stats.get.sizeInBytes <
    +            session.sessionState.conf.autoBroadcastJoinThreshold =>
    +      val table = relation.tableMeta
    +      val sizeInBytes = getSizeFromHdfs(table.location)
    --- End diff --
    
    Yes, I think it's good idea.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19560: [SPARK-22334][SQL] Check table size from HDFS in ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19560#discussion_r146351097
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -187,6 +187,15 @@ object SQLConf {
         .booleanConf
         .createWithDefault(false)
     
    +  val VERIFY_STATS_FROM_HDFS_WHEN_BROADCASTJOIN =
    +    buildConf("spark.sql.statistics.verifyStatsFromHdfsWhenBroadcastJoin")
    +    .doc("If table size in metastore is below spark.sql.autoBroadcastJoinThreshold, check the " +
    +      "size on Hdfs and set table size to be the bigger one. This is for defense and help avoid" +
    +      " OOM caused by broadcast join. It's useful when metastore failed to update the stats of" +
    --- End diff --
    
    @dongjoon-hyun It does not solve that issue. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org