Posted to reviews@spark.apache.org by Parth-Brahmbhatt <gi...@git.apache.org> on 2016/05/17 17:31:47 UTC

[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

GitHub user Parth-Brahmbhatt opened a pull request:

    https://github.com/apache/spark/pull/13150

    [SPARK-15365] [SQL]: When table size statistics are not available from metastore, we should fallback to HDFS

    ## What changes were proposed in this pull request?
    Currently, when a table is used in a join, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables whose statistics are available in the metastore. Hive generally falls back to HDFS when statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of broadcast joins.
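
    For illustration, a minimal sketch of the intended fallback order (the helper and parameter names are hypothetical; the real change lives in `MetastoreRelation.scala`, see the review diffs below):

    ```
    import java.io.IOException
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical helper: prefer metastore stats, then HDFS, then the default.
    def estimateTableSize(
        totalSize: String,     // metastore `totalSize` parameter, may be null
        rawDataSize: String,   // metastore `rawDataSize` parameter, may be null
        tablePath: Path,
        hadoopConf: Configuration,
        defaultSizeInBytes: Long): Long = {
      if (totalSize != null && totalSize.toLong > 0L) {
        totalSize.toLong
      } else if (rawDataSize != null && rawDataSize.toLong > 0L) {
        rawDataSize.toLong
      } else {
        try {
          // One FileSystem RPC per table; this only feeds the join-strategy heuristic.
          tablePath.getFileSystem(hadoopConf).getContentSummary(tablePath).getLength
        } catch {
          case _: IOException => defaultSizeInBytes // estimation only, so non-fatal
        }
      }
    }
    ```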
    
    
    ## How was this patch tested?
    I have executed queries locally to test this change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Parth-Brahmbhatt/spark SPARK-15365

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13150
    
----
commit 702781d188846faceca9cd529984b21fa5b0d724
Author: Parth Brahmbhatt <pb...@netflix.com>
Date:   2016-05-17T17:29:10Z

    [SPARK-15365] [SQL]: When table size statistics are not available from metastore, fall back to HDFS.

----




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221390733
  
    **[Test build #59215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59215/consoleFull)** for PR 13150 at commit [`f5d5dde`](https://github.com/apache/spark/commit/f5d5ddec9386489400ca947f2c095086e757e3c4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63968104
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    +                    log.warn("Failed to get table size from hdfs.", e)
    --- End diff --
    
    updated.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221367988
  
    **[Test build #59215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59215/consoleFull)** for PR 13150 at commit [`f5d5dde`](https://github.com/apache/spark/commit/f5d5ddec9386489400ca947f2c095086e757e3c4).




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r64503327
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +117,31 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    -        Option(totalSize).map(_.toLong).filter(_ > 0)
    -          .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
    +        if (Option(totalSize).map(_.toLong).getOrElse(0L) > 0) {
    --- End diff --
    
    Done.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221356370
  
    LGTM




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63628099
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    --- End diff --
    
    I hope this link is helpful: https://github.com/apache/spark/pull/12256




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-219793393
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r64504540
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
    @@ -68,6 +71,51 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton {
           classOf[AnalyzeTableCommand])
       }
     
    +  test("MetastoreRelations fallback to HDFS for size estimation") {
    +    val enableFallBackToHdfsForStats = hiveContext.conf.fallBackToHdfsForStatsEnabled
    +    try {
    +      withTempDir { tempDir =>
    +
    +        // EXTERNAL OpenCSVSerde table pointing to LOCATION
    +
    +        val file1 = new File(tempDir + "/data1")
    +        val writer1 = new PrintWriter(file1)
    +        writer1.write("1,2")
    +        writer1.close()
    +
    +        val file2 = new File(tempDir + "/data2")
    +        val writer2 = new PrintWriter(file2)
    +        writer2.write("1,2")
    +        writer2.close()
    +
    +        sql(
    +          s"""CREATE EXTERNAL TABLE csv_table(page_id INT, impressions INT)
    +        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    --- End diff --
    
    need to indent this properly. I can fix this when I merge if Jenkins passes.





[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221452802
  
    **[Test build #3017 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3017/consoleFull)** for PR 13150 at commit [`ff69f91`](https://github.com/apache/spark/commit/ff69f91e273580547b6d86a6dfc87f2a94066507).




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63968095
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    --- End diff --
    
    Resolved.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63627480
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    --- End diff --
    
    What alternative would be more readable/less confusing?




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13150




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63633820
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    --- End diff --
    
    Yeah, there is nothing wrong with `getOrElse` per se, but chaining a lot of them together (along with `Option.filter`) can get really confusing. While you are at this, it'd be better to fix it.
    
    Expanding it into a longer form (and maybe using some imperative-style code) could make it less confusing.
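
    For illustration, a sketch of the two styles side by side (the values `a`, `b`, and `default` are hypothetical):

    ```
    val a: String = null
    val b: String = "42"
    val default = 0L

    // Chained: each fallback nests one level deeper.
    val chained: Long = Option(a).map(_.toLong).filter(_ > 0)
      .getOrElse(Option(b).map(_.toLong).filter(_ > 0)
        .getOrElse(default))

    // Imperative: the same precedence reads top to bottom.
    val imperative: Long =
      if (a != null && a.toLong > 0) a.toLong
      else if (b != null && b.toLong > 0) b.toLong
      else default
    ```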





[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221452537
  
    Great - thanks.
    
    Jenkins, test this please.





[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221026594
  
    @sameeragarwal @rxin FYI, GitHub currently has some latency issues, so you probably can't see the updates yet.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221467315
  
    @Parth-Brahmbhatt you should add the email address you used in your commit to your GitHub profile, so the commit is associated with your account. Thanks.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r64484127
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +117,31 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    -        Option(totalSize).map(_.toLong).filter(_ > 0)
    -          .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
    +        if (Option(totalSize).map(_.toLong).getOrElse(0L) > 0) {
    --- End diff --
    
    can we write something like this to make it easier to read?
    
    ```
    if (totalSize != null && totalSize.toLong > 0L) {
      totalSize.toLong
    } else if (rawDataSize != null && rawDataSize.toLong > 0) {
      rawDataSize.toLong
    } else if (sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled) {
      ...
    } else {
      sparkSession.sessionState.conf.defaultSizeInBytes
    }
    ```




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221942182
  
    @rxin Thanks for taking the time to review and merge the patch. I have added the email to my profile.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221390997
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221390999
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59215/




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63968077
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    --- End diff --
    
    @rxin I have replaced the chained `getOrElse` calls with a conventional if/else if.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63633968
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    --- End diff --
    
    Yeah, definitely better to just catch IOException.
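
    For illustration, a minimal sketch of the narrowed catch (the helper is hypothetical; the call and fallback follow the diff above):

    ```
    import java.io.IOException
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Swallow only I/O failures from the FileSystem call, falling back to
    // `default`; anything unexpected still propagates.
    def hdfsSizeOrDefault(fs: FileSystem, path: Path, default: Long): Long =
      try {
        fs.getContentSummary(path).getLength
      } catch {
        case _: IOException => default
      }
    ```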




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r64445157
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -113,6 +113,13 @@ object SQLConf {
         .longConf
         .createWithDefault(10L * 1024 * 1024)
     
    +  val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
    +    SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
    +    .doc("If the table statistics are not available from table metadata enable fall back to hdfs" +
    --- End diff --
    
    nit: missing period after hdfs




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221360098
  
    jenkins test this please




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63627565
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    --- End diff --
    
    As this call is needed only for query optimization, I think it makes sense to ignore failures. If you think we should just catch IOException, I am fine with that too.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63633986
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    +                    log.warn("Failed to get table size from hdfs.", e)
    --- End diff --
    
    logWarning




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221417642
  
    This looks good. Just two minor nits. If you can fix those, that would be great.
    
    Also - would it be possible to add a test case?





[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221450739
  
    @rxin added a test case.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63626710
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    +                val hadoopConf = sparkSession.sessionState.newHadoopConf()
    +                val fs: FileSystem = hiveQlTable.getPath.getFileSystem(hadoopConf)
    +                try {
    +                  fs.getContentSummary(hiveQlTable.getPath).getLength
    +                } catch {
    +                  case e: Exception =>
    --- End diff --
    
    I wonder if it is appropriate to catch every exception here.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13150#discussion_r63626678
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala ---
    @@ -114,17 +115,27 @@ private[hive] case class MetastoreRelation(
           val rawDataSize = hiveQlTable.getParameters.get(StatsSetupConst.RAW_DATA_SIZE)
           // TODO: check if this estimate is valid for tables after partition pruning.
           // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
    -      // relatively cheap if parameters for the table are populated into the metastore.  An
    -      // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
    -      // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    -      // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
    +      // relatively cheap if parameters for the table are populated into the metastore.
    +      // Besides `totalSize`, there are also `numFiles`, `numRows`, `rawDataSize` keys
    +      // (see StatsSetupConst in Hive) that we can look at in the future.
           BigInt(
             // When table is external,`totalSize` is always zero, which will influence join strategy
             // so when `totalSize` is zero, use `rawDataSize` instead
    -        // if the size is still less than zero, we use default size
    +        // if the size is still less than zero, we try to get the file size from HDFS.
    +        // given this is only needed for optimization, if the HDFS call fails we return the default.
             Option(totalSize).map(_.toLong).filter(_ > 0)
               .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    -            .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
    +              .getOrElse({
    --- End diff --
    
    (I remember I was told that chaining `getOrElse` is not preferred because it is confusing)




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221467220
  
    Merging in master/2.0.





[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221463522
  
    **[Test build #3017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3017/consoleFull)** for PR 13150 at commit [`ff69f91`](https://github.com/apache/spark/commit/ff69f91e273580547b6d86a6dfc87f2a94066507).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by Parth-Brahmbhatt <gi...@git.apache.org>.
Github user Parth-Brahmbhatt commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-221017398
  
    @sameeragarwal Added the config option. @rxin, can you take a look one more time?




[GitHub] spark pull request: [SPARK-15365] [SQL]: When table size statistic...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13150#issuecomment-220734823
  
    @Parth-Brahmbhatt this looks pretty good. However, given that hitting the underlying filesystem directly can incur a lot of latency (especially in the case of S3), can you please conf-protect this change (with a comment about the potential performance issues)? Additionally, it might be nice to set the conf to false by default to prevent silent regressions for existing queries (especially if we're targeting this for 2.0).
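
    For illustration, a sketch of what the conf-protected setting could look like (the key name matches the diff posted elsewhere in this thread; the doc wording is an assumption):

    ```
    // Hypothetical SQLConf entry: defaults to false so existing queries keep
    // their current size estimates unless a user explicitly opts in.
    val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
      SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
        .doc("If table statistics are not available from the metastore, fall back to " +
          "HDFS to estimate the table size. This can be expensive, especially on " +
          "object stores such as S3.")
        .booleanConf
        .createWithDefault(false)
    ```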

