Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2014/08/02 23:52:02 UTC

[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1741

    [SPARK-2783][SQL] Basic support for analyze in HiveContext

    JIRA: https://issues.apache.org/jira/browse/SPARK-2783

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark analyzeTable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1741.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1741
    
----
commit 23df227062e6d3b3f5a9e64da9930285c56fc360
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-08-02T21:50:23Z

    Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50975715
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17782/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50975587
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17780/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15736318
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -93,6 +97,83 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        // TODO: Can we use fs.getContentSummary?
    +        // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
    +        // countFileSize to count the table size.
    +        def countFileSize(fs: FileSystem, path: Path): Long = {
    --- End diff --
    
    It should be `calculateTableSize`.
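
    A minimal sketch of the recursive size calculation discussed in this comment, not the code from the PR. The quoted diff is cut off at the method signature, so the body below is an assumption. It uses `java.io.File` on the local filesystem instead of Hadoop's `FileSystem` and `Path`, so it runs without Hadoop on the classpath. The name `calculateTableSize` follows the rename suggested above, and the `TableSize` object is hypothetical.

    ```scala
    import java.io.File

    object TableSize {
      // Sums the lengths of all regular files under `path`, recursing into
      // subdirectories. Directory entries themselves contribute nothing.
      def calculateTableSize(path: File): Long = {
        if (path.isDirectory) {
          // listFiles() returns null on an I/O error, so wrap it in Option.
          Option(path.listFiles())
            .getOrElse(Array.empty[File])
            .map(calculateTableSize)
            .sum
        } else {
          // length() returns 0 for a path that does not exist.
          path.length()
        }
      }
    }
    ```

    Counting only regular files keeps the total independent of how a given filesystem reports the size of a directory entry.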



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1741



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51000495
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51003029
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17829/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50975840
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17782/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50982297
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51000563
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17824/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by concretevitamin <gi...@git.apache.org>.

Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733265
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
           // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
    --- End diff --
    
    Perhaps update the comments here to say other fields in `StatsSetupConst` might be useful.



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51001531
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by liancheng <gi...@git.apache.org>.

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15739401
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logWarning(
    +                s"Failed to get the size of table ${table.getTableName} in the " +
    +                s"database ${table.getDbName} because of ${e.toString}", e)
    +              size = 0L
    +          }
    +
    +          size
    +        }
    +
    +        val tableParameters = relation.hiveQlTable.getParameters
    +        val oldTotalSize =
    +          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
    +        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
    +        // Update the Hive metastore if the total size of the table is different than the size
    +        // recorded in the Hive metastore.
    +        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
    +        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
    --- End diff --
    
    Sorry, didn't see this yesterday. Confirmed that the metastore handles concurrency correctly with a transaction of the underlying RDBMS.
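
    The update rule in the quoted diff (write `totalSize` only when the newly measured size is positive and differs from the recorded one) can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PR's code: a mutable map stands in for the Hive table parameters, the key `"totalSize"` is taken from the commit message, and `TotalSizeUpdate` and `updateTotalSize` are hypothetical names. The actual metastore write (`alterTable`) is left out.

    ```scala
    import scala.collection.mutable

    object TotalSizeUpdate {
      // Stand-in for StatsSetupConst.TOTAL_SIZE; the commit message names
      // the property "totalSize".
      val TotalSizeKey = "totalSize"

      // Returns true when the parameter map was changed, i.e. when a
      // metastore write would be needed.
      def updateTotalSize(params: mutable.Map[String, String], newTotalSize: Long): Boolean = {
        // A missing property is treated as a recorded size of 0.
        val oldTotalSize = params.get(TotalSizeKey).map(_.toLong).getOrElse(0L)
        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
          params(TotalSizeKey) = newTotalSize.toString
          true
        } else {
          false
        }
      }
    }
    ```

    Skipping the write when nothing changed, or when the size calculation fell back to 0 after a failure, avoids overwriting a previously recorded value with a useless one.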



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50981507
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17799/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50976401
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17786/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50998272
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17819/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by concretevitamin <gi...@git.apache.org>.

Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733255
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -21,12 +21,15 @@ import java.io.{BufferedReader, File, InputStreamReader, PrintStream}
     import java.sql.Timestamp
     import java.util.{ArrayList => JArrayList}
     
    +import org.apache.hadoop.hive.ql.stats.StatsSetupConst
    --- End diff --
    
    Alphabetize imports



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51004552
  
    This only failed streaming tests.  I'm going to merge into master and 1.1.  Thanks @yhuai!



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50975750
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17780/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51000231
  
    ```
    [info] StatisticsSuite:
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable, size: 4096
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/part-00001_copy_1, size: 2896
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/part-00001, size: 2896
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/part-00000_copy_1, size: 2916
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/part-00000, size: 2916
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/_SUCCESS, size: 0
    path: file:/tmp/sparkHiveWarehouse7050426032683338824/analyzetable/_SUCCESS_copy_1, size: 0
    [info] - analyze MetastoreRelations *** FAILED ***
    [info]   11768 did not equal 11624 (StatisticsSuite.scala:42)
    ```
    Looks like there is something wrong with `getContentSummary`. I will not use it.
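
    As a quick check on the numbers in this log, the sketch below sums the per-file sizes printed above and compares the result with the two values in the failed assertion. The sizes are copied from the log; the object and value names are hypothetical. It only does the arithmetic, and the thread does not say where the extra bytes come from.

    ```scala
    object SizeCheck {
      // Sizes of the regular files listed in the log. The 4096-byte entry is
      // the table directory itself and is left out.
      val fileSizes: Seq[Long] = Seq(2896L, 2896L, 2916L, 2916L, 0L, 0L)

      // Sum of the listed files; this matches the expected value in the test.
      val sumOfFiles: Long = fileSizes.sum

      // Value the test actually observed, from "11768 did not equal 11624".
      val reported: Long = 11768L

      // Bytes not accounted for by the listed files.
      val gap: Long = reported - sumOfFiles
    }
    ```

    The listed files add up to 11624 bytes, so the observed 11768 is 144 bytes larger than the sum of the visible files.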



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50977857
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17786/consoleFull



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15735734
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logWarning(
    +                s"Failed to get the size of table ${table.getTableName} in the " +
    +                s"database ${table.getDbName} because of ${e.toString}", e)
    +              size = 0L
    +          }
    +
    +          size
    +        }
    +
    +        val tableParameters = relation.hiveQlTable.getParameters
    +        val oldTotalSize =
    +          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
    +        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
    +        // Update the Hive metastore if the total size of the table is different than the size
    +        // recorded in the Hive metastore.
    +        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
    +        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
    --- End diff --
    
    It seems that every time we call `lookupRelation`, we get a new instance of `hiveQlTable` (with a new instance of the underlying Hive TTable). So, I think `tableParameters` will not be shared. The place where we do the metastore update is `catalog.client.alterTable(tableFullName, new Table(hiveTTable))`. I guess the underlying metastore can take care of the concurrent update.
    
    @liancheng can you also take a look at this analyze method?



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15732118
  
    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
    @@ -26,6 +26,53 @@ import org.apache.spark.sql.hive.test.TestHive._
     
     class StatisticsSuite extends QueryTest {
     
    +  test("analyze MetastoreRelations") {
    +    def queryTotalSize(tableName: String): BigInt =
    +      catalog.lookupRelation(None, tableName).statistics.sizeInBytes
    +
    +    // Non-partitioned table
    +    hql("CREATE TABLE srcToBeAnalyzed (key STRING, value STRING)").collect()
    +    hql("INSERT INTO TABLE srcToBeAnalyzed SELECT * FROM src").collect()
    +    hql("INSERT INTO TABLE srcToBeAnalyzed SELECT * FROM src").collect()
    +
    +    assert(queryTotalSize("srcToBeAnalyzed") === defaultSizeInBytes)
    +
    +    analyze("srcTobeAnalyzed")
    +
    +    assert(queryTotalSize("srcToBeAnalyzed") === BigInt(11624))
    +
    +    hql("DROP TABLE srcToBeAnalyzed").collect()
    --- End diff --
    
    We'll need to coordinate this with #1746.  The only problem will be a deprecation warning though, so I'm okay with fixing it up in a follow-up PR.



[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15731595
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,62 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   * Right now, it only supports retrieving the size of a Hive table.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logger.warn(
    --- End diff --
    
    There have been some changes in logging; use `logWarning(...)` instead.


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15736150
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logWarning(
    +                s"Failed to get the size of table ${table.getTableName} in the " +
    +                s"database ${table.getDbName} because of ${e.toString}", e)
    +              size = 0L
    +          }
    +
    +          size
    +        }
    +
    +        val tableParameters = relation.hiveQlTable.getParameters
    +        val oldTotalSize =
    +          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
    +        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
    +        // Update the Hive metastore if the total size of the table is different than the size
    +        // recorded in the Hive metastore.
    +        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
    +        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
    +          val hiveTTable = relation.hiveQlTable.getTTable
    +          hiveTTable.setParameters(tableParameters)
    +          val tableFullName =
    +            relation.hiveQlTable.getDbName() + "." + relation.hiveQlTable.getTableName()
    +
    +          catalog.client.alterTable(tableFullName, new Table(hiveTTable))
    +        }
    +      }
    +      case otherRelation =>
    +        throw new NotImplementedError(s"Analyzing a ${otherRelation} has not been implemented")
    --- End diff --
    
    I think I would actually disagree here.  When users get this error, it may be helpful to know what type of relation they are trying to analyze, so you can explain why that's not yet possible.  Avoiding toString is probably reasonable, since for TreeNodes it can be quite verbose.
    
    Perhaps: `s"Analyze has only been implemented for Hive MetaStore relations, but $tableName is a ${otherRelation.nodeName}."`
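
As a rough illustration of the message suggested above, the sketch below builds it from a table name and a relation's node name. The `Relation` trait and the two stub classes are hypothetical stand-ins, not Catalyst's actual classes; only the message text comes from the comment.

```scala
object AnalyzeErrorSketch {
  // Hypothetical stand-in for a logical plan node; only `nodeName` matters here.
  trait Relation { def nodeName: String }

  // Hypothetical stubs; real relations carry far more state than this.
  case class MetastoreRelationStub(tableName: String) extends Relation {
    val nodeName = "MetastoreRelation"
  }
  case class ParquetRelationStub(path: String) extends Relation {
    val nodeName = "ParquetRelation"
  }

  // Builds the short error message without calling toString on the whole plan.
  def unsupportedMessage(tableName: String, other: Relation): String =
    s"Analyze has only been implemented for Hive MetaStore relations, " +
      s"but $tableName is a ${other.nodeName}."

  // Mirrors the shape of the match in `analyze`: supported relations pass, others fail.
  def analyze(tableName: String, relation: Relation): Unit = relation match {
    case _: MetastoreRelationStub => () // statistics collection would happen here
    case other => throw new NotImplementedError(unsupportedMessage(tableName, other))
  }
}
```

The message stays one line long regardless of how large the unsupported plan is, which is the point of preferring `nodeName` over `toString`.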


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50998203
  
    I logged into Jenkins and tried my test. It passed... I just added a few logging entries, and hopefully we can see what's going on from the console output.


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51001520
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17822/consoleFull


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by concretevitamin <gi...@git.apache.org>.

Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733277
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logWarning(
    +                s"Failed to get the size of table ${table.getTableName} in the " +
    +                s"database ${table.getDbName} because of ${e.toString}", e)
    +              size = 0L
    +          }
    +
    +          size
    +        }
    +
    +        val tableParameters = relation.hiveQlTable.getParameters
    +        val oldTotalSize =
    +          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
    +        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
    +        // Update the Hive metastore if the total size of the table is different than the size
    +        // recorded in the Hive metastore.
    +        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
    +        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
    +          val hiveTTable = relation.hiveQlTable.getTTable
    +          hiveTTable.setParameters(tableParameters)
    +          val tableFullName =
    +            relation.hiveQlTable.getDbName() + "." + relation.hiveQlTable.getTableName()
    +
    +          catalog.client.alterTable(tableFullName, new Table(hiveTTable))
    +        }
    +      }
    +      case otherRelation =>
    +        throw new NotImplementedError(s"Analyzing a ${otherRelation} has not been implemented")
    --- End diff --
    
    We probably don't want the result of a general `.toString`. Perhaps just say "Analyzing relations other than MetastoreRelations has not been implemented" instead.


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50981485
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51002608
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17824/consoleFull


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by concretevitamin <gi...@git.apache.org>.

Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733260
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
           // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
           // `rawDataSize` keys that we can look at in the future.
           BigInt(
    -        Option(hiveQlTable.getParameters.get("totalSize"))
    +        Option(hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE))
    --- End diff --
    
    Oh wow, this is a hard-to-find class!


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by marmbrus <gi...@git.apache.org>.

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50977990
  
    Hmmm, Linux vs. Mac file size problems?
    
    ```
    [info] StatisticsSuite:
    [info] - analyze MetastoreRelations *** FAILED ***
    [info]   11768 did not equal 11624 (StatisticsSuite.scala:42)
    ```


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by yhuai <gi...@git.apache.org>.

Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51000971
  
    ```
    [info] StatisticsSuite:
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable, size: 4096
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/part-00001_copy_1, size: 2896
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/part-00001, size: 2896
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/part-00000_copy_1, size: 2916
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/part-00000, size: 2916
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/_SUCCESS, size: 0
    path: file:/tmp/sparkHiveWarehouse5177651279772692023/analyzetable/_SUCCESS_copy_1, size: 0
    Size of table returned from fs.getContentSummary(path).getLength(): 11768
    [info] - analyze MetastoreRelations *** FAILED ***
    [info]   11768 did not equal 11624 (StatisticsSuite.scala:42)
    ```
    It seems `getContentSummary` caused the problem: the four part files sum to 11624 bytes, but it returned 11768.
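
One possible way to make the reported size independent of how a file system summarizes a directory is to sum the lengths of the regular files directly. The sketch below does this with `java.nio.file` on a local directory, purely for illustration; `sumOfFileSizes` is a hypothetical helper and is not necessarily the approach taken in this pull request.

```scala
import java.nio.file.{Files, Path}

object TableSizeSketch {
  // Sums the lengths of the regular files directly under `dir`.
  // The directory entry itself and any subdirectories are not counted.
  def sumOfFileSizes(dir: Path): Long = {
    val stream = Files.newDirectoryStream(dir)
    try {
      val it = stream.iterator()
      var total = 0L
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p)) total += Files.size(p)
      }
      total
    } finally {
      stream.close() // always release the directory handle
    }
  }
}
```

A version aimed at Hadoop would go through the Hadoop `FileSystem` API rather than `java.nio.file`; the local variant is used here only so the example runs without extra dependencies.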


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51000547
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17819/consoleFull


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51003691
  
    QA results for PR 1741:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds no public classes.
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17825/consoleFull


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by concretevitamin <gi...@git.apache.org>.

Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1741#discussion_r15733274
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
       }
     
    +  /**
    +   * Analyzes the given table in the current database to generate statistics, which will be
    +   * used in query optimizations.
    +   *
    +   * Right now, it only supports Hive tables and it only updates the size of a Hive table
    +   * in the Hive metastore.
    +   */
    +  def analyze(tableName: String) {
    +    val relation = catalog.lookupRelation(None, tableName) match {
    +      case LowerCaseSchema(r) => r
    +      case o => o
    +    }
    +
    +    relation match {
    +      case relation: MetastoreRelation => {
    +        // This method is borrowed from
    +        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    +        // in Hive 0.13.
    +        // TODO: Generalize statistics collection.
    +        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
    +          val path = table.getPath()
    +          var size: Long = 0L
    +          try {
    +            val fs = path.getFileSystem(conf)
    +            size = fs.getContentSummary(path).getLength()
    +          } catch {
    +            case e: Exception =>
    +              logWarning(
    +                s"Failed to get the size of table ${table.getTableName} in the " +
    +                s"database ${table.getDbName} because of ${e.toString}", e)
    +              size = 0L
    +          }
    +
    +          size
    +        }
    +
    +        val tableParameters = relation.hiveQlTable.getParameters
    +        val oldTotalSize =
    +          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
    +        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
    +        // Update the Hive metastore if the total size of the table is different than the size
    +        // recorded in the Hive metastore.
    +        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
    +        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
    --- End diff --
    
    Do we need to be concerned about concurrent accesses to `tableParameters`? More generally do we need to somehow synchronize on `MetastoreRelation#hiveQlTable` in various places?
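
One way to sidestep part of this concern, sketched below, is to update a private copy of the parameter map instead of mutating the one returned by `getParameters`. This assumes the returned map is shared and mutable, which the thread does not confirm. `withTotalSize` is a hypothetical helper, and `"totalSize"` is the string literal that the patch replaces with `StatsSetupConst.TOTAL_SIZE`.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

object ParameterUpdateSketch {
  // The literal key that the patch replaces with StatsSetupConst.TOTAL_SIZE.
  val TotalSizeKey = "totalSize"

  // Returns a new map with the updated size; the input map is left untouched.
  // Copying is not atomic, so this narrows but does not remove the race if
  // another thread mutates `params` while the copy is being made.
  def withTotalSize(params: JMap[String, String], newTotalSize: Long): JMap[String, String] = {
    val copy = new JHashMap[String, String](params)
    copy.put(TotalSizeKey, newTotalSize.toString)
    copy
  }
}
```

Whether explicit synchronization on `MetastoreRelation#hiveQlTable` is also needed depends on how the relation is shared between threads, which this sketch does not address.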


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-50999395
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17822/consoleFull


[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1741#issuecomment-51001596
  
    QA tests have started for PR 1741. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17825/consoleFull

