Posted to reviews@spark.apache.org by petermaxlee <gi...@git.apache.org> on 2016/06/30 04:54:34 UTC

[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

GitHub user petermaxlee opened a pull request:

    https://github.com/apache/spark/pull/13989

    [SPARK-16311][SQL] Improve metadata refresh

    ## What changes were proposed in this pull request?
    This patch implements the 3 things specified in SPARK-16311:
    
    (1) Append a message to the FileNotFoundException saying that a workaround is to explicitly refresh the metadata.
    (2) Make metadata refresh work on temporary tables/views.
    (3) Make metadata refresh work on Datasets/DataFrames, by introducing a Dataset.refresh() method.
    
    And one additional small change:
    (4) Merge invalidateTable and refreshTable.
    
    ## How was this patch tested?
    Created a new test suite that creates a temporary directory and then deletes a file from it to verify Spark can read the directory once refresh is called.
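    
    Roughly, the new suite does something like this (a simplified sketch, not
    the exact code; it assumes the `withTempPath` helper from SQLTestUtils and
    the `Dataset.refresh()` method added by this patch):
    
        test("SPARK-16311: metadata refresh after deleting an input file") {
          withTempPath { dir =>
            val path = dir.getCanonicalPath
            spark.range(100).repartition(4).write.parquet(path)
            val df = spark.read.parquet(path)
            assert(df.count() === 100)
    
            // Delete one part file behind Spark's back; the cached file listing
            // is now stale, so reads can fail with FileNotFoundException.
            dir.listFiles().filter(_.getName.startsWith("part-")).head.delete()
    
            // An explicit refresh re-lists the directory, after which Spark can
            // read the remaining files.
            df.refresh()
            assert(df.count() < 100)
          }
        }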


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/petermaxlee/spark SPARK-16311

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13989.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13989
    
----
commit cbfbbc7d27ae086805625fa41dbcbad50783fee8
Author: petermaxlee <pe...@gmail.com>
Date:   2016-06-30T04:50:37Z

    [SPARK-16311][SQL] Improve metadata refresh

commit f7150345245accd0e71a351e9da9ebac9b80a520
Author: petermaxlee <pe...@gmail.com>
Date:   2016-06-30T04:53:58Z

    Add test suite

----




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69078499
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    Sorry, yeah, it cannot be refreshed.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69090124
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    We can unpersist, but should persist it again immediately.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69073383
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    use @tailrec




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69071788
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    How about the other leaf nodes? `LogicalRelation` is just one of them.




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    cc @rxin




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    In general, I think reconstructing a DataFrame/Dataset or using `REFRESH TABLE` may be a better approach to the problem this PR tries to solve. Did I miss some context here?
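    
    For reference, both existing workarounds are one-liners (the table and path
    names here are just placeholders):
    
        spark.catalog.refreshTable("my_table")        // or: spark.sql("REFRESH TABLE my_table")
        val df2 = spark.read.parquet("/path/to/data") // reconstruct to re-list the files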




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    The test cases are not enough to cover metadata refreshing. The current metadata cache is only used for data source tables, but we can still convert Hive tables (for example, Parquet and ORC) to data source tables. Thus, we also need to check the behavior in those cases.
    
    Try to design more test cases for metadata refreshing, including both positive and negative cases.




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    **[Test build #61524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61524/consoleFull)** for PR 13989 at commit [`82f9bec`](https://github.com/apache/spark/commit/82f9bec79125ad3f1c4da504891a75adb5b33f2f).




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    What do you mean by both positive and negative cases?




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69077719
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    Let us wait for comments from the committers. They might be OK with it for your scenario. Normally, a recursive implementation is not encouraged.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69081636
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    --- End diff --
    
    "Refreshes" instead of "Invalidates"?




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69078533
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    `cachedDataSourceTables` is only for the converted `MetastoreRelation`.




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    **[Test build #3159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3159/consoleFull)** for PR 13989 at commit [`82f9bec`](https://github.com/apache/spark/commit/82f9bec79125ad3f1c4da504891a75adb5b33f2f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69077123
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    This is the comment I got when I used a recursive implementation.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69073454
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    You have to document the reason why only `LogicalRelation` overrides this function.




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61524/
    Test PASSed.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69073191
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -139,18 +139,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
       }
     
       def refreshTable(tableIdent: TableIdentifier): Unit = {
    -    // refreshTable does not eagerly reload the cache. It just invalidate the cache.
    -    // Next time when we use the table, it will be populated in the cache.
    -    // Since we also cache ParquetRelations converted from Hive Parquet tables and
    -    // adding converted ParquetRelations into the cache is not defined in the load function
    -    // of the cache (instead, we add the cache entry in convertToParquetRelation),
    -    // it is better at here to invalidate the cache to avoid confusing waring logs from the
    -    // cache loader (e.g. cannot find data source provider, which is only defined for
    -    // data source table.).
    --- End diff --
    
    Keep the comments? 




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074411
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    I know, but we need to write comments for code readers.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69075298
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
    @@ -166,8 +166,8 @@ private[sql] class SessionState(sparkSession: SparkSession) {
     
       def executePlan(plan: LogicalPlan): QueryExecution = new QueryExecution(sparkSession, plan)
     
    -  def invalidateTable(tableName: String): Unit = {
    -    catalog.invalidateTable(sqlParser.parseTableIdentifier(tableName))
    +  def refreshTable(tableName: String): Unit = {
    --- End diff --
    
    I just picked the one that was exposed to users (refresh in catalog and in sql).





[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    Alright I will do that and submit a new pull request.
    
    Note that I think DataFrame refresh is already possible via table refresh, if the DataFrame references a table.





[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    Would this work? Traverse the logical plan to find whether it references any catalog relation, and if it does, call catalog.refreshTable("...")?
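    
    Something along these lines (a sketch; it assumes catalog relations expose
    their table identifier via a `CatalogRelation`-style node, so the exact
    names may not match the code):
    
        import org.apache.spark.sql.{DataFrame, SparkSession}
        import org.apache.spark.sql.catalyst.catalog.CatalogRelation
    
        def refreshReferencedTables(spark: SparkSession, df: DataFrame): Unit = {
          // TreeNode.foreach visits every node of the analyzed plan.
          df.queryExecution.analyzed.foreach {
            case rel: CatalogRelation =>
              spark.catalog.refreshTable(rel.catalogTable.identifier.quotedString)
            case _ => // other nodes keep no catalog state
          }
        }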




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074558
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    I think we want to avoid a recursive implementation if at all possible. It can be too expensive for a large tree.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69077414
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    How about `MetastoreRelation`? 




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074039
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    This is not a tailrec?




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69075247
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    I don't agree on this one. LogicalRelation might not be the only one that needs to override this in the future. There can certainly be other logical plans in the future that keep some state and need to implement refresh. Defining "refresh" with a default implementation also means that only plans that actually need to refresh something should override it.
    
    I'm going to update refresh in LogicalPlan to make this more clear.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69083486
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    `children.foreach(_.refresh())` isn't tail recursive. Using a non-tail recursion here is probably fine since this isn't a critical path. As @petermaxlee said, we are already using recursion to handle `TreeNode`s everywhere with the assumption that such trees are relatively shallow. We do need to pay attention when using recursion in other places though.
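    
    To make that concrete: traversing `children` is tree recursion, so `@tailrec`
    cannot apply as-is. The classic tail-recursive alternative keeps an explicit
    worklist, roughly like this (a sketch only; `refreshNode()` is a hypothetical
    node-local hook, since the PR's `refresh()` already recurses itself):
    
        import scala.annotation.tailrec
        import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    
        @tailrec
        def refreshAll(pending: List[LogicalPlan]): Unit = pending match {
          case Nil => // done
          case plan :: rest =>
            plan.refreshNode() // hypothetical: refresh this node only
            refreshAll(plan.children.toList ::: rest)
        }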




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    cc @cloud-fan / @liancheng 




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    I think @liancheng has a good point. Why don't we take out Dataset.refresh() for now?





[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69071525
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    It will remove the cached data. This is different from what the JIRA describes. CC @rxin




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69077588
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    Can it be refreshed?





[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69079649
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    When we call `Dataset.refresh()` on the true `MetastoreRelation` (non-converted Hive tables), we have to rebuild the plan, right? 




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074131
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    What do you mean? Other leaf nodes don't keep state, do they?





[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69072136
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    This new API has different behaviors from the `refreshTable` API and `Refresh Table` SQL statement. See the following code:
    https://github.com/apache/spark/blob/02a029df43392c5d73697203bf6ff51b8d6efb83/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala#L349-L374
    
    IMO, if we are using the word `refresh`, we have to make them consistent.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69090469
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    Actually we can and should call `unpersist`, but we should also call `persist()`/`cache()` again so that the Dataset will be cached lazily again with correct data when it gets executed next time. I guess that's also what @gatorsmile meant.
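    
    In other words, something like this inside `refresh()` (a sketch only; the
    check for whether the Dataset is currently cached is an assumption, not an
    existing helper):
    
        def refresh(): Unit = {
          // Assumption: lookupCachedData is one way to test whether this
          // Dataset is currently cached; the real check may differ.
          val wasCached = sparkSession.sharedState.cacheManager
            .lookupCachedData(this).nonEmpty
          unpersist(blocking = false)       // drop the stale cached data
          queryExecution.logical.refresh()  // invalidate file-system metadata
          if (wasCached) {
            persist() // re-registered lazily; correct data is materialized
          }           // on the next action
        }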




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    **[Test build #61524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61524/consoleFull)** for PR 13989 at commit [`82f9bec`](https://github.com/apache/spark/commit/82f9bec79125ad3f1c4da504891a75adb5b33f2f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    Previously, I tried to merge `invalidateTable` and `refreshTable`. @yhuai left the following comment:
    https://github.com/apache/spark/pull/13156#discussion_r63729506
    
    I think maybe we can keep them separate?




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    For example, say I try to refresh the metadata of a DataFrame that has multiple `MetastoreRelation` leaf nodes. I think we expect the metadata stored in `cachedDataSourceTables` to be invalidated. The positive case is that the table is still cached; the negative case is that the table is already uncached.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69077392
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    Can you point me to it? From what I saw, the entire TreeNode library has a lot of recursive calls.





[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074328
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    But this function is not tail recursive.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074265
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    You need to mark it with `@tailrec`.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee closed the pull request at:

    https://github.com/apache/spark/pull/13989




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69073906
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
    @@ -166,8 +166,8 @@ private[sql] class SessionState(sparkSession: SparkSession) {
     
       def executePlan(plan: LogicalPlan): QueryExecution = new QueryExecution(sparkSession, plan)
     
    -  def invalidateTable(tableName: String): Unit = {
    -    catalog.invalidateTable(sqlParser.parseTableIdentifier(tableName))
    +  def refreshTable(tableName: String): Unit = {
    --- End diff --
    
    To be honest, I still think `invalidateTable` is the right name. We are not doing a `refresh`.




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69071622
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    Other refresh methods also remove cached data, so I thought this was better.





[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    **[Test build #3159 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3159/consoleFull)** for PR 13989 at commit [`82f9bec`](https://github.com/apache/spark/commit/82f9bec79125ad3f1c4da504891a75adb5b33f2f).




[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/13989
  
    One concern of mine is that the analyzed, optimized, and executed (physical) plans stored in `QueryExecution` are all lazy vals, which means they won't be re-analyzed or re-planned after refreshing the metadata of the corresponding logical plan.
    
    Say we constructed a DataFrame `df` to join a small table `A` and a large table `B`. After calling `df.write.parquet(...)`, the analyzed, optimized, and executed plans of `df` are all computed. Since `A` is small, the planner may decide to broadcast it, and this decision is reflected in the physical plan.
    
    Next, we add a bunch of files into the directory where table `A` lives to make it super large, then call `df.refresh()` to refresh the logical plan. Now, if we try to call `df.write.parquet(...)` again, the query will probably crash, since the physical plan is not refreshed and still assumes that `A` should be broadcast.
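    
    In code, the scenario looks roughly like this (the paths are placeholders):
    
        val a  = spark.read.parquet("/data/A")   // small table
        val b  = spark.read.parquet("/data/B")   // large table
        val df = a.join(b, "id")
        df.write.parquet("/out/run1")            // planner decides to broadcast A
    
        // An external process adds many files under /data/A here ...
    
        df.refresh()                             // the logical plan re-lists the files,
        df.write.parquet("/out/run2")            // but QueryExecution's lazy plans were
                                                 // already computed, so the stale
                                                 // broadcast plan is reused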




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69074253
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql](
       def distinct(): Dataset[T] = dropDuplicates()
     
       /**
    +   * Refreshes the metadata and data cached in Spark for data associated with this Dataset.
    +   * An example use case is to invalidate the file system metadata cached by Spark, when the
    +   * underlying files have been updated by an external process.
    +   *
    +   * @group action
    +   * @since 2.0.0
    +   */
    +  def refresh(): Unit = {
    +    unpersist(false)
    --- End diff --
    
    Ah, I see, we can't unpersist.





[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by petermaxlee <gi...@git.apache.org>.
Github user petermaxlee commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69075198
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
    @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
               s"Reference '$name' is ambiguous, could be: $referenceNames.")
         }
       }
    +
    +  /**
    +   * Invalidates any metadata cached in the plan recursively.
    +   */
    +  def refresh(): Unit = children.foreach(_.refresh())
    --- End diff --
    
    I don't get it. Why would this be more expensive than any other recursive calls that happen in logical plans?




[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13989#discussion_r69078242
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala ---
    @@ -85,5 +85,10 @@ case class LogicalRelation(
           expectedOutputAttributes,
           metastoreTableIdentifier).asInstanceOf[this.type]
     
    +  override def refresh(): Unit = relation match {
    +    case fs: HadoopFsRelation => fs.refresh()
    --- End diff --
    
    Yeah, we can invalidate the cache entry in `cachedDataSourceTables`.

