Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/06/08 19:06:00 UTC

[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/13566

    [SPARK-15678] Add support to REFRESH data source paths

    ## What changes were proposed in this pull request?
    
    Spark currently (and incorrectly) continues to use cached data even after the underlying data has been overwritten.
    
    Current behavior:
    ```scala
    val dir = "/tmp/test"
    sqlContext.range(1000).write.mode("overwrite").parquet(dir)
    val df = sqlContext.read.parquet(dir).cache()
    df.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset
    ```
    
    This patch fixes the bug by adding support for `REFRESH path`, which invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.
    
    Expected behavior:
    ```scala
    val dir = "/tmp/test"
    sqlContext.range(1000).write.mode("overwrite").parquet(dir)
    val df = sqlContext.read.parquet(dir).cache()
    df.count() // outputs 1000
    sqlContext.range(10).write.mode("overwrite").parquet(dir)
    spark.catalog.refreshResource(dir)
    sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset
    ```
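
    The same refresh is also exposed through SQL as a new `REFRESH` command. A minimal sketch of that form, assuming the parser simply takes the rest of the statement as the path (quoting rules may vary by version):
    ```scala
    // Hedged sketch: equivalent to spark.catalog.refreshResource(dir) above.
    spark.sql("REFRESH /tmp/test")
    ```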
    
    ## How was this patch tested?
    
    Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark refresh-path-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13566
    
----
commit ece34abd63176c6192ebe4ef05f2e8799ff52955
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-06-06T23:15:42Z

    [SPARK-15678] Add support to REFRESH data source paths

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Looks pretty good. Left one comment.


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66540478
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,49 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +
    +    cachedData.foreach {
    +      case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined =>
    +        val dataIndex = cachedData.indexWhere(cd => data.plan.sameResult(cd.plan))
    +        if (dataIndex >= 0) {
    +          cachedData(dataIndex).cachedRepresentation.cachedColumnBuffers.unpersist(blocking = true)
    +          cachedData.remove(dataIndex)
    +        }
    +        sparkSession.sharedState.cacheManager.cacheQuery(Dataset.ofRows(sparkSession, data.plan))
    +      case _ => // Do Nothing
    +    }
    +  }
    +
    +  /**
    +   * Traverses a given `plan` and searches for the occurrences of `qualifiedPath` in the
    +   * [[org.apache.spark.sql.execution.datasources.FileCatalog]] of any [[HadoopFsRelation]] node
    +   * in the plan. If found, we refresh the metadata and return true. Otherwise, this method returns
    +   * false.
    +   */
    +  private def lookupAndRefresh(plan: LogicalPlan, fs: FileSystem, qualifiedPath: Path): Boolean = {
    +    plan match {
    +      case lr: LogicalRelation => lr.relation match {
    --- End diff --
    
    The `LogicalRelation` could be part of a cached logical plan; should we also invalidate that?


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60187/consoleFull)** for PR 13566 at commit [`ece34ab`](https://github.com/apache/spark/commit/ece34abd63176c6192ebe4ef05f2e8799ff52955).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class RefreshResource(path: String)`


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66666348
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,49 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +
    +    cachedData.foreach {
    +      case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined =>
    +        val dataIndex = cachedData.indexWhere(cd => data.plan.sameResult(cd.plan))
    +        if (dataIndex >= 0) {
    +          cachedData(dataIndex).cachedRepresentation.cachedColumnBuffers.unpersist(blocking = true)
    +          cachedData.remove(dataIndex)
    --- End diff --
    
    Unfortunately, I didn't find a way around it. The old plan has a stale catalog, so removing the old entry and adding a new one seems like the only option.


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66541070
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,49 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +
    +    cachedData.foreach {
    +      case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined =>
    +        val dataIndex = cachedData.indexWhere(cd => data.plan.sameResult(cd.plan))
    +        if (dataIndex >= 0) {
    +          cachedData(dataIndex).cachedRepresentation.cachedColumnBuffers.unpersist(blocking = true)
    --- End diff --
    
    data = cachedData(dataIndex), right?


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60212/consoleFull)** for PR 13566 at commit [`6acd0c0`](https://github.com/apache/spark/commit/6acd0c044759721719d0fe62e7214fca8c95d7b0).


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60187/


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60317/consoleFull)** for PR 13566 at commit [`e79f3f7`](https://github.com/apache/spark/commit/e79f3f782e22d758779d97ff64e94c998f9046d3).


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66541029
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,49 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +
    +    cachedData.foreach {
    +      case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined =>
    +        val dataIndex = cachedData.indexWhere(cd => data.plan.sameResult(cd.plan))
    +        if (dataIndex >= 0) {
    +          cachedData(dataIndex).cachedRepresentation.cachedColumnBuffers.unpersist(blocking = true)
    +          cachedData.remove(dataIndex)
    --- End diff --
    
    Once we refresh the LogicalRelation and unpersist the cachedRepresentation, do we still need to remove it and add it back?


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66344075
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,38 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +    cachedData.foreach {
    +      case data if data.plan.find {
    --- End diff --
    
    Could you move this into a separate function? It was kinda hard to see that this is part of the `case` guard.


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Thanks, I pulled it out into a separate function.


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    LGTM, merging this into master and 2.0, thanks!


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66685151
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -226,4 +226,11 @@ abstract class Catalog {
        */
       def refreshTable(tableName: String): Unit
     
    +  /**
    +   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
    +   * contains the given data source path.
    +   *
    +   * @since 2.0.0
    +   */
    +  def refreshResource(path: String): Unit
    --- End diff --
    
    alright, changed this to `refreshByPath` based on @ericl's suggestion :)
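
    A minimal usage sketch with the renamed API (the path is illustrative):
    ```scala
    // Name per this thread; refreshByPath is the form that shipped in 2.0.
    spark.catalog.refreshByPath("/tmp/test")
    ```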


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60317/


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60212/


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66541600
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -226,4 +226,11 @@ abstract class Catalog {
        */
       def refreshTable(tableName: String): Unit
     
    +  /**
    +   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
    +   * contains the given data source path.
    +   *
    +   * @since 2.0.0
    +   */
    +  def refreshResource(path: String): Unit
    --- End diff --
    
    I keep getting confused by Catalog/SessionCatalog/ExternalCatalog here; I thought this was SessionCatalog or ExternalCatalog, so it makes sense for it to be here (together with the other cache-related APIs).


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60212/consoleFull)** for PR 13566 at commit [`6acd0c0`](https://github.com/apache/spark/commit/6acd0c044759721719d0fe62e7214fca8c95d7b0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66666584
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -226,4 +226,11 @@ abstract class Catalog {
        */
       def refreshTable(tableName: String): Unit
     
    +  /**
    +   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
    +   * contains the given data source path.
    +   *
    +   * @since 2.0.0
    +   */
    +  def refreshResource(path: String): Unit
    --- End diff --
    
    I like `invalidateCache()`, but the reason for choosing `refreshResource()` was to make it sound similar to `refreshTable()` above. Let me know if you prefer one over the other.


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13566


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60187/consoleFull)** for PR 13566 at commit [`ece34ab`](https://github.com/apache/spark/commit/ece34abd63176c6192ebe4ef05f2e8799ff52955).


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66540666
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
    @@ -157,4 +161,49 @@ private[sql] class CacheManager extends Logging {
           case _ =>
         }
       }
    +
    +  /**
    +   * Invalidates the cache of any data that contains `resourcePath` in one or more
    +   * `HadoopFsRelation` node(s) as part of its logical plan.
    +   */
    +  private[sql] def invalidateCachedPath(
    +      sparkSession: SparkSession, resourcePath: String): Unit = writeLock {
    +    val (fs, qualifiedPath) = {
    +      val path = new Path(resourcePath)
    +      val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    +      (fs, path.makeQualified(fs.getUri, fs.getWorkingDirectory))
    +    }
    +
    +    cachedData.foreach {
    +      case data if data.plan.find(lookupAndRefresh(_, fs, qualifiedPath)).isDefined =>
    +        val dataIndex = cachedData.indexWhere(cd => data.plan.sameResult(cd.plan))
    +        if (dataIndex >= 0) {
    +          cachedData(dataIndex).cachedRepresentation.cachedColumnBuffers.unpersist(blocking = true)
    +          cachedData.remove(dataIndex)
    +        }
    +        sparkSession.sharedState.cacheManager.cacheQuery(Dataset.ofRows(sparkSession, data.plan))
    +      case _ => // Do Nothing
    +    }
    +  }
    +
    +  /**
    +   * Traverses a given `plan` and searches for the occurrences of `qualifiedPath` in the
    +   * [[org.apache.spark.sql.execution.datasources.FileCatalog]] of any [[HadoopFsRelation]] node
    +   * in the plan. If found, we refresh the metadata and return true. Otherwise, this method returns
    +   * false.
    +   */
    +  private def lookupAndRefresh(plan: LogicalPlan, fs: FileSystem, qualifiedPath: Path): Boolean = {
    +    plan match {
    +      case lr: LogicalRelation => lr.relation match {
    --- End diff --
    
    nvm, we are using `plan.find` in `invalidateCachedPath`


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66540134
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -226,4 +226,11 @@ abstract class Catalog {
        */
       def refreshTable(tableName: String): Unit
     
    +  /**
    +   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
    +   * contains the given data source path.
    +   *
    +   * @since 2.0.0
    +   */
    +  def refreshResource(path: String): Unit
    --- End diff --
    
    Should we call it `invalidateCache()` to reflect what it actually does?
    
    Also, it's a bit confusing to have this API on `catalog`; can we put it on `SparkSession`?
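
    For illustration, a hypothetical shape of that alternative (this method does not exist; the thread ultimately kept the API on `Catalog`):
    ```scala
    // Hypothetical placement on SparkSession, never implemented.
    spark.invalidateCache("/tmp/test")
    ```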


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #13566: [SPARK-15678] Add support to REFRESH data source paths

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13566
  
    **[Test build #60317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60317/consoleFull)** for PR 13566 at commit [`e79f3f7`](https://github.com/apache/spark/commit/e79f3f782e22d758779d97ff64e94c998f9046d3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #13566: [SPARK-15678] Add support to REFRESH data source ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13566#discussion_r66680102
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
    @@ -226,4 +226,11 @@ abstract class Catalog {
        */
       def refreshTable(tableName: String): Unit
     
    +  /**
    +   * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
    +   * contains the given data source path.
    +   *
    +   * @since 2.0.0
    +   */
    +  def refreshResource(path: String): Unit
    --- End diff --
    
    resource does sound a bit weird to me

