Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/05/02 14:28:04 UTC

[jira] [Resolved] (SPARK-20559) Refreshing a cached RDD without restarting the Spark application

     [ https://issues.apache.org/jira/browse/SPARK-20559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-20559.
-------------------------------
    Resolution: Invalid

This should go to user@spark.apache.org

> Refreshing a cached RDD without restarting the Spark application
> ----------------------------------------------------------------
>
>                 Key: SPARK-20559
>                 URL: https://issues.apache.org/jira/browse/SPARK-20559
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Jayesh lalwani
>
> We have a Structured Streaming application that reads accounts from Kafka into a streaming data frame. We have a blacklist of accounts stored in S3, and we want to filter out all blacklisted accounts. So we load the blacklisted accounts into a batch data frame and join it with the streaming data frame to filter out the bad accounts.
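> For concreteness, a minimal PySpark sketch of that stream-static join (the broker address, topic name, S3 path, and single-column CSV layout are illustrative placeholders, not our real setup):
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.getOrCreate()
>
>     # Streaming accounts arriving from Kafka
>     accounts = (spark.readStream
>                 .format("kafka")
>                 .option("kafka.bootstrap.servers", "broker:9092")
>                 .option("subscribe", "accounts")
>                 .load()
>                 .selectExpr("CAST(value AS STRING) AS account_id"))
>
>     # Static blacklist loaded from S3 as a batch data frame
>     blacklist = spark.read.csv("s3a://bucket/blacklist.csv").toDF("account_id")
>
>     # Left anti join keeps only streaming rows with no match in the blacklist
>     # (stream-static join support varies by Spark version)
>     clean = accounts.join(blacklist, "account_id", "left_anti")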
> Now, the blacklist doesn't change very often: once a week at most. So we wanted to cache the blacklist data frame to avoid going out to S3 every time. Since the blacklist might still change, we want to be able to refresh the cache on a regular cadence, without restarting the whole app.
> So, to begin with, we wrote a simple app that caches and refreshes a simple data frame. The steps we followed are listed here (with a code sketch after the list):
> * Create a CSV file.
> * Load the CSV into a data frame: df = spark.read.csv(filename)
> * Persist the data frame: df.persist()
> * Call df.show(); we see the contents of the CSV.
> * Change the CSV and call df.show() again; the old contents are still displayed, proving that the data frame is cached.
> * df.unpersist()
> * df.persist()
> * df.show()
> * What we see is that rows that were modified in the CSV are reloaded, but new rows are not.
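> In code, the repro looks roughly like this (the file path is a placeholder):
>
>     df = spark.read.csv("/tmp/test.csv")
>     df.persist()    # marks the data frame for caching; it is materialized by the first action
>     df.show()       # caches and displays the current file contents
>     # ...edit the CSV on disk...
>     df.show()       # still shows the old, cached contents
>     df.unpersist()  # drops the cached data
>     df.persist()
>     df.show()       # recomputes from the file: modified rows refresh, but new rows do not appear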
> Is this expected behavior? Is there a better way to refresh cached data without restarting the Spark application?


