Posted to issues@spark.apache.org by "Julian (JIRA)" <ji...@apache.org> on 2018/02/08 13:22:00 UTC

[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming

    [ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356929#comment-16356929 ] 

Julian commented on SPARK-20568:
--------------------------------

I've started on data ingestion using Structured Streaming, where we will be processing large amounts of CSV data (later XML via Kafka, at which point I hope to switch to the Kafka structured streaming source). In short, about 6+ GB per minute that we need to process/transform through Spark. On smaller-scale user data sets I can understand wanting to keep the input, but in large-scale ELT/ETL and streaming flows we typically only archive the last N hours/days for recovery purposes; the raw data is simply too large to keep (and the above is just one of the 30 data sources we have connected so far, with many more coming). Upstream systems can often re-push the data as well, so retaining it is not necessary for every source. Being able to move the data once it has been processed would be very useful for us.

For now I have no choice but to build something for this myself. I can think of some simple "hdfs dfs -mv" commands to achieve it, but I don't yet fully understand the relationship between the input files, each writer's close() method, and the parallel nature of the HDP cluster. I also notice that if the process dies and restarts, it currently reads the data again, which would be a disaster with this much data, so I need to figure that out too. A rough sketch of what I have in mind follows below.
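
To make that concrete, this is roughly what I expect to have to build, assuming hypothetical paths (/data/csv-in, /data/parquet-out, /data/checkpoints, /data/archive), a made-up schema, and a 6-hour retention window. The checkpointLocation on the sink is what I understand should stop a restarted query from re-reading everything, and the archival pass is a separate batch job over the input directory using the Hadoop FileSystem API rather than shelling out to "hdfs dfs -mv". Whether file age is a safe proxy for "already processed" is exactly the part I'm unsure about.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-ingest").getOrCreate()

// Streaming file sources need an explicit schema up front.
val schema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)
  .add("value", DoubleType)

// Ingest: the checkpoint directory records which input files have been
// processed, so a restarted query should not read them a second time.
val query = spark.readStream
  .schema(schema)
  .csv("/data/csv-in")
  .writeStream
  .format("parquet")
  .option("path", "/data/parquet-out")
  .option("checkpointLocation", "/data/checkpoints/csv-ingest")
  .start()

// Archival pass (in practice run as a separate periodic job): move input
// files older than the retention window out of the source directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 6 * 60 * 60 * 1000L  // 6 hours
fs.listStatus(new Path("/data/csv-in"))
  .filter(f => f.isFile && f.getModificationTime < cutoff)
  .foreach(f => fs.rename(f.getPath, new Path("/data/archive", f.getPath.getName)))
{code}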

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into Parquet. If the JSON files are not deleted after they are processed, they quickly fill up my hard drive. I originally [posted this on Stack Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to make a feature request for it.


