Posted to issues@spark.apache.org by "Vladimir Matveev (JIRA)" <ji...@apache.org> on 2019/05/22 17:26:00 UTC

[jira] [Created] (SPARK-27808) Ability to ignore existing files for structured streaming

Vladimir Matveev created SPARK-27808:
----------------------------------------

             Summary: Ability to ignore existing files for structured streaming
                 Key: SPARK-27808
                 URL: https://issues.apache.org/jira/browse/SPARK-27808
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.4.3, 2.3.3
            Reporter: Vladimir Matveev


Currently there is no easy way to make a structured streaming query ignore all data that already exists in a directory and process only files created after the job was started. See, for example: [https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset]

 

My use case is to ignore everything that exists in the directory when the streaming job is first started (that is, when there are no checkpoints), but to behave as usual when the stream is restarted, i.e. to catch up on new files that arrived since the last run. This would allow us to use the streaming job for continuous processing, with all the benefits it brings, while keeping the possibility to reprocess the data in batch fashion with a different job, then drop the checkpoints and have the streaming job run only on the new data.
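
To make the requested semantics concrete, here is a minimal sketch in plain Python. It does not use Spark and is not how the file source is implemented; `FileEntry`, `Checkpoint` and `select_files` are hypothetical names introduced only for illustration, and the "checkpoint" is simply the set of paths already handled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class FileEntry:
    """One file found when listing the source directory (hypothetical type)."""
    path: str
    modified_ms: int  # modification time, milliseconds since the epoch


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical stand-in for the stream's checkpoint: paths already handled."""
    seen: FrozenSet[str]


def select_files(
    listing: List[FileEntry],
    checkpoint: Optional[Checkpoint],
) -> Tuple[List[FileEntry], Checkpoint]:
    """Return the files to process now, plus the updated checkpoint."""
    if checkpoint is None:
        # First start (no checkpoint): mark everything that already exists
        # as handled, so that none of it is ever processed by the stream.
        return [], Checkpoint(seen=frozenset(f.path for f in listing))

    # Restart or regular trigger: process every file not seen so far,
    # oldest first, which is the usual "catch up" behaviour.
    new_files = sorted(
        (f for f in listing if f.path not in checkpoint.seen),
        key=lambda f: (f.modified_ms, f.path),
    )
    updated = Checkpoint(seen=checkpoint.seen | {f.path for f in new_files})
    return new_files, updated
```

With this sketch, dropping the checkpoint and starting again yields an empty first batch, which is the behaviour the batch-reprocessing workflow above relies on.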

 

It would be great to have an option similar to the `newFilesOnly` option on the original [StreamingContext.fileStream|https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext@fileStream[K,V,F%3C:org.apache.hadoop.mapreduce.InputFormat[K,V]](directory:String,filter:org.apache.hadoop.fs.Path=%3EBoolean,newFilesOnly:Boolean)(implicitevidence$7:scala.reflect.ClassTag[K],implicitevidence$8:scala.reflect.ClassTag[V],implicitevidence$9:scala.reflect.ClassTag[F]):org.apache.spark.streaming.dstream.InputDStream[(K,V)]] method, but probably with the slightly different semantics described above (ignore all existing files on the first run, catch up on subsequent runs).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org