Posted to dev@spark.apache.org by Georg Heiler <ge...@gmail.com> on 2019/09/20 14:25:50 UTC

custom FileStreamSource which reads from one partition onwards

Hi,

To the best of my knowledge, the existing FileStreamSource reads all the files
in a directory (here, a Hive table).
However, I need to be able to specify an initial partition it should start
from (i.e. like a Kafka offset / an initial warmed-up state) and then only read
data which is semantically (i.e. by file path, lexicographically) greater than
the minimum committed initial state.
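To illustrate what I mean, the filtering semantics I have in mind boil down to a
plain lexicographic comparison on partition paths. A standalone sketch (names
and helper are mine, this is not Spark API):

```scala
// Hypothetical sketch of the intended semantics: given all file paths the
// source would normally pick up, keep only those whose partition path sorts
// lexicographically after the last committed partition.
object PartitionFilter {
  def filesAfter(allFiles: Seq[String], lastCommittedPartition: String): Seq[String] =
    allFiles
      .filter(_ > lastCommittedPartition) // lexicographic String comparison
      .sorted
}
```

So with e.g. a last committed partition of `/tbl/dt=2019-09-15`, a file under
`/tbl/dt=2019-09-01/` would be skipped and one under `/tbl/dt=2019-09-20/`
would be read.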

After playing around with the internals of the file source's metadata format I
have come to the conclusion that manually modifying it and setting some values
(i.e. the last processed & committed partition) is not feasible, as Spark will
pick up all the files regardless (even the older partitions):
https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards

Is this correct?

This leads me to the conclusion that I need a custom StatefulFileStreamSource.
I tried to create one
(https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1), but so far
I fail to instantiate it, even though it is just a copy of the original one, as
the constructor without any arguments does not seem to be defined:

NoSuchMethodException:
org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()

Why is the default constructor not found, even for a simple copy of an
existing (and presumably working) class?
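If my understanding is right, Spark resolves the class passed to `format(...)`
reflectively through a zero-argument constructor and treats it as a *provider*,
not as a source, which would explain the missing `<init>()`. So presumably the
custom source has to be wrapped in a StreamSourceProvider along these lines
(untested sketch against the 2.2 APIs; all names are mine):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.StructType

// Untested sketch: a provider with a no-arg constructor, which appears to be
// what the reflective lookup requires. It delegates to the actual source.
class StatefulFileStreamProvider extends StreamSourceProvider {

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("statefulFileStream", schema.getOrElse(new StructType()))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = {
    // Here one would construct the custom StatefulFileStreamSource with the
    // constructor arguments it actually needs (path, schema, metadataPath,
    // options, ...).
    ???
  }
}
```

The stream would then be created with
`.format(classOf[StatefulFileStreamProvider].getName)` instead of referencing
the source class directly.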

Note: I am currently working with Spark 2.2.3.

Best,
Georg