You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by vinay patil <vi...@gmail.com> on 2017/04/26 13:10:59 UTC

Queries regarding Historical Reprocessing

Hi Guys,

For historical reprocessing , I am reading the avro data from S3 and passing
these records to the same pipeline for processing. 

I have the following queries: 

1. I am running this pipeline as a stream application with checkpointing
enabled, the records are successfully written to S3, however they remain in
the pending state as checkpointing is not triggered when I doing
re-processing. Why does this happen ? (kept the checkpointing interval to 1
minute, pipeline ran for 10 minutes)
this is the code I am using for reading avro data from S3

/AvroInputFormat<SomeAvroClass> avroInputFormat = new AvroInputFormat<>(
                    new org.apache.flink.core.fs.Path(s3Path),
SomeAvroClass.class);

sourceStream = env.createInput(avroInputFormat).map(...);
/

2. For the source stream Flink sets the parallelism as 1 , and for the rest
of the operators the user specified parallelism is set. How does Flink reads
the data ? does it bring the entire file from S3 one at a time  and then
Split it according to parallelism ?

3. I am reading from two different S3 folders and treating them as separate
sourceStreams, how does Flink reads data in this case ? does it pick one
file from each S3 folder , split the data and pass it downstream ? Does
Flink reads the data sequentially ? I am confused here as only one Task
Manager is reading the data from S3 and then all TM's are getting the data.

4. Although I am running this as as stream application, the operators goes
into FINISHED state after processing , is this because Flink treats the S3
source as finite data ? What will happen if the data is continuously written
to S3 from one pipeline and from the second pipeline I am doing historical
re-processing ?

Regards,
Vinay Patil



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Queries-regarding-Historical-Reprocessing-tp12833.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.