You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Meghajit Mazumdar <me...@gojek.com> on 2022/01/24 16:32:20 UTC

Watermarking with FileSource

I had a doubt regarding watermarking w.r.t streaming with FileSource. Will
really appreciate it if somebody can explain this behavior.

Consider a filesystem with a root folder containing date wise sub folders
such as  *D1* , *D2*, … and so on. Each of these date folders further has
24 sub-folders inside corresponding to data generated for each hour,
i:e, *hr=0,
hr=1, hr=2*,… and so on. Each hourly folder has only one file inside it.
So there will be 24 files total. The file structure will look something
like this:


   - - *ROOT FOLDER*


   - - D1


   - - hr=0
   - - somefile.txt
   - - hr=1
   - - somefile.txt
   - … (more hour folders)


   - - D2


   - - hr=0
   - - somefile.txt
   - - hr=1
   - - somefile.txt
   - … (more hour folders)


   - …(more date folders)



Let’s assume we are running a Flink job  with a File Source and parallelism
of 15. Let’s say, one task manager has one slot. Also, let’s assume we want
one split per file (NonSplittingRecursiveEnumerator is used)
Also, let’s assume a case where each task manager picks up one file split
in the beginning. For the very sake of simplicity, let’s also assume the
splits are generated and picked in chronological order. So files from *hr=0*
 till *hr=14* will be picked up respectively by the 15 task managers. Total
15 files are picked for processing.

Now, even if all the task managers finish reading their files at the same
or different time, the next file split, i:e *hr=15*  can be picked up by
any task manager.
Unless otherwise the task manager which also processed *hr=14* file picks
this *hr=15* file as well, rows from this file will always be dropped by
other task managers unless the watermark interval is really huge, like 1
hour+.

Am I thinking about this correctly ? Is the solution then to keep a
really big watermark interval for a FileSource  ? Or is there an idiomatic
pattern to solve these kinds of problems with FileSource ?
-- 
*Regards,*
*Meghajit*