Posted to users@nifi.apache.org by Paul Kelly <pk...@gmail.com> on 2021/04/07 13:24:04 UTC

ListS3 and ListGCSBucket missing objects

Hello,

I've noticed instances where ListS3 and ListGCSBucket never list certain
objects until after stopping the processor, clearing its state, and
restarting it.  They're usually large files in buckets that have frequent
writes.

Based on some testing, I believe S3 and GCS set an object's last modified
timestamp to the time the upload started rather than when it completed.
Any smaller object that starts and finishes uploading while a larger
object is still in flight will therefore end up with a newer last modified
timestamp than the larger object.  If the List processor triggers after
the smaller object finishes but before the larger one does, it sees the
small object, emits a flow file for it, and advances its stored state to
the smaller but newer object's timestamp.  Once the larger object finishes
uploading, its timestamp is older than the stored state, so it is ignored
and never listed during subsequent executions of the List processor.
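
To make the race concrete, here is a minimal sketch of how tracking only
the newest emitted timestamp loses the larger object.  This is not NiFi's
actual implementation, and the object names and timestamps are made up:

    import java.time.Instant;
    import java.util.List;

    // Minimal model of a List processor that persists only the newest
    // last-modified timestamp it has emitted.  Not NiFi's actual code.
    public class TimestampListingSketch {

        record StoredObject(String key, Instant lastModified) {}

        static Instant list(List<StoredObject> bucket, Instant lastSeen) {
            Instant newest = lastSeen;
            for (StoredObject o : bucket) {
                if (o.lastModified().isAfter(lastSeen)) {
                    System.out.println("emit " + o.key());
                    if (o.lastModified().isAfter(newest)) {
                        newest = o.lastModified();
                    }
                } else {
                    System.out.println("skip " + o.key());
                }
            }
            return newest;  // persisted as processor state
        }

        public static void main(String[] args) {
            Instant state = Instant.EPOCH;

            // Run 1: big.bin started uploading at 13:00 but is not visible
            // yet; small.txt started and finished after it, at 13:05.
            state = list(List.of(
                new StoredObject("small.txt", Instant.parse("2021-04-07T13:05:00Z"))),
                state);  // emits small.txt, state advances to 13:05

            // Run 2: big.bin is now visible, but its last-modified timestamp
            // (13:00, its upload start) is older than the stored state, so
            // it is skipped here and on every later run.
            state = list(List.of(
                new StoredObject("small.txt", Instant.parse("2021-04-07T13:05:00Z")),
                new StoredObject("big.bin", Instant.parse("2021-04-07T13:00:00Z"))),
                state);  // skips both; big.bin is never emitted
        }
    }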

The ListAzureBlobStorage processor offers a listing strategy that tracks
entities, but the ListS3 and ListGCSBucket processors do not, so they
appear to rely solely on last modified timestamps.  I tried setting up a
second ListS3 processor with a different run schedule and minimum object
age settings, and while that helps, some objects are still getting missed.
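
The best guard I can think of is an age window: only treat an object as
listable once its last modified timestamp is older than the longest upload
you expect, which is roughly what the second processor's object age
setting approximates.  A sketch, where the 30-minute bound is a made-up
value:

    import java.time.Duration;
    import java.time.Instant;

    // Age-window guard: skip anything modified within the last MAX_UPLOAD,
    // assuming no upload takes longer than that to finish.  The bound is
    // hypothetical and would need tuning per bucket.
    public class AgeWindowGuard {
        static final Duration MAX_UPLOAD = Duration.ofMinutes(30);

        static boolean listable(Instant lastModified, Instant now) {
            // Anything newer than the window may be a still-uploading large
            // object whose timestamp already trails smaller, finished ones.
            return lastModified.isBefore(now.minus(MAX_UPLOAD));
        }

        public static void main(String[] args) {
            Instant now = Instant.parse("2021-04-07T13:30:00Z");
            System.out.println(listable(Instant.parse("2021-04-07T12:00:00Z"), now)); // true
            System.out.println(listable(Instant.parse("2021-04-07T13:20:00Z"), now)); // false
        }
    }

That only holds if upload durations are reliably bounded, which is
presumably why my two-processor setup still misses things.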

Has anyone else run into this?  Is there a feasible workaround?

Thank you,
Paul