You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by adamlamar <gi...@git.apache.org> on 2017/12/24 03:37:26 UTC

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

GitHub user adamlamar opened a pull request:

    https://github.com/apache/nifi/pull/2361

    NIFI-4715: ListS3 produces duplicates in frequently updated buckets

    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [Y] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [Y] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [Y] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [Y] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [Y] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    - [N] Have you written or updated unit tests to verify your changes?
    - [NA] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [NA] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
    - [NA] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
    - [NA] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [NA] Have you ensured that format looks appropriate for the output in which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adamlamar/nifi NIFI-4715

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2361
    
----
commit bcaa84bba251c3a70c27a466185f6b20863eab93
Author: Adam Lamar <ad...@...>
Date:   2017-12-24T03:29:02Z

    NIFI-4715: ListS3 produces duplicates in frequently updated buckets

----


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    Hi @adamlamar , I hope this message finds you well. Since some other users asked about this issue, I went ahead and took over the remaining concerns around updating `currentKeys` during list loop. And submitted another PR #3116 based on your commits.
    
    When it gets merged, this PR will be closed automatically. If you have any comments, please keep discussing on the new PR. Thanks again for your contribution!


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @ijokarumawak I've made some progress, but unfortunately just trying to find time! No tech questions at this point, but thanks for checking in.


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r159112711
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -264,18 +265,19 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 }
                 bucketLister.setNextMarker();
     
    +            totalListCount += listCount;
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    +
    +        // Update stateManger with the most recent timestamp
             currentTimestamp = maxTimestamp;
    +        persistState(context);
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
    -        if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
    --- End diff --
    
    Note that this `commit` isn't required, since the last part of the main do/while loop already does a `commit`. Further, it sets `listCount` to zero, so this branch would always be taken.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @adamlamar How long do you think you need to address multiple bugs you are aware of? If those can be addressed in the same ListS3 processor, then I'd prefer to have all (as much as we can) in this JIRA/PR, as we can reduce testing effort and overall review cycle. If it's too complicated to be done at once, then please submit different JIRAs to beak those into smaller pieces. Thank you!


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158740756
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
                 getLogger().debug("No new objects in S3 bucket {} to list. Yielding.", new Object[]{bucket});
                 context.yield();
             }
    +
    +        // Persist all state, including any currentKeys
    +        persistState(context);
    --- End diff --
    
    @ijokarumawak Its important to `persistState` when `currentKeys` has been modified, even if the `currentTimestamp` hasn't been modified, to avoid producing duplicates when multiple files are listed during the same millisecond. This is a related but distinct issue. Its a rare condition though.
    
    You're correct that this change will cause more load on the state manager. How about setting a flag like `dirtyState` that would avoid calling `setState` if it has not been modified?


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158593055
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
    --- End diff --
    
    Since `currentTimestamp` is never overwritten by a `maxTimestamp` with a value of zero, this check shouldn't be necessary anymore.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @ijokarumawak Roger, I will expand the scope of this JIRA to include those other fixes. Should be doable in a single JIRA, was just unsure how you all prefer to move forward. Thanks again for all your help - will likely be several days before I'm able to push new commits.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @adamlamar I tried to find the API documentation but couldn't find the exact statement. Thanks, now it does makes sense to defer updating `currentTimestamp` after all listed entities are examined.
    
    Actually that makes me want to ask another related question. Do we risk making duplication by updating `currentKeys` in the middle of the loop?
    
    ## Simulation
    
    For example, ListS3 listed following entities at the 1st onTrigger. 
    
    ### The 1st onTrigger simulation at t1
    
    This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    
    ### The 2nd onTrigger simulation at t2
    
    If S3 is being updated at the same time, there may be additional entities having the same t1 timestamp, but were not listed at the 1st onTrigger. In such case, those will be listed at the next onTrigger. Following table shows the expected effect of tracking `currentKeys`. `b1.txt` will not be listed at t2 because it's already listed at t1 and kept in `currentKeys`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    ### The 2nd onTrigger simulation at t2 with newer timestamp entry preceding in lexicographical order
    
    Based on the above scenario, let's think about an edge case. New entries having later lastModified timestamp can be added at the same time at the time of the 2nd onTrigger ran. This might break the current implementation that updates `currentKeys` in the middle of the loop because entities are returned in lexicographical order. What if there was `a2.txt` having later timestamp than t1?
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |a2.txt|t2|t2|
    |b1.txt|t1|t1 and *t2*|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    With current implementation, `b1.txt` would be listed again at t2.
    
    ## Suggestion
    
    - Like `maxTimestamp` (representing the latest timestamp at current onTrigger) and `currentTimestamp` (representing the latest timestamp at the last onTrigger), use separate variables to track the keys having the latest timestamp at the last run and current run.
    - Probably renaming variables would make code more readable. 
    - Update only `maxTimestamp` and the keys with the latest timestamp of current iteration inside the loop, leave the variables which tracks the previous onTrigger state as it is. Then, after the loop, update the variables to track `previous` onTrigger state.
    
    Above approach would reflect background better, and also provide cleaner easily understandable code. I may be overly concerning details, but am feeling this can be better a bit more. Thanks for your patience!


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    Hi @ijokarumawak, I'm sorry I wasn't able to take this across the finish line. Thanks a bunch for continuing the effort! 


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by jvwing <gi...@git.apache.org>.
Github user jvwing commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    Thanks, @adamlamar!


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r159111452
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
                 getLogger().debug("No new objects in S3 bucket {} to list. Yielding.", new Object[]{bucket});
                 context.yield();
             }
    +
    +        // Persist all state, including any currentKeys
    +        persistState(context);
    --- End diff --
    
    @ijokarumawak I started writing an example, but then realized you are correct - there is no need to manually call `persistState` because any addition to `currentKeys` will also increment `listCount`, and the normal update mechanism will take over from there. We shouldn't need a `dirtyState` flag.


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158593061
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
                 getLogger().debug("No new objects in S3 bucket {} to list. Yielding.", new Object[]{bucket});
                 context.yield();
             }
    +
    +        // Persist all state, including any currentKeys
    +        persistState(context);
    --- End diff --
    
    Both exit paths already perform `persistState` - this just makes that more clear.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @ijokarumawak From the <a href="https://docs.aws.amazon.com/AmazonS3/latest/API/v2-RESTBucketGET.html#v2-RESTBucketGET-requests">AWS S3 API documentation</a> (see the `continuation-token` section):
    
    > Amazon S3 lists objects in UTF-8 character encoding in lexicographical order
    
    I really wish we could take the approach you suggested (would certainly make things easier), but since the entries are in lexicographical/alphabetical order, we must iterate over the entire listing before updating `currentTimestamp`. Otherwise we risk skipping keys newer than `currentTimestamp` but older than keys in the middle of the list. The lexicographical ordering also matches my experience when using the API.
    
    Unfortunately this does also mean that duplicates are possible when a listing fails, like the `IOException` scenario you mentioned. This is an existing limitation in ListS3.
    
    I appreciate your help getting this reviewed! :)


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/2361


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158753087
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
                 getLogger().debug("No new objects in S3 bucket {} to list. Yielding.", new Object[]{bucket});
                 context.yield();
             }
    +
    +        // Persist all state, including any currentKeys
    +        persistState(context);
    --- End diff --
    
    @adamlamar Well, if I'm not overlooking anything, `currentKeys` is only modified in `onTrigger` method when only new entry is found. Would you show me an example? It still confounds me. If we can set a flag like `dirtyState`, then the condition should be clarify though. Thanks!


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158674300
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
             if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
                 getLogger().debug("No new objects in S3 bucket {} to list. Yielding.", new Object[]{bucket});
                 context.yield();
             }
    +
    +        // Persist all state, including any currentKeys
    +        persistState(context);
    --- End diff --
    
    Do we still need this? Isn't updating state within commit() enough? We should minimize the number of status updates as some state storage is not designed for frequently updates, e.g. Zookeeper. I think if the processor didn't find any new file to list, then it does not have to update state, does it?
    
    I might be missing something as the original reason is not clear to me, to call persistState() when there was nothing to commit.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    Thank you, @adamlamar I understand that.


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r159117995
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -264,18 +265,19 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 }
                 bucketLister.setNextMarker();
     
    +            totalListCount += listCount;
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    +
    +        // Update stateManger with the most recent timestamp
             currentTimestamp = maxTimestamp;
    +        persistState(context);
    --- End diff --
    
    These two lines of code can be embedded in `commit` method.
    ```
             // Update stateManger with the most recent timestamp
             currentTimestamp = maxTimestamp;
             persistState(context);
    ```


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r159118001
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -264,18 +265,19 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 }
                 bucketLister.setNextMarker();
     
    +            totalListCount += listCount;
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    +
    +        // Update stateManger with the most recent timestamp
             currentTimestamp = maxTimestamp;
    +        persistState(context);
     
             final long listMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
             getLogger().info("Successfully listed S3 bucket {} in {} millis", new Object[]{bucket, listMillis});
     
    -        if (!commit(context, session, listCount)) {
    -            if (currentTimestamp > 0) {
    -                persistState(context);
    -            }
    +        if (totalListCount == 0) {
    --- End diff --
    
    Good catch!


---

[GitHub] nifi pull request #2361: NIFI-4715: ListS3 produces duplicates in frequently...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2361#discussion_r158593053
  
    --- Diff: nifi-nar-bundles/nifi-aws-bundle/nifi-aws-processors/src/main/java/org/apache/nifi/processors/aws/s3/ListS3.java ---
    @@ -267,26 +267,28 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
                 commit(context, session, listCount);
                 listCount = 0;
             } while (bucketLister.isTruncated());
    -        currentTimestamp = maxTimestamp;
    +
    +        if (maxTimestamp > currentTimestamp) {
    +            currentTimestamp = maxTimestamp;
    +        }
    --- End diff --
    
    `maxTimestamp` should always be greater than `currentTimestamp`, but this adds a sanity check.


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    > Do we risk making duplication by updating currentKeys in the middle of the loop?
    
    Yes, I think we do! I identified a similar (possibly the same) bug, and I agree with all of your suggestions. The question in my mind is whether we should fix all of these issues in this JIRA or defer to another. As far as the original JIRA goes, I believe the current commit will address the issue. I also did a fair bit of manual testing so I would be comfortable moving forward with this change as-is.
    
    Before refactoring, I'd like to put some additional unit tests in place for safety. Its clear from the discussion that there is some meat here and I'd really like to enumerate a few cases I've seen while testing in unit tests.
    
    So its up to you - would you prefer that I start the unit tests and address (potentially) multiple bugs in this PR, or should we merge this and create another JIRA?


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by adamlamar <gi...@git.apache.org>.
Github user adamlamar commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @ijokarumawak I did as you suggested and pulled `persistState` out in the case when no new keys have been listed, but this actually caused unit tests to fail. This is because `currentTimestamp` never changes during the main loop, so even though `commit` calls `persistState`, the value of `currentTimestamp` doesn't change until the main loop exits. Which is why `persistState` is required in both exit paths.
    
    Instead, I took a slightly different approach with the change just pushed. Since `currentTimestamp` is the current value persisted to the state manager, `maxTimestamp` is the highest timestamp seen in the main loop, and `currentKeys` is tied to `maxTimestamp` (not `currentTimestamp`), I removed the `persistState` call in `commit`, and did `persistState` at the end of `onTrigger` only. While this does continue to `persistState` on each exit, it reduces the number of `persistState` calls to once per `onTrigger` rather than once per 1000 keys iterated (which was done previously in `commit`).
    
    I did a bunch of manual testing with concurrent `PutS3Object` and `ListS3` and always got the correct number of listed keys, even when uploading 20k+ objects using 10 threads. I tried a few strategies to skip `persistState` if nothing had changed, but in manual testing it always produced the wrong number of keys, although sometimes only off by 1. The current code should be quite an improvement to the load on the state manager, even if it isn't ideal.
    
    I also introduced `totalListCount` which helps tighten up the log messages a bit. Previously it would "successfully list X objects" followed by "no new objects to list" in a single `onTrigger` (this was apparent in the unit test output). `totalListCount` also avoids an unnecessary `yield`.
    
    There's a lot going on in this one - let me know if you have any other questions!


---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Posted by ijokarumawak <gi...@git.apache.org>.
Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @adamlamar How is it going? Looking forward to review the updated PR. Just wanted to check if you have any issues. Thanks!


---