Posted to issues@nifi.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/04 12:57:00 UTC

[jira] [Commented] (NIFI-3332) Bug in ListXXX causes matching timestamps to be ignored on later runs

    [ https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073631#comment-16073631 ] 

ASF GitHub Bot commented on NIFI-3332:
--------------------------------------

GitHub user ijokarumawak opened a pull request:

    https://github.com/apache/nifi/pull/1975

    NIFI-3332: ListXXX to not miss files with the latest processed timestamp

    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [ ] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [x] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    - [x] Have you written or updated unit tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
    - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
    - [ ] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ijokarumawak/nifi nifi-3332

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/1975.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1975
    
----
commit 1e132be4c1cf1cc0debf37dc3dc2d38c93a94363
Author: Koji Kawamura <ij...@apache.org>
Date:   2017-06-14T06:21:01Z

    NIFI-4069: Make ListXXX work with timestamp precision in seconds or minutes
    
    - Refactored variable names to better represent what they are meant for.
    - Added deterministic logic which detects the target filesystem's timestamp precision and adjusts the lag time based on it.
    - Changed from using System.nanoTime() to System.currentTimeMillis() in tests because the Java File API reports timestamps in milliseconds at the finest granularity. Also, System.nanoTime() should not be mixed with epoch milliseconds because it uses an arbitrary origin and is measured differently.
    - Changed TestListFile to use longer intervals between the file timestamps used by testFilterAge to provide more consistent test results, because sleep time can be longer on filesystems whose timestamps have seconds precision.
    - Added logging to TestListFile.
    - Added TestWatcher to dump state when an assertion fails, for further investigation.
    - Added a Timestamp Precision property so that the user can set it if auto-detection is not enough.
    - Adjusted timestamps for the ages test.
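
As a rough illustration of the precision-detection bullet above, the sketch below guesses the coarsest time unit consistent with a set of observed epoch-millisecond timestamps and derives a lag from it. This is not the code from the commit: the class name, both method names, and the concrete lag values are assumptions made for this example only.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

public class TimestampPrecisionSketch {

    /**
     * Guesses the coarsest unit consistent with every observed timestamp.
     * Hypothetical heuristic: a non-zero millisecond part anywhere means
     * millisecond precision; otherwise a non-zero second part means seconds.
     */
    public static TimeUnit detectPrecision(List<Long> epochMillis) {
        if (epochMillis.isEmpty()) {
            // Nothing observed, so assume the finest precision.
            return TimeUnit.MILLISECONDS;
        }
        boolean allWholeSeconds = true;
        boolean allWholeMinutes = true;
        for (long t : epochMillis) {
            if (t % 1_000L != 0) {
                allWholeSeconds = false;
            }
            if (t % 60_000L != 0) {
                allWholeMinutes = false;
            }
        }
        if (!allWholeSeconds) {
            return TimeUnit.MILLISECONDS;
        }
        return allWholeMinutes ? TimeUnit.MINUTES : TimeUnit.SECONDS;
    }

    /** Assumed lag values: roughly one tick of the detected precision. */
    public static long lagMillis(TimeUnit precision) {
        switch (precision) {
            case MINUTES:
                return 60_000L;
            case SECONDS:
                return 1_000L;
            default:
                return 100L;
        }
    }
}
```

A heuristic like this can misjudge a small sample (a few millisecond-precision files may happen to land on whole seconds), which is presumably why the commit also adds a property for setting the precision explicitly.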

commit bbe4319150deaf6f92671cd2849a2c1dc6d36fa7
Author: Koji Kawamura <ij...@apache.org>
Date:   2017-07-04T08:34:31Z

    NIFI-3332: ListXXX to not miss files with the latest processed timestamp
    
    Before this fix, ListXXX processors could miss files that have the same timestamp as the latest processed timestamp of the previous cycle. Since only timestamps were used, it was not possible to determine whether such a file had already been processed.
    
    However, storing every single processed identifier, as we used to, would not perform well.
    Instead, this commit makes ListXXX store only the identifiers that have the latest timestamp in a cycle, to minimize the amount of state data to store.
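
The idea described in this commit message can be sketched as follows. This is a minimal illustration, not the actual AbstractListProcessor code: the class, the Entry type, and the method name are hypothetical, and state is held in plain fields rather than in NiFi's state manager.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LatestTimestampListing {

    /** Hypothetical stand-in for a listed entity: an identifier plus a last-modified time. */
    public static final class Entry {
        public final String id;
        public final long timestamp;

        public Entry(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // State kept between cycles: the latest timestamp seen so far and only
    // the identifiers that carried exactly that timestamp.
    private long latestTimestamp = -1L;
    private Set<String> idsAtLatestTimestamp = new HashSet<>();

    /** Returns the entries that were not emitted by an earlier cycle. */
    public List<Entry> listNew(List<Entry> current) {
        List<Entry> fresh = new ArrayList<>();
        long newLatest = latestTimestamp;
        for (Entry e : current) {
            boolean newer = e.timestamp > latestTimestamp;
            boolean sameButUnseen = e.timestamp == latestTimestamp
                    && !idsAtLatestTimestamp.contains(e.id);
            if (newer || sameButUnseen) {
                fresh.add(e);
                newLatest = Math.max(newLatest, e.timestamp);
            }
        }
        if (newLatest > latestTimestamp) {
            // The latest timestamp moved forward, so the old identifiers are no longer needed.
            idsAtLatestTimestamp = new HashSet<>();
            latestTimestamp = newLatest;
        }
        for (Entry e : fresh) {
            if (e.timestamp == latestTimestamp) {
                idsAtLatestTimestamp.add(e.id);
            }
        }
        return fresh;
    }
}
```

The stored state grows only with the number of entries sharing the single latest timestamp, rather than with the total number of entries ever listed.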

----


> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>         Attachments: listfiles.png, Test-showing-ListFile-timestamp-bug.log, Test-showing-ListFile-timestamp-bug.patch
>
>
> The new state implementation for the ListXXX processors based on AbstractListProcessor creates a race condition when processor runs occur while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a given timestamp.  Without the record of files processed, the remainder of the batch is ignored on the next processor run since its timestamp is not greater than the single timestamp stored in processor state.  With the file tracking it was possible to process files that matched the timestamp exactly and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives a batch of files with Tx timestamp, where Tx is later than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs, picking up 1st half of Tx batch, and stores Tx timestamp in processor state.
>   T3 - system writes 2nd half of Tx batch to the ListFile source directory.
>   T4 - ListFile runs, ignoring any files with T <= Tx, eliminating the 2nd half of the Tx timestamp batch.
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit test that demonstrates the problem and a log[2] of the output from one such run.  The test writes 3 files each in two batches with processor runs after each batch.  Batch 2 writes files with timestamps older than, equal to, and newer than the timestamp stored when batch 1 was processed, but only the newer file is picked up.  The older file is correctly ignored, but the file with the matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log
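
The T0 to T4 timeline in the issue description can be reproduced with a small stand-alone sketch. It is not the attached patch or the real ListFile code: the class and method names are hypothetical, and a map from file name to timestamp stands in for the source directory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimestampOnlyListingDemo {

    // The only state kept between runs: a single timestamp.
    private long storedTimestamp = -1L;

    /** Lists files whose timestamp is strictly greater than the stored one. */
    public List<String> run(Map<String, Long> directory) {
        List<String> picked = new ArrayList<>();
        long max = storedTimestamp;
        for (Map.Entry<String, Long> file : directory.entrySet()) {
            if (file.getValue() > storedTimestamp) {
                picked.add(file.getKey());
                max = Math.max(max, file.getValue());
            }
        }
        storedTimestamp = max;
        return picked;
    }

    public static void main(String[] args) {
        long tx = 1_000L; // T0: the batch timestamp Tx
        Map<String, Long> dir = new LinkedHashMap<>();
        TimestampOnlyListingDemo lister = new TimestampOnlyListingDemo();

        dir.put("a", tx); // T1: first half of the batch is written
        dir.put("b", tx);
        System.out.println(lister.run(dir)); // T2: prints [a, b]

        dir.put("c", tx); // T3: second half of the batch is written
        dir.put("d", tx);
        System.out.println(lister.run(dir)); // T4: prints [], so c and d are never listed
    }
}
```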



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)