You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/19 13:22:00 UTC

[jira] [Commented] (NIFI-4205) TailFile can produce duplicated data when it wrongly assumes a file is rotated

    [ https://issues.apache.org/jira/browse/NIFI-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093056#comment-16093056 ] 

ASF GitHub Bot commented on NIFI-4205:
--------------------------------------

GitHub user ijokarumawak opened a pull request:

    https://github.com/apache/nifi/pull/2021

    NIFI-4205: Avoid duplicated data from TailFile

    Before this fix, it is possible that TailFile to produce duplicated data
    if an already tailed file has newer timestamp and fewer or the same
    amout of data.
    
    Thank you for submitting a contribution to Apache NiFi.
    
    In order to streamline the review of the contribution we ask you
    to ensure the following steps have been taken:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [x] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [ ] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    - [ ] Have you written or updated unit tests to verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [ ] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
    - [ ] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
    - [ ] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [ ] Have you ensured that format looks appropriate for the output in which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ijokarumawak/nifi nifi-4205

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2021.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2021
    
----
commit b2f43036876139186f9387f9d9fddd6442433c8a
Author: Koji Kawamura <ij...@apache.org>
Date:   2017-07-19T13:18:17Z

    NIFI-4205: Avoid duplicated data from TailFile
    
    Before this fix, it is possible that TailFile to produce duplicated data
    if an already tailed file has newer timestamp and fewer or the same
    amout of data.

----


> TailFile can produce duplicated data when it wrongly assumes a file is rotated
> ------------------------------------------------------------------------------
>
>                 Key: NIFI-4205
>                 URL: https://issues.apache.org/jira/browse/NIFI-4205
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.1.0
>            Reporter: Koji Kawamura
>
> TailFile checks whether a file being tailed is rotated by following lines of code, and if it determines so, it resets the reader and state for the file:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L693
> {code}
>         // Check if file has rotated
>         if (rolloverOccurred
>                 || (timestamp <= file.lastModified() && length > file.length())
>                 || (timestamp < file.lastModified() && length >= file.length())) {
>             // Since file has rotated, we close the reader, create a new one, and then reset our state.
>             try {
>                 reader.close();
>                 getLogger().debug("Closed FileChannel {}", new Object[]{reader, reader});
>             } catch (final IOException ioe) {
>                 getLogger().warn("Failed to close reader for {} due to {}", new Object[]{file, ioe});
>             }
>             reader = createReader(file, 0L);
>             position = 0L;
>             checksum.reset();
>         }
> {code}
> The third condition, newer timestamp but the same file size can work negatively in some situations. For example:
> # If an already fully tailed file is 'touched' and last modified timestamp is updated. This is the easiest way to produce duplicated content.
> # On Windows, if a file is being tailed and updated by an app that writes logs or some data to it consistently and frequently, then the last modified timestamp can be delayed to be updated compared to file size. I couldn't find canonical docs for this behavior, but testing on Windows consistently produces duplicated data. And the 3rd condition becomes true when such duplication occurs.
> TailFile updates the file timestamp and length when it reads some data from the file, specifically at these lines:
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TailFile.java#L765
> {code}
>             timestamp = Math.max(state.getTimestamp(), file.lastModified());
>             length = file.length();
> {code}
> As mentioned at the 2nd case above, file.lastModified can return a stale timestamp (or maybe just not being flushed yet) while length is replaced by the latest value. After this happens, at the next onTrigger cycle, the 3rd condition becomes true because it detects a newer timestamp.
> These conditions are added by NIFI-1170 and NIFI-1959.
> A simple flow, TailFile -> SplitText -> (FlowFiles are queued) -> UpdateAttribute(Stopped) can reproduce this, with a command-line to simulate frequently updated log file:
> {code}
> $ for i in `seq 1 10000`; do echo $i >> test.log; done
> {code}
> The expected result is having 10000 generated FlowFiles queued at the relationship between SplitText and UpdateAttribute. But on Windows, more FlowFiles are generated.
> By enabling debug level log for TailFile, following log messages can be confirmed:
> {code}
> Add this to conf/logback.xml
> <logger name="org.apache.nifi.processors.standard.TailFile" level="DEBUG"/>
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-0
> 15d-1000-d6c2-c57a61e58a80] Recovering Rolled Off Files; total number of files rolled off = 0
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Closed FileChannel sun.nio.ch.FileChannelImpl@6d2a1eaf
> 2017-07-19 10:22:07,134 DEBUG [Timer-Driven Process Thread-3] o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Created FileChannel sun.nio.ch.FileChannelImpl@4aefddb3 for C:\logs\test.log
> 2017-07-19 10:22:07,150 DEBUG [Timer-Driven Process Thread-3] o.a.nifi.processors.standard.TailFile TailFile[id=59ef6ea7-015d-1000-d6c2-c57a61e58a80] Reading lines starting at position 0
> {code}
> The 3rd condition should be removed to avoid having these duplicated data ingested. Or if there's any specific need, we should discuss about it and implement additional solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)