Posted to commits@nifi.apache.org by "Andre (JIRA)" <ji...@apache.org> on 2015/10/28 05:04:27 UTC
[jira] [Comment Edited] (NIFI-994) Processor to tail files

    [ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977680#comment-14977680 ] 

Andre edited comment on NIFI-994 at 10/28/15 4:04 AM:
------------------------------------------------------

[~markap14]

Since version 1.7 (snapshot), Flume has had a [taildir source|https://issues.apache.org/jira/browse/FLUME-2498].

It currently keeps track of files with a JSON position sidecar file that records, for each tailed file, its path, its inode and the position the tail has reached:

{code}
[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]
{code}
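
As a rough illustration of how a record like this could be consumed, the Python sketch below parses the sidecar content and compares one entry with a file's stat data. The names {{load_positions}} and {{looks_unchanged}} and the comparison rule are assumptions made for illustration, not Flume's actual logic.

```python
import json


def load_positions(text):
    """Parse the sidecar content into a list of dicts (one per tailed file)."""
    return json.loads(text)


def looks_unchanged(entry, stat_result):
    """Naive check (an assumption, not Flume's rule): same inode, and the
    file is at least as long as the saved position."""
    return (stat_result.st_ino == entry["inode"]
            and stat_result.st_size >= entry["pos"])


if __name__ == "__main__":
    sample = '[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]'
    for entry in load_positions(sample):
        print(entry["file"], entry["inode"], entry["pos"])
```

In practice {{stat_result}} would come from {{os.stat(entry["file"])}}. A check of this kind only looks at metadata, which is why it cannot see a rewrite that keeps both inode and size.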

This is not foolproof: the process tends to miss changes that leave a file at exactly the same size.

For example, suppose the tail last read the file in the following state:
{code}
$ cat log.log
AAAA
{code}

Overwriting it with different content of the same length
{code}
$ echo BBBB > log.log 
{code}

would not trigger a new tail, because the file size is unchanged.
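
The blind spot can be reproduced with a few lines of Python. This is only a sketch: {{snapshot}} and {{overwrite_same_size_demo}} are hypothetical helpers, and the (inode, size) pair stands in for whatever metadata a position tracker compares.

```python
import os
import tempfile


def snapshot(path):
    """Return the (inode, size) pair a metadata-only tracker would compare."""
    st = os.stat(path)
    return st.st_ino, st.st_size


def overwrite_same_size_demo():
    """Write AAAA, overwrite with BBBB, and report metadata before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.log")
        with open(path, "w") as f:
            f.write("AAAA\n")
        before = snapshot(path)
        # Mode "w" truncates and rewrites in place, like `echo BBBB > log.log`.
        with open(path, "w") as f:
            f.write("BBBB\n")
        after = snapshot(path)
        with open(path) as f:
            content = f.read()
        return before, after, content


if __name__ == "__main__":
    before, after, content = overwrite_same_size_demo()
    print("metadata unchanged:", before == after)
    print("content now:", content.strip())
```

The content is different, yet the inode and size compare equal, so a tracker that only looks at those two values has nothing to react to.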

A more robust alternative would be to use checksums, as suggested by [~jskora]. Instead of checksumming the processed content, however, one would checksum a fixed number of bytes immediately preceding the saved seek position.

More or less like (apologies for my weird pseudo-code):
{code}
IF SEEK_POSITION >= 8 BYTES AND FILESIZE >= SEEK_POSITION
   lf = OPEN logfile
   SEEK lf AT SEEK_POSITION - 8 BYTES
   checksum = SHA256(READ 8 BYTES FROM lf)
   IF checksum != SAVED_CHECKSUM THEN treat the file as changed
{code}
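
A runnable version of the same idea, as a minimal Python sketch. It assumes the checksum is saved alongside the seek position (for example in the position file). The names {{window_checksum}} and {{can_resume}} are hypothetical, and the 8-byte window is taken from the pseudo-code above.

```python
import hashlib

WINDOW = 8  # bytes checked before the saved position, as in the pseudo-code


def window_checksum(path, seek_position, window=WINDOW):
    """SHA-256 of the `window` bytes ending at seek_position, or None if
    the position or the file is too short for a full window."""
    if seek_position < window:
        return None
    with open(path, "rb") as lf:
        lf.seek(seek_position - window)
        data = lf.read(window)
    if len(data) < window:  # file is now shorter than the saved position
        return None
    return hashlib.sha256(data).hexdigest()


def can_resume(path, seek_position, saved_checksum):
    """True only if the bytes just before the saved position still match."""
    current = window_checksum(path, seek_position)
    return current is not None and current == saved_checksum
```

A tailer would call {{window_checksum}} whenever it saves a position and {{can_resume}} before continuing from it. Files or positions shorter than the window get no protection in this sketch, which mirrors the size condition in the pseudo-code.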

What do you think?


was (Author: trixpan):
Identical to the comment above, except that the pseudo-code named the file handle {{if}} instead of {{lf}} on two of its lines.

> Processor to tail files
> -----------------------
>
>                 Key: NIFI-994
>                 URL: https://issues.apache.org/jira/browse/NIFI-994
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>             Fix For: 0.4.0
>
>         Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch
>
>
> It's a very common data ingest situation to want to input text into the system by "tailing" a file, most commonly log files. Currently we don't have an easy way to do this. 
> A simple processor to tail a file would benefit many users. There would need to be an option to not just tail a file but pick up where the processor left off if it is interrupted.


