Posted to commits@nifi.apache.org by "Andre (JIRA)" <ji...@apache.org> on 2015/10/28 05:04:27 UTC
[jira] [Comment Edited] (NIFI-994) Processor to tail files
[ https://issues.apache.org/jira/browse/NIFI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977680#comment-14977680 ]
Andre edited comment on NIFI-994 at 10/28/15 4:04 AM:
------------------------------------------------------
[~markap14]
Since version 1.7 (snapshot), Flume has had a [taildir source|https://issues.apache.org/jira/browse/FLUME-2498].
It currently keeps track of files using a JSON position sidecar file that records the log file path, the inode and the tail position for each file:
{code}
[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]
{code}
It is not foolproof: the process tends to miss changes that leave a file at exactly the same size.
For example, suppose the tail last read a file in the following state:
{code}
$ cat log.log
AAAA
{code}
Overwriting it with different content of the same length
{code}
$ echo BBBB > log.log
{code}
Would not trigger a new tail.
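To make the failure mode concrete, here is a minimal Java sketch. It assumes a detector that only compares the current file size with the saved position; {{SizeOnlyCheck}} and {{hasNewData}} are hypothetical names for illustration and are not taken from Flume or NiFi.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SizeOnlyCheck {
    /** Hypothetical detector: reports new data only when the file has grown past the saved position. */
    static boolean hasNewData(Path file, long savedPosition) throws IOException {
        return Files.size(file) > savedPosition;
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("log", ".log");
        Files.write(log, "AAAA\n".getBytes(StandardCharsets.US_ASCII));
        long savedPosition = Files.size(log); // the tail has consumed the whole file

        // Rough equivalent of `echo BBBB > log.log`: truncate and rewrite with the same length.
        Files.write(log, "BBBB\n".getBytes(StandardCharsets.US_ASCII));

        System.out.println(hasNewData(log, savedPosition)); // prints false
        Files.delete(log);
    }
}
```

The rewrite keeps the size at 5 bytes, so a size-only comparison sees nothing new even though every line in the file changed.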
A more robust alternative would be to use checksums, as suggested by [~jskora], but instead of checksumming the processed content, one would checksum a fixed number of bytes immediately preceding the saved seek position.
More or less like (apologies for my weird pseudo-code):
{code}
IF SEEK_POSITION >= 8 BYTES AND FILESIZE >= SEEK_POSITION
    lf = OPEN logfile
    SEEK lf AT SEEK_POSITION - 8 BYTES
    checksum = SHA256(READ 8 BYTES FROM lf)
    COMPARE checksum WITH THE ONE SAVED ALONGSIDE SEEK_POSITION
{code}
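A rough Java sketch of the same idea follows, in case it helps the discussion. The class and method names ({{TailChecksum}}, {{checksumBefore}}, {{canResume}}) are hypothetical, the 8-byte window simply mirrors the pseudo-code above, and treating a mismatch as "do not resume" is my assumption rather than settled behaviour.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class TailChecksum {
    // Window size taken from the 8 bytes in the pseudo-code; an assumption, not a tuned value.
    static final int WINDOW = 8;

    /** SHA-256 of the WINDOW bytes preceding seekPosition, or null if they are not available. */
    static byte[] checksumBefore(Path file, long seekPosition)
            throws IOException, NoSuchAlgorithmException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (seekPosition < WINDOW || raf.length() < seekPosition) {
                return null; // too little data consumed so far, or the file shrank
            }
            byte[] window = new byte[WINDOW];
            raf.seek(seekPosition - WINDOW);
            raf.readFully(window);
            return MessageDigest.getInstance("SHA-256").digest(window);
        }
    }

    /** True when the bytes before seekPosition still hash to the saved checksum. */
    static boolean canResume(Path file, long seekPosition, byte[] savedChecksum)
            throws IOException, NoSuchAlgorithmException {
        byte[] current = checksumBefore(file, seekPosition);
        return current != null && Arrays.equals(current, savedChecksum);
    }

    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("tail", ".log");
        Files.write(log, "AAAAAAAA\n".getBytes(StandardCharsets.US_ASCII));
        long pos = Files.size(log);
        byte[] saved = checksumBefore(log, pos);

        // Same size, different content: the checksum no longer matches.
        Files.write(log, "BBBBBBBB\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(canResume(log, pos, saved)); // prints false
        Files.delete(log);
    }
}
```

Only 8 bytes are read and hashed per check, so the cost stays constant regardless of file size. The trade-off is that a rewrite which happens to leave the same 8 bytes before the saved position would still go unnoticed.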
What do you think?
was (Author: trixpan):
[~markap14]
Flume has since version 1.7 (snapshot) a [taildir source|https://issues.apache.org/jira/browse/FLUME-2498].
The way they currently keep track of the files is using a position JSON sidecar file with content describing the log, inode and position of the tail against a file:
{code}
[{"inode":13209775,"pos":13771668368,"file":"/mnt/logs/logfilename.log"}]
{code}
It is not fault proof as the process tends to fail to detect changes to a file that result in the exact same size, e.g.:
So supposing the tail last queried a file with the following state:
{code}
$ cat log.log
AAAA
{code}
Updating it with similar content
{code}
$ echo BBBB > log.log
{code}
Would not trigger a new tail.
A more robust alternative would be to use checksums as suggested by [~jskora] but instead of checksumming the processed content, one would checksum a fixed number of bytes preceding the saved seek position.
More or less like (apologies for my weird pseudo-code):
{code}
IF SEEK_POSITION AND FILESIZE >= 8 BYTES
if = OPEN logfile
SEEK lf AT SEEK_POSITION - 8 BYTES
SHA256(READ 8 BYTES FROM if)
{code}
What do you think?
> Processor to tail files
> -----------------------
>
> Key: NIFI-994
> URL: https://issues.apache.org/jira/browse/NIFI-994
> Project: Apache NiFi
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Joseph Percivall
> Assignee: Mark Payne
> Fix For: 0.4.0
>
> Attachments: 0001-NIFI-994-Initial-import-of-TailFile.patch
>
>
> It's a very common data ingest situation to want to input text into the system by "tailing" a file, most commonly log files. Currently we don't have an easy way to do this.
> A simple processor to tail a file would benefit many users. There would need to be an option to not just tail a file but pick up where the processor left off if it is interrupted.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)