Posted to users@nifi.apache.org by Andre <an...@fucs.org> on 2015/11/15 06:39:37 UTC

TailFile: Wildcards and filenames

Hi there,

I am trying to push the boundaries of the TailFile processor and
noticed an interesting behavior:

First I configure the File to Tail to "/log_path/test1"

I then configure "Rolling Filename Pattern" to "*"

I then start the processor and generate data:

$ echo AAAA > test1
$ echo AAAA > test1
$ echo BBBB > test1
$ echo CCCC > test1

Up to this point everything goes by the book. All data is tailed. No losses.


I then test my workaround to NIFI-1170

$ echo DDDD > test2
$ echo EEEE > test3
$ echo FFFF > test4

Data is not ingested. So far so good, we are dealing with a hack after all. :-)

However to my surprise when I tested the following, NiFi failed to
identify the two lines being fed into files matching the Rolling
Filename Pattern Expression:

$ echo GGGG > test11
$ echo HHHH > test12

I then stopped the processor and restarted it. NiFi then proceeded to
ingest the data present in all files without duplication.

Is that the expected behaviour?

Cheers

Re: TailFile: Wildcards and filenames

Posted by Mark Payne <ma...@hotmail.com>.
Andre,

I love that you're digging in here and making sure that the quality is here! Thank you.

So in your example here, you said that you have Rolling Filename Pattern set to: *

So that should match any file in the directory (except that it won't count the actual file being tailed).
From the description you gave, it sounds as if you are expecting * to match anything starting with test1.
I.e., test1*. Instead, it will match anything in the directory.
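
To make the difference between the two patterns concrete, here is a minimal Python sketch. It uses the standard fnmatch module as a stand-in for the processor's wildcard matching, which is an assumption and not NiFi's actual code; the helper name rolled_candidates is hypothetical.

```python
from fnmatch import fnmatchcase

def rolled_candidates(filenames, pattern, tailed_name):
    """Return the names that match the wildcard pattern.

    The file being tailed is never counted as a rolled-over candidate.
    (Illustrative helper only; not part of NiFi.)
    """
    return [
        name for name in filenames
        if name != tailed_name and fnmatchcase(name, pattern)
    ]

files = ["test1", "test2", "test3", "test4", "test11", "test12"]

# "*" matches every other file in the directory.
print(rolled_candidates(files, "*", "test1"))

# "test1*" matches only names that start with test1.
print(rolled_candidates(files, "test1*", "test1"))
```

With "*", test2 through test4 are candidates as well as test11 and test12; with "test1*", only test11 and test12 are.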

So with that in mind, I do believe that what you are seeing is the expected behavior, as you are really
hitting some corner cases here.

In order to understand how this is functioning, we need to consider how some corner cases are handled.
First, the Processor doesn't scan for files that have rolled over each time it runs. It scans only when the
File to Tail has rolled over (i.e., when that file has been truncated). This is done because continually
scanning the directory for any new files would be very expensive and generally is not necessary for a
rolling file pattern. So if you wrote to test2, then test3, it would not notice them until test1 rolls over.
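
The "scan only on rollover" idea can be sketched roughly as follows. This is a simplified illustration and not the processor's implementation: it assumes truncation is detected by comparing the file's current size against the offset already read, and the names run_once and scan_directory are hypothetical.

```python
def run_once(position, current_size, scan_directory):
    """Simulate one scheduled run of a tailing loop.

    position       -- offset already consumed from the tailed file
    current_size   -- size of the tailed file right now
    scan_directory -- callable returning rolled-over file names

    Returns (new_position, rolled_files). The directory is scanned only
    when the file looks truncated, i.e. it is shorter than what was read.
    """
    rolled_files = []
    if current_size < position:
        # Expensive step, performed only when a rollover is suspected.
        rolled_files = scan_directory()
    # Assume everything up to current_size gets read during this run.
    return current_size, rolled_files

scan_calls = []

def fake_scan():
    # Stand-in for listing the directory; records that it was called.
    scan_calls.append(1)
    return ["test2", "test3"]

pos, rolled = run_once(0, 5, fake_scan)     # file created: no scan
pos, rolled = run_once(pos, 10, fake_scan)  # file grew: no scan
pos, rolled = run_once(pos, 3, fake_scan)   # file shrank: scan happens
```

In this sketch, files written between runs are only noticed on the third call, when the tailed file shrinks. A size-only check would miss a file overwritten with the same number of bytes, so treat it purely as an illustration of when the scan is triggered.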

Also, when the file is rolled over, it will look for other files that have rolled over, but it will ignore any file
whose Last Modified date is before that of the just-rolled-over file. So if you write to file test2, then test3, and
then append to test1, it will not pick up test2 and test3, as their timestamps come before that of the file
you are tailing. While this may seem erroneous given the test that you are providing here, it does work
well for true "rolling file" scenarios, which is what this Processor is aiming to address.
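
The Last Modified filter described above can be sketched like this. It is a hedged illustration under the stated rule only (ignore candidates older than the just-rolled-over file); the function name files_to_recover and the numeric timestamps are made up for the example.

```python
def files_to_recover(candidates, rolled_file_mtime):
    """Keep candidates whose modification time is not before the rolled file's.

    candidates        -- list of (name, mtime) pairs matching the pattern
    rolled_file_mtime -- Last Modified time of the just-rolled-over file
    """
    return [
        name for name, mtime in candidates
        if mtime >= rolled_file_mtime
    ]

# test2 and test3 were written first, then test1 was appended to at t=120.
older = [("test2", 100), ("test3", 110)]
print(files_to_recover(older, 120))

# A file modified after the tailed file's timestamp is still picked up.
mixed = [("test2", 100), ("test1.1", 130)]
print(files_to_recover(mixed, 120))
```

The first call returns an empty list, which mirrors the test2/test3 case: both files are older than the tailed file, so neither is considered.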

One thing that I am noticing, as I review this myself, is that if a file rolls over multiple times while the Processor
is running, it does not pick up all of the changes. I will be addressing this shortly. I created a ticket [1] for this.
This probably is okay for most use cases, but it could potentially miss some updates to the file if the file is written
at a high rate and the Processor is not scheduled to run very often.

Does all of this make sense? Anything that I'm missing?

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-1171

