Posted to dev@nifi.apache.org by Andre <an...@fucs.org> on 2016/07/07 06:03:45 UTC

NIFI-1170 (ie TailDir )

Hi there,

I was having a look at MiNiFi and while the agent is truly an amazing
addition to the NiFi family, playing with it brought back some terrible
memories from the past... :-)

In particular the way TailFile is only able to handle fully qualified file
paths, instead of adopting a wildcard approach.

This issue was raised as part of NIFI-1170, and now with MiNiFi going live I
was wondering: should we revisit this?

The use case is fairly simple and commonplace. Many (if not all) log
producers create dynamically named files following predefined naming
conventions such as:


$ find /tmp/log
/tmp/log
/tmp/log/host1-2016-05-04.log
/tmp/log/firewall1
/tmp/log/firewall1/2016-05-04.log
/tmp/log/host3-2016-05-04.log
/tmp/log/host2-2016-05-04.log
/tmp/log/router1
/tmp/log/router1/2016-05-04.log
/tmp/log/host1-2016-05-02.log.gz



For users, the main advantage of such a structure is that it avoids having
to rotate the logs (and HUP or truncate files). Instead, the log producer
dynamically opens a new file and starts writing straight to it, simplifying
the life of processes relying on those files.

The file matching strategy I had in mind is something simple where a single
capture group is used as a base for the filename.

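import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    // The single capture group holds the path relative to /tmp/log/.
    // find() is used below, so a name with a trailing suffix (e.g. ".gz")
    // still matches up to ".log".
    final static String filenameRegEx =
            "/tmp/log/((?:[^/]+)?(?:/)?(?:host\\d)?(?:-)?\\d+-\\d+-\\d+\\.log)";
    final static Pattern p = Pattern.compile(filenameRegEx);

    public static void main(String[] args) throws IOException {
        String directory = "/tmp/log/";

        final File tailDir = new File(directory);

        listEntry(tailDir.toPath());
    }

    // Walks the directory tree recursively and prints the captured part of
    // every regular file whose full path matches the pattern.
    public static void listEntry(Path path) throws IOException {
        try (final DirectoryStream<Path> dirStream = Files.newDirectoryStream(path)) {
            for (final Path entry : dirStream) {
                if (Files.isDirectory(entry)) {
                    listEntry(entry);
                } else if (Files.isRegularFile(entry)) {
                    Matcher m = p.matcher(entry.toString());
                    if (m.find()) {
                        System.out.println(m.group(1));
                    }
                }
            }
        }
    }
}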

Resulting in:

host1-2016-05-04.log
firewall1/2016-05-04.log
host3-2016-05-04.log
host2-2016-05-04.log
router1/2016-05-04.log
host1-2016-05-02.log

While the logic of file matching is not particularly concerning, I wonder:

1. Should we adjust TailFile or create another processor (e.g. TailDir)?

2. I would also be keen on hearing your suggestions on how to handle
parallel tailing of multiple files.
    From what I gather, TailFile will not tail the older file. Instead it
will either chomp it fully (in case no state exists) or seek to the last
known position, validate and then chomp the remaining data.
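
The "seek, validate, chomp the rest" behaviour described above could be sketched roughly as follows. This is not TailFile's actual code: the class and method names are hypothetical, and it assumes that "validate" means comparing a CRC32 checksum of the bytes already consumed with a stored checksum. If the stored state does not match, the file is read from the beginning.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only.
class ResumeSketch {

    // CRC32 of the first 'length' bytes of the file.
    static long checksum(Path file, long length) throws IOException {
        CRC32 crc = new CRC32();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[8192];
            long remaining = length;
            while (remaining > 0) {
                int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // file is shorter than expected
                }
                crc.update(buf, 0, n);
                remaining -= n;
            }
        }
        return crc.getValue();
    }

    // Returns the bytes not yet consumed. Falls back to reading the whole
    // file when there is no usable state or the checksum does not match.
    static byte[] readRemaining(Path file, long storedPosition, long storedChecksum)
            throws IOException {
        long size = Files.size(file);
        long start = 0;
        if (storedPosition > 0 && storedPosition <= size
                && checksum(file, storedPosition) == storedChecksum) {
            start = storedPosition;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(start);
            byte[] rest = new byte[(int) (size - start)];
            raf.readFully(rest);
            return rest;
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("tail-sketch", ".log");
        Files.write(log, "line1\n".getBytes(StandardCharsets.UTF_8));
        long position = Files.size(log);
        long sum = checksum(log, position);
        // Simulate the producer appending a second line.
        Files.write(log, "line1\nline2\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(new String(readRemaining(log, position, sum), StandardCharsets.UTF_8));
        Files.delete(log);
    }
}
```

For parallel tailing, one such position/checksum pair would presumably be kept per matched file.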



Cheers

Re: NIFI-1170 (ie TailDir )

Posted by Mark Payne <ma...@hotmail.com>.
Andre,

The state provider offers an arbitrary set of key/value pairs via the Map interface.
I'd recommend going with something like "file.0.name" => "/data/myfile.txt", "file.0.timestamp" => "147327302730", etc.

Is there something that I'm missing, so that this won't work?
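
A minimal sketch of that key scheme, assuming a hypothetical FileState record with name, position and timestamp fields (only "name" and "timestamp" appear in the suggestion above; "position" is added for illustration). It flattens several per-file states into one Map<String, String> and rebuilds them, with no dependency on NiFi itself. In a processor, the flattened map would presumably be what gets handed to the state manager.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of NiFi.
class FlatStateSketch {

    // Hypothetical per-file record.
    static final class FileState {
        final String name;
        final long position;
        final long timestamp;

        FileState(String name, long position, long timestamp) {
            this.name = name;
            this.position = position;
            this.timestamp = timestamp;
        }
    }

    // Flattens the list into keys such as "file.0.name", "file.0.timestamp".
    static Map<String, String> flatten(List<FileState> files) {
        Map<String, String> state = new LinkedHashMap<>();
        for (int i = 0; i < files.size(); i++) {
            FileState f = files.get(i);
            state.put("file." + i + ".name", f.name);
            state.put("file." + i + ".position", Long.toString(f.position));
            state.put("file." + i + ".timestamp", Long.toString(f.timestamp));
        }
        return state;
    }

    // Rebuilds the list by counting up indexes until a name key is missing.
    static List<FileState> restore(Map<String, String> state) {
        List<FileState> files = new ArrayList<>();
        for (int i = 0; state.containsKey("file." + i + ".name"); i++) {
            files.add(new FileState(
                    state.get("file." + i + ".name"),
                    Long.parseLong(state.get("file." + i + ".position")),
                    Long.parseLong(state.get("file." + i + ".timestamp"))));
        }
        return files;
    }

    public static void main(String[] args) {
        List<FileState> files = new ArrayList<>();
        files.add(new FileState("/data/myfile.txt", 1024L, 147327302730L));
        files.add(new FileState("/data/other.txt", 0L, 147327302731L));
        Map<String, String> state = flatten(files);
        state.forEach((k, v) -> System.out.println(k + " => " + v));
        System.out.println(restore(state).size() + " file states restored");
    }
}
```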

Thanks
-Mark



> On Aug 9, 2016, at 3:12 AM, Andre <an...@fucs.org> wrote:
> 
> Mark,
> 
> I am preparing to start working on NIFI-1170 again and I was wondering, is a second level of state space something that can be done?
> 
> "TailDir" (similar to Flume's tail processor) should be capable of holding the state of multiple files, however, I presently need to serialise, store, read and deserialise every time a file state is updated.
> 
> Would it be possible to extend the state beyond a single state per processor?
> 
> Cheers


Re: NIFI-1170 (ie TailDir )

Posted by Andre <an...@fucs.org>.
Mark,

I am preparing to start working on NIFI-1170 again and I was wondering, is
a second level of state space something that can be done?

"TailDir" (similar to Flume's tail processor) should be capable of holding
the state of multiple files, however, I presently need to serialise,
store, read and deserialise every time a file state is updated.

Would it be possible to extend the state beyond a single state per
processor?

Cheers

On Sat, Jul 9, 2016 at 12:01 AM, Bryan Bende <bb...@gmail.com> wrote:

> Andre,
>
> Currently each processor can only persist one state map. The reason for
> this is that behind the scenes it is storing the state under a key like
> "components/<processor-uuid>" to ensure that the state is only for that
> processor, and can't be messed with by other processors.
> I suppose we could still have a way for the state manager API to let a key
> be specified and allow for something like
> "components/<processor-uuid>/state1" and
> "components/<processor-uuid>/state2". Mark Payne would probably need to
> comment more on this idea.
>
> As far as serializing/deserializing though, I think it is only
> deserializing in an @OnScheduled method called recoverState... so while the
> processor is running it is continuously serializing the state so that if it
> ever crashes it can pick back up, but it only ever
> reads that state if the processor restarts (either manual stop/start, or
> crash and restart). Hope that helps.
>
> Also, I'm wondering if TailDir can end up handling both cases of tailing a
> single file, and also tailing everything in a directory. I don't know all
> the ins and outs, but it seems like tailing everything in a directory with
> some kind of filename filter might allow for tailing a single file as well,
> but I'm just thinking out loud here.
>
> -Bryan

Re: NIFI-1170 (ie TailDir )

Posted by Bryan Bende <bb...@gmail.com>.
Andre,

Currently each processor can only persist one state map. The reason for
this is that behind the scenes it is storing the state under a key like
"components/<processor-uuid>" to ensure that the state is only for that
processor, and can't be messed with by other processors.
I suppose we could still have a way for the state manager API to let a key
be specified and allow for something like
"components/<processor-uuid>/state1" and
"components/<processor-uuid>/state2". Mark Payne would probably need to
comment more on this idea.

As far as serializing/deserializing though, I think it is only
deserializing in an @OnScheduled method called recoverState... so while the
processor is running it is continuously serializing the state so that if it
ever crashes it can pick back up, but it only ever
reads that state if the processor restarts (either manual stop/start, or
crash and restart). Hope that helps.
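
The lifecycle described above (write the state on every update, read it back only once at start-up) might look roughly like this. The in-memory STORE map is only a stand-in for the real state provider, and every name in the sketch is hypothetical; the key format follows the "components/<processor-uuid>" convention mentioned earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the write-often, read-once pattern.
class StateLifecycleSketch {

    // Stand-in for the persistent store: one state map per component key.
    static final Map<String, Map<String, String>> STORE = new HashMap<>();

    private final String key;
    private long position;

    StateLifecycleSketch(String processorUuid) {
        this.key = "components/" + processorUuid;
    }

    // Called once when the processor starts (the role played by an
    // @OnScheduled recoverState method): the only place state is read.
    void recoverState() {
        Map<String, String> saved = STORE.get(key);
        position = (saved == null) ? 0L : Long.parseLong(saved.get("position"));
    }

    // Called after each chunk is consumed: state is written, never read.
    void advance(long bytesConsumed) {
        position += bytesConsumed;
        Map<String, String> snapshot = new HashMap<>();
        snapshot.put("position", Long.toString(position));
        STORE.put(key, snapshot);
    }

    long position() {
        return position;
    }

    public static void main(String[] args) {
        StateLifecycleSketch first = new StateLifecycleSketch("1234");
        first.recoverState();
        first.advance(100);
        first.advance(50);

        // Simulate a restart: a new instance with the same id picks up
        // where the previous one stopped.
        StateLifecycleSketch restarted = new StateLifecycleSketch("1234");
        restarted.recoverState();
        System.out.println("resumed at " + restarted.position());
    }
}
```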

Also, I'm wondering if TailDir can end up handling both cases of tailing a
single file, and also tailing everything in a directory. I don't know all
the ins and outs, but it seems like tailing everything in a directory with
some kind of filename filter might allow for tailing a single file as well,
but I'm just thinking out loud here.
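
To illustrate the idea, a small sketch (hypothetical names, plain java.util.regex, nothing NiFi-specific) in which the same filter mechanism selects either one file or many. The single-file case is simply a filter that matches one literal name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only.
class FilterSketch {

    // Keeps the relative paths that the filter matches in full.
    static List<String> select(List<String> relativePaths, Pattern filter) {
        return relativePaths.stream()
                .filter(p -> filter.matcher(p).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "host1-2016-05-04.log",
                "host2-2016-05-04.log",
                "router1/2016-05-04.log");

        // Single-file case: quote the name so it is matched literally.
        Pattern single = Pattern.compile(Pattern.quote("host1-2016-05-04.log"));
        // Directory case: any hostN-<date>.log file.
        Pattern many = Pattern.compile("host\\d-\\d+-\\d+-\\d+\\.log");

        System.out.println(select(paths, single));
        System.out.println(select(paths, many));
    }
}
```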

-Bryan


On Fri, Jul 8, 2016 at 7:49 AM, Andre <an...@fucs.org> wrote:

> all,
>
> I ended up forking TailFile and bolting together a frankenprototype of
> the processor here:
> (apologies for the spaghettiness of the code but the task was clearly
> beyond my league... :-D )
>
> https://github.com/trixpan/nifi/tree/NIFI-1170
>
> I am still going through the basics of it but I would like to reach out for
> feedback.
>
> Presently I am having to serialize and deserialize the state holding
> object, something that doesn't seem to be the most efficient way. So I was
> wondering:
>
> Can a processor store more than one state per context? If so, how?
>
> I thank you in advance
>

Re: NIFI-1170 (ie TailDir )

Posted by Andre <an...@fucs.org>.
all,

I ended up forking TailFile and bolting together a frankenprototype of
the processor here:
(apologies for the spaghettiness of the code but the task was clearly
beyond my league... :-D )

https://github.com/trixpan/nifi/tree/NIFI-1170

I am still going through the basics of it but I would like to reach out for
feedback.

Presently I am having to serialize and deserialize the state holding
object, something that doesn't seem to be the most efficient way. So I was
wondering:

Can a processor store more than one state per context? If so, how?

I thank you in advance