Posted to users@nifi.apache.org by James McMahon <js...@gmail.com> on 2020/04/14 11:56:45 UTC

How to get a complete listing of flowfiles in a queue?

I have an issue with a ListFile processor. It does not appear to be
consuming all the raw data files that show up throughout the day in a
landing directory. My count at the end of the day is less than the count of
all the files in the directory at the end of the day. I suspect it has to do
with the way ListFile has been configured (right now we only accept files
that are 30 minutes old or older), or with the fact that large numbers of
files can arrive at the same hh:mm, differentiated only by seconds or
milliseconds. Perhaps ListFile is recording its state only to the
hour-minute or hour-minute-second (I notice that all millisecond values in
the epoch time are 000 in View State), and so when ListFile runs in its
following cycle it overlooks all the other files that share hh:mm but are
later by some seconds or milliseconds in the file time? I'm
grasping for a logical cause at this point.

I want to do a comparison of what I have read in so far today against an
exhaustive list of today's directory. My intention is that such a
comparison should flag gaps, which then may lead me to a cause.
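
A minimal sketch of such a comparison, assuming the names already listed by
ListFile are available as a plain set of strings. The function names, the
now parameter, and the 30-minute default are illustrative, chosen to mirror
the age filter described above; nothing here is part of NiFi itself.

```python
import os
import time


def eligible_files(directory, min_age_seconds=30 * 60, now=None):
    """Return names of regular files in `directory` that are at least
    `min_age_seconds` old, judged by last-modified time."""
    now = time.time() if now is None else now
    names = set()
    for entry in os.scandir(directory):
        # Skip subdirectories and files that are still too young to be listed.
        if entry.is_file() and now - entry.stat().st_mtime >= min_age_seconds:
            names.add(entry.name)
    return names


def missing_from_flow(directory_names, listed_names):
    """Names present in the directory but absent from the listed set, sorted."""
    return sorted(set(directory_names) - set(listed_names))
```

Calling missing_from_flow(eligible_files(landing_dir), listed) would then give
the files that were old enough to be picked up but never appeared in the flow.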

I have routed the ListFile success path to a queue that retains its
flowfiles for 24 hours. I started it after all of yesterday's files had
stopped arriving (the point being that the queue only holds flowfiles from
today's directory). Right now it totals 16,231 flowfiles. The "read only"
directory on the Linux system has nearly 20,000 files in it. Looking at the
queue from the UI isn't quite what I require: it only lets me view 100
flowfiles, and I can't output the list.

Can I use the API or some other option to generate the complete list of
flowfiles in that queue? I hope to output a list that includes filename,
file.lastModifiedTime, and file.creationTime.
Thank you in advance for your help.

Re: How to get a complete listing of flowfiles in a queue?

Posted by Joe Witt <jo...@gmail.com>.
If you go to nifi.apache.org and click Documentation, one option is the wiki.

From there, as a user, you can select the example dataflow templates.

https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates


What the example there does can probably be done even more easily now.
Alternatively, you can just add a LogAttribute processor after ListFile and
grep through the logs.
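
As a rough sketch of the LogAttribute route: the snippet below pulls the
filename attribute out of log text. It assumes each attribute is logged as a
Key: '...' line followed by an indented Value: '...' line; check this against
your own nifi-app.log, since the layout may differ by version and settings.
The function name is hypothetical.

```python
import re

# Assumed layout of a logged attribute (verify against your own log):
#   Key: 'filename'
#       Value: 'data_001.csv'
FILENAME_PATTERN = re.compile(r"Key: 'filename'\s+Value: '([^']*)'")


def filenames_from_log(log_text):
    """Return the set of filename attribute values found in the log text."""
    return set(FILENAME_PATTERN.findall(log_text))
```

The resulting set can be compared against a directory listing to see which
files never reached LogAttribute.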

Many options for you here.

Thanks

On Tue, Apr 14, 2020 at 8:21 AM James McMahon <js...@gmail.com> wrote:

> Thank you Joe. This sounds promising and I will try to apply the example.
> Can you provide the link to what you refer to as our wiki? I'll search for
> the example.
>
> On Tue, Apr 14, 2020 at 8:08 AM Joe Witt <jo...@gmail.com> wrote:
>
>> James
>>
>> Using the provenance events from this processor is the best way.  Grab
>> all receive events for the time period of interest.
>>
>> You can do this in a few ways but one that works well is to send prov
>> events via reporting task, filter events for that component, write those
>> out to a file or set of files and review.  I think we have an example of
>> this on our wiki.
>>
>> Thanks

Re: How to get a complete listing of flowfiles in a queue?

Posted by James McMahon <js...@gmail.com>.
Thank you Joe. This sounds promising and I will try to apply the example.
Can you provide the link to what you refer to as our wiki? I'll search for
the example.


Re: How to get a complete listing of flowfiles in a queue?

Posted by Joe Witt <jo...@gmail.com>.
James

Using the provenance events from this processor is the best way.  Grab all
receive events for the time period of interest.

You can do this in a few ways, but one that works well is to send provenance
events via a reporting task, filter the events for that component, write
those out to a file or set of files, and review.  I think we have an example
of this on our wiki.
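
For the filtering step, a sketch along these lines may help once the events
have been written out as JSON. It assumes each event is an object with
componentId, eventType, and updatedAttributes fields; those field names, the
function names, and the one-array-per-file layout are assumptions to verify
against what your reporting task actually emits.

```python
import json


def filenames_for_component(events, component_id, event_types=None):
    """Collect filename attributes from events emitted by one component.

    `events` is an iterable of dicts. If `event_types` is None, every
    event type is accepted; otherwise only the given types are kept.
    """
    names = set()
    for event in events:
        if event.get("componentId") != component_id:
            continue
        if event_types is not None and event.get("eventType") not in event_types:
            continue
        # Assumed field: attributes set by the event, keyed by attribute name.
        filename = (event.get("updatedAttributes") or {}).get("filename")
        if filename:
            names.add(filename)
    return names


def load_events(path):
    """Read a file holding a single JSON array of events (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

The component id is the id of the ListFile processor, and the set returned
can be diffed against the directory listing to find the gaps.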

Thanks
