Posted to dev@nifi.apache.org by Roman <ra...@gmail.com> on 2017/05/31 16:25:29 UTC

processors ListFile/ListSFTP do not store milliseconds in timestamp

Hi there, I need help.

We are preparing a high-load project and tested these processors. We
always see the listing.timestamp and processed.timestamp keys without
milliseconds (xxxxxxxxxx000). As a result, if several files are
generated within one second, not all of them are listed.


Test:
1. Start the ListFile/ListSFTP processor.
2. Generate 10000 zero-size files. My command:
   for i in {1..10000}; do touch ./test_$i; done
3. Check the processor stats: out 3952 (0 bytes)


Am I doing something wrong, or is this a bug in NiFi, Java, etc.?

Environment

Ubuntu 14.04.5 LTS, x64, ext4 file system
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
NiFi 1.2.0 From 3a605af, Tagged nifi-1.2.0-RC2


Thanks
Roman



--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Roman <ra...@gmail.com>.
Hello Koji,

Thanks for the answer. I am aware of that and I do use ext4; stat
reports the expected precision: 2017-06-01 10:10:18.783447047

Thanks,
Roman



--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037p16059.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Roman, Joe S, and others,

I've finally made some progress on these ListXXX processor issues.

I have now confirmed that ListFile can list 100,000 files without
missing anything:
for i in {1..100000}; do touch ./test_$i; done
works fine!! (This requires both NIFI-4069 and NIFI-3332.)

1. ListFile can miss files on filesystems that do not provide
timestamps with millisecond precision (NIFI-4069)
PR #1915 is ready for review. It focuses only on solving the timestamp
precision issue.

2. ListFile can miss files that have the same timestamp as the
previously processed latest timestamp (NIFI-3332)
PR #1975 is also ready for review.

3. ListFile cannot pick up files whose timestamp is older than the
previously processed latest timestamp (NIFI-2383)
I haven't done anything about this one.

With the fixes for items 1 and 2, the ListXXX processors are reliable
enough and strike a good balance between reliability and efficiency.
NIFI-3332 brings back storing file identifiers in state, but only for
the files that have the latest timestamp. Previously, the processor
stored every identifier it had processed.
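
The idea of keeping identifiers only for the latest timestamp can be
sketched roughly as follows. This is an illustration under simplifying
assumptions, not the code from the PR: the class name
LatestTimestampTracker and its list method are made up, and a directory
listing is modeled as a plain map from identifier to last-modified
milliseconds.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not NiFi's AbstractListProcessor.
class LatestTimestampTracker {
    // State to persist: one timestamp plus the identifiers that share it.
    private long latestTimestamp = -1L;
    private final Set<String> idsAtLatest = new HashSet<>();

    /** Takes identifier -> last-modified millis; returns identifiers not emitted before. */
    List<String> list(Map<String, Long> files) {
        List<String> fresh = new ArrayList<>();
        long newLatest = latestTimestamp;
        for (Map.Entry<String, Long> e : files.entrySet()) {
            long ts = e.getValue();
            boolean newer = ts > latestTimestamp;
            boolean sameButUnseen = ts == latestTimestamp && !idsAtLatest.contains(e.getKey());
            if (newer || sameButUnseen) {
                fresh.add(e.getKey());
                newLatest = Math.max(newLatest, ts);
            }
        }
        if (newLatest > latestTimestamp) {
            // The latest timestamp moved forward, so older identifiers can be dropped.
            latestTimestamp = newLatest;
            idsAtLatest.clear();
        }
        for (Map.Entry<String, Long> e : files.entrySet()) {
            if (e.getValue() == latestTimestamp) {
                idsAtLatest.add(e.getKey());
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        LatestTimestampTracker tracker = new LatestTimestampTracker();
        Map<String, Long> dir = new LinkedHashMap<>();
        dir.put("test_1", 1000L);
        dir.put("test_2", 1000L);
        System.out.println("first listing:  " + tracker.list(dir));
        dir.put("test_3", 1000L); // written later, but with the same timestamp
        System.out.println("second listing: " + tracker.list(dir));
    }
}
```

The state stays small because only the identifiers sharing the single
latest timestamp are kept. Files older than that timestamp are still
skipped in this sketch, which is the situation described in item 3.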

Could anyone review the PRs above?

Thanks
Koji

On Wed, Jun 21, 2017 at 2:01 PM, Koji Kawamura <ij...@gmail.com> wrote:
> Thanks Joe, I agree with you on the idea to make ListXXX as reliable
> as possible. If it's done, I'm also interested in providing different
> means using watch APIs to cover use-cases that ListXXX can't (by
> timestamps).
>
> Roman, thanks for testing the change.
> Test 1 and 2 results are expected.
> Test 3 ... this might have been affected by the issue reported by
> NIFI-3332 (files having the same timestamp processed at previous
> cycle). I'll take a look if there's anything we can do.
>
>> 2. Still do not see milliseconds, however my ext4 file system show modify date in nanoseconds
>
> Roman, would you try creating a simple Java program to see if the
> issue resides in NiFi codebase, or native code for your environment?
> There is a similar issue reported in Stackoverflow:
> https://stackoverflow.com/questions/24804618/get-file-mtime-with-millisecond-resolution-from-java
>
> If the simple program can return timestamp in milliseconds, we should
> fix something in NiFi.
>
> I really appreciate your feedback! Thanks!
> Koji

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Koji Kawamura <ij...@gmail.com>.
Thanks Joe, I agree with you on the idea of making ListXXX as reliable
as possible. Once that is done, I'm also interested in providing a
different mechanism based on watch APIs to cover use cases that ListXXX
can't handle with timestamps alone.

Roman, thanks for testing the change.
The results of tests 1 and 2 are expected.
Test 3 ... this might have been affected by the issue reported in
NIFI-3332 (files having the same timestamp as those processed in the
previous cycle). I'll take a look to see if there's anything we can do.

> 2. Still do not see milliseconds, however my ext4 file system show modify date in nanoseconds

Roman, would you try writing a simple Java program to see whether the
issue resides in the NiFi codebase or in native code for your
environment? There is a similar issue reported on Stack Overflow:
https://stackoverflow.com/questions/24804618/get-file-mtime-with-millisecond-resolution-from-java

If the simple program returns timestamps with milliseconds, we should
fix something in NiFi.
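
A minimal sketch of such a check, assuming Java 8 and a file path given
as the first argument (the class name MtimeCheck is made up for
illustration). It reads the modification time through both java.io.File
and java.nio.file.Files, so a millisecond part of 0 in only one of the
two values would point at the Java API rather than the filesystem:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for comparing the two standard ways to read mtime.
class MtimeCheck {

    /** Modification time in epoch milliseconds as reported by java.io.File. */
    static long viaFile(Path path) {
        return path.toFile().lastModified();
    }

    /** Modification time in epoch milliseconds as reported by java.nio.file.Files. */
    static long viaNio(Path path) {
        try {
            return Files.getLastModifiedTime(path).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no path is given.
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        long io = viaFile(path);
        long nio = viaNio(path);
        System.out.println("File.lastModified():         " + io + "  (millis part " + io % 1000 + ")");
        System.out.println("Files.getLastModifiedTime(): " + nio + "  (millis part " + nio % 1000 + ")");
    }
}
```

Running it against one of the touched files (for example
java MtimeCheck ./test_1) and comparing with the output of stat for the
same file should show at which layer the sub-second part disappears.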

I really appreciate your feedback! Thanks!
Koji

On Tue, Jun 20, 2017 at 9:17 PM, Roman <ra...@gmail.com> wrote:
> Hello Koji,
>
> Thanks for NIFI-4069 (not NIFI-4096 =))
>
> I tested your PR in several ways on version: From a0f2834 on branch
> nifi-4069
>
> Test 1:
> 1. set Target System Timestamp Precision: Auto Detect
> 2. start ListFile
> 3. start script for i in {1..10000}; do touch ./test_$i; done
>
> Result: no miss files
>
>
> Test 2:
> 1. set Target System Timestamp Precision: Milliseconds
> 2. start ListFile
> 3. start script for i in {1..10000}; do touch ./test_$i; done
>
> Result: there are missing files
>
>
> Test 3 and 4 (100k files):
> 1. set Target System Timestamp Precision: Auto Detect
> 2. start ListFile
> 3. start script for i in {1..100000}; do touch ./test_$i; done
>
> Result: missing 68 and 40 files
>
>
> In all tests listing.timestamp and processed.timestamp still not have
> milliseconds
>
>
>
> Summary:
> 1. Now much better than it was. Thanks Koji for good job!
> 2. Still do not see milliseconds, however my ext4 file system show modify
> date in nanoseconds

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Roman <ra...@gmail.com>.
Hello Koji,

Thanks for NIFI-4069 (not NIFI-4096 =))

I tested your PR in several ways on version: From a0f2834 on branch
nifi-4069

Test 1:
1. set Target System Timestamp Precision: Auto Detect
2. start ListFile
3. start script for i in {1..10000}; do touch ./test_$i; done

Result: no missing files


Test 2:
1. set Target System Timestamp Precision: Milliseconds
2. start ListFile
3. start script for i in {1..10000}; do touch ./test_$i; done

Result: there are missing files


Test 3 and 4 (100k files):
1. set Target System Timestamp Precision: Auto Detect
2. start ListFile
3. start script for i in {1..100000}; do touch ./test_$i; done

Result: missing 68 and 40 files


In all tests, listing.timestamp and processed.timestamp still do not
have milliseconds.



Summary:
1. It is now much better than it was. Thanks, Koji, for the good job!
2. I still do not see milliseconds, although my ext4 file system shows
the modification date in nanoseconds.


Koji Kawamura-2 wrote
> Hi Roman and all,
> 
> As I investigated further on ListFile processor, I found those are two
> different issues.
> Also I found another JIRA related to ListFile. Currently there seem to
> be three issues:
> 
> 1. ListFile can miss files with filesystems those do not provide
> timestamps in milliseconds precision (NIFI-4096)
> 2. ListFile can miss files having the same timestamp same as the
> previously processed latest timestamp (NIFI-3332)
> 3. ListFile can not pickup files whose timestamp is older than the
> previously processed latest timestamp (NIFI-2383)
> 
> # NIFI-4096
> I created JIRA NIFI-4096 to address issue#1 above, by adding
> deterministic logic to detect target filesystem timestamp precision.
> With NIFI-4096, ListFile can list whole 10,000 files created by the
> command you shared before without missing anything:
> 
> ```
> for i in {1..10000}; do touch ./test_$i; done
> ```
> 
> The PR is ready for review. I appreciate if you can test the fix with
> your use case.
> 
> Additionally, I refactored variable names in AbstractListProcessor to
> explain purpose and timestamp unit better. I hope it makes the code
> more readable and maintainable.
> 
> # NIFI-3332
> I'm thinking about adding a processor property to specify whether
> track the listed filenames with the latest processed timestamp.
> Although it will be less efficient, it'd be good for some use cases.
> 
> # NIFI-2383
> This is the most difficult case to handle right with only timestamp.
> We need different processor which can use watch API..
> 
> Any comment would be appreciated.
> 
> Thanks,
> Koji

--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/processors-ListFile-ListSFTP-do-not-store-milliseconds-in-timestamp-tp16037p16221.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Joe Skora <js...@gmail.com>.
Koji and Roman,

Sorry to jump in here late; I meant to follow up last week.

I created NIFI-3332 because of issue #2: when ListFile fires between
OS writes of a batch of files, files with the same timestamp that the
OS writes after the processor fired are missed.  I suspect #1 is an
amplification of #2, where one-second resolution will unfortunately
increase both the potential collision rate and the potential state to
be tracked 1,000-fold each.
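
The failure mode can be reproduced with a deliberately simplified model.
The sketch below is not NiFi code: NaiveTimestampLister is a made-up
class that remembers only the latest timestamp it has seen and emits
files strictly newer than it, and truncateToSeconds mimics a filesystem
or API that reports whole seconds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a lister that keeps a single timestamp as state.
class NaiveTimestampLister {
    private long lastProcessed = -1L;

    /** Emits only identifiers whose timestamp is strictly newer than the stored one. */
    List<String> list(Map<String, Long> files) {
        List<String> fresh = new ArrayList<>();
        long max = lastProcessed;
        for (Map.Entry<String, Long> e : files.entrySet()) {
            long ts = e.getValue();
            if (ts > lastProcessed) {
                fresh.add(e.getKey());
                max = Math.max(max, ts);
            }
        }
        lastProcessed = max;
        return fresh;
    }

    /** Drops the millisecond part, as a seconds-resolution timestamp would. */
    static long truncateToSeconds(long epochMillis) {
        return epochMillis / 1000 * 1000;
    }

    public static void main(String[] args) {
        NaiveTimestampLister lister = new NaiveTimestampLister();
        Map<String, Long> dir = new LinkedHashMap<>();
        // Written at 5.100 s, seen by the lister as 5.000 s.
        dir.put("test_1", truncateToSeconds(5100L));
        System.out.println("first listing:  " + lister.list(dir));
        // Written at 5.900 s, after the lister fired, but also seen as 5.000 s.
        dir.put("test_2", truncateToSeconds(5900L));
        System.out.println("second listing: " + lister.list(dir));
    }
}
```

With millisecond timestamps the two files would differ (5100 and 5900)
and the second listing would pick up test_2; with whole seconds both
collapse to 5000 and test_2 is never emitted.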

I have a harder time with #3 as I understand the opinion that it's a new
file if I just wrote it, even if I kept the old timestamp.  But NiFi has to
use a discrete means to identify new files and I think it is reasonable to
use file timestamps, especially since this scenario can be mitigated by
updating the file timestamp.  It could be possible to use a combination of
modification and creation times (where both are available) to minimize
potential misses, but I don't think #3 is as likely as #1 and #2 once the
logic is understood, especially since a workaround is fairly easy.

I think a ListXXX processor that tracks events from Linux inotify and/or
Windows FileSystemWatcher (or something similar) services would be a great
addition, but the simplicity of ListFile would still be useful if I could
trust it to not silently drop files.

I hope that helps.

Regards,
Joe

On Wed, Jun 14, 2017 at 5:00 AM, Koji Kawamura <ij...@gmail.com>
wrote:

> Hi Roman and all,
>
> As I investigated further on ListFile processor, I found those are two
> different issues.
> Also I found another JIRA related to ListFile. Currently there seem to
> be three issues:
>
> 1. ListFile can miss files with filesystems those do not provide
> timestamps in milliseconds precision (NIFI-4096)
> 2. ListFile can miss files having the same timestamp same as the
> previously processed latest timestamp (NIFI-3332)
> 3. ListFile can not pickup files whose timestamp is older than the
> previously processed latest timestamp (NIFI-2383)
>
> # NIFI-4096
> I created JIRA NIFI-4096 to address issue#1 above, by adding
> deterministic logic to detect target filesystem timestamp precision.
> With NIFI-4096, ListFile can list whole 10,000 files created by the
> command you shared before without missing anything:
>
> ```
> for i in {1..10000}; do touch ./test_$i; done
> ```
>
> The PR is ready for review. I appreciate if you can test the fix with
> your use case.
>
> Additionally, I refactored variable names in AbstractListProcessor to
> explain purpose and timestamp unit better. I hope it makes the code
> more readable and maintainable.
>
> # NIFI-3332
> I'm thinking about adding a processor property to specify whether
> track the listed filenames with the latest processed timestamp.
> Although it will be less efficient, it'd be good for some use cases.
>
> # NIFI-2383
> This is the most difficult case to handle right with only timestamp.
> We need different processor which can use watch API..
>
> Any comment would be appreciated.
>
> Thanks,
> Koji

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Roman and all,

As I investigated the ListFile processor further, I found that these
are two different issues.
I also found another JIRA related to ListFile. Currently there seem to
be three issues:

1. ListFile can miss files on filesystems that do not provide
timestamps with millisecond precision (NIFI-4096)
2. ListFile can miss files that have the same timestamp as the
previously processed latest timestamp (NIFI-3332)
3. ListFile cannot pick up files whose timestamp is older than the
previously processed latest timestamp (NIFI-2383)

# NIFI-4096
I created JIRA NIFI-4096 to address issue #1 above by adding
deterministic logic to detect the target filesystem's timestamp
precision. With NIFI-4096, ListFile can list all 10,000 files created
by the command you shared before without missing anything:

```
for i in {1..10000}; do touch ./test_$i; done
```

The PR is ready for review. I would appreciate it if you could test the
fix with your use case.

Additionally, I renamed variables in AbstractListProcessor to better
explain their purpose and timestamp unit. I hope it makes the code
more readable and maintainable.
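
For readers wondering what detecting timestamp precision could look
like, here is one plausible heuristic. It is an assumption made for
illustration and not necessarily the logic in the PR: the class
PrecisionGuess and its guess method are made up, and the input is
simply a list of observed last-modified values in epoch milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical heuristic: infer precision from the timestamps actually observed.
class PrecisionGuess {

    /** Returns the finest unit for which any observed timestamp carries information. */
    static TimeUnit guess(List<Long> epochMillis) {
        boolean sawSeconds = false;
        for (long ts : epochMillis) {
            if (ts % 1000 != 0) {
                // A non-zero millisecond part proves millisecond precision.
                return TimeUnit.MILLISECONDS;
            }
            if (ts % 60_000 != 0) {
                sawSeconds = true;
            }
        }
        // With no evidence of finer precision, fall back to the coarser unit.
        return sawSeconds ? TimeUnit.SECONDS : TimeUnit.MINUTES;
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList(1496311818783L, 1496311818000L)));
        System.out.println(guess(Arrays.asList(1496311818000L, 1496311819000L)));
        System.out.println(guess(Arrays.asList(1496311800000L)));
    }
}
```

A heuristic like this can guess too coarse a unit when only a few files
have been seen, which is one reason to let the user set the precision
explicitly instead.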

# NIFI-3332
I'm thinking about adding a processor property to specify whether to
track the listed filenames together with the latest processed
timestamp. Although it will be less efficient, it would be good for
some use cases.

# NIFI-2383
This is the most difficult case to handle correctly with timestamps
alone. We need a different processor that can use a watch API.
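
As a rough idea of what a watch-based approach involves, the sketch
below uses the standard java.nio.file.WatchService. It is only an
illustration, not a proposed processor: the class CreateWatcher, its
methods, and the 30-second timeout are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: reacts to creation events instead of comparing timestamps.
class CreateWatcher {

    /** Registers the directory for file-creation events. */
    static WatchService watch(Path dir) throws IOException {
        WatchService ws = dir.getFileSystem().newWatchService();
        dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE);
        return ws;
    }

    /** Waits up to timeoutSeconds and returns the names from the next batch of events. */
    static List<String> nextCreated(WatchService ws, long timeoutSeconds) throws InterruptedException {
        List<String> names = new ArrayList<>();
        WatchKey key = ws.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return names; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            // OVERFLOW events carry no file name and are skipped here.
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                names.add(event.context().toString());
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return names;
    }

    /** Creates a temporary directory, adds one file, and reports what the watcher saw. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("watch-sketch");
            try (WatchService ws = watch(dir)) {
                Files.createFile(dir.resolve("test_1"));
                return nextCreated(ws, 30);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("created: " + demo());
    }
}
```

Events are only delivered while the watcher is running, so files
created while it is stopped would still need a timestamp-based listing
to catch up.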

Any comment would be appreciated.

Thanks,
Koji

On Tue, Jun 6, 2017 at 9:18 PM, Koji Kawamura <ij...@gmail.com> wrote:
> Hi Roman,
>
> I think NIFI-3332 is probably related, as I can see that the timestamps
> in the logs don't have milliseconds.
>
> I've been considering how we can support all corner cases with minimal
> state to persist, and make it work even if the filesystem only
> provides the last modified timestamp with seconds precision.
> I am changing code and testing locally, but I am not ready to send a PR
> yet, and I am not fully confident about how to fix it.
>
> Any suggestions or insights on making these ListXXXX processors better
> would be appreciated.
>
> Thanks,
> Koji

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Koji Kawamura <ij...@gmail.com>.
Hi Roman,

I think NIFI-3332 is probably related, as I can see that the timestamps
in the logs don't have milliseconds.

I've been considering how we can support all corner cases with minimal
state to persist, and make it work even if the filesystem only
provides the last modified timestamp with seconds precision.
I am changing code and testing locally, but I am not ready to send a PR
yet, and I am not fully confident about how to fix it.

Any suggestions or insights on making these ListXXXX processors better
would be appreciated.

Thanks,
Koji

On Tue, Jun 6, 2017 at 8:54 PM, Roman <ra...@gmail.com> wrote:
> Hi there,
>
> While digging into this issue, I found an open issue in JIRA, NIFI-3332
> <https://issues.apache.org/jira/browse/NIFI-3332>. Could it be related to my
> situation with the missing milliseconds?
>
> Thanks
> Roman

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Roman <ra...@gmail.com>.
Hi there,

While digging into this issue, I found an open issue in JIRA, NIFI-3332
<https://issues.apache.org/jira/browse/NIFI-3332>. Could it be related to my
situation with the missing milliseconds?

Thanks
Roman


Koji Kawamura-2 wrote
> Hello Roman,
> 
> It seems the resolution of the last modified timestamp depends on the
> file system implementation.
> https://stackoverflow.com/questions/3805201/how-to-get-ubuntu-file-timestamp-in-millisecond
> 
> I reproduced the same behavior on OS X, which uses HFS and has the
> same limitation: timestamp resolution in seconds.
> https://stackoverflow.com/questions/18403588/how-to-return-millisecond-information-for-file-access-on-mac-os-x-in-java
> 
> Which file system are you using on your Ubuntu machine? If it is ext3,
> then changing it to ext4 may address the issue.
> 
> Thanks,
> Koji

Re: processors ListFile/ListSFTP do not store milliseconds in timestamp

Posted by Koji Kawamura <ij...@gmail.com>.
Hello Roman,

It seems the resolution of the last modified timestamp depends on the
file system implementation.
https://stackoverflow.com/questions/3805201/how-to-get-ubuntu-file-timestamp-in-millisecond

I reproduced the same behavior on OS X, which uses HFS and has the
same limitation: timestamp resolution in seconds.
https://stackoverflow.com/questions/18403588/how-to-return-millisecond-information-for-file-access-on-mac-os-x-in-java

Which file system are you using on your Ubuntu machine? If it is ext3,
then changing it to ext4 may address the issue.
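
To check what Java itself reports for a file, independent of what stat
shows, something like the following sketch may help. ModifiedTimeCheck and
subSecondMillis are hypothetical names; the java.nio.file and java.io calls
are standard JDK APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ModifiedTimeCheck {
    // Sub-second part (0-999 ms) of the last modified time as seen through java.nio.
    static long subSecondMillis(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis() % 1000;
    }

    public static void main(String[] args) throws IOException {
        // Pass a file path as the first argument; defaults to the current directory.
        Path file = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("java.nio sub-second ms: " + subSecondMillis(file));
        // The older java.io API, for comparison.
        System.out.println("java.io  sub-second ms: " + file.toFile().lastModified() % 1000);
    }
}
```

If both values are always 0 for files that stat shows with a fractional
part, the precision is being lost somewhere between the filesystem and the
JVM rather than in the processor.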

Thanks,
Koji

On Thu, Jun 1, 2017 at 1:25 AM, Roman <ra...@gmail.com> wrote:
> Hi there, I need help.
>
> We are preparing a high-load project and have tested these processors. We
> always see the listing.timestamp and processed.timestamp keys without
> milliseconds (xxxxxxxxxx000). As a result, if several files are generated
> within one second, not all of them are listed.
>
>
> Test:
> 1. Start the ListFile/ListSFTP processor.
> 2. Generate 10000 zero-size files. My command:  for i in {1..10000}; do
> touch ./test_$i; done
> 3. Check the processor stats: out 3952 (0 bytes)
>
>
> Am I doing something wrong? Or is it a bug in NiFi/Java/etc.?
>
> Environment
>
> Ubuntu 14.04.5 LTS, x64, ext4 file system
> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> Nifi 1.2.0 From 3a605af, Tagged nifi-1.2.0-RC2
>
>
> Thanks
> Roman