Posted to dev@nifi.apache.org by "Sethuram, Anup" <an...@philips.com> on 2015/05/04 16:11:04 UTC

Fetch change list

Hi,
                I'm trying to fetch the set of files that have recently changed in a filesystem, while keeping the original copies in place.
To pick up the latest changed files, I'm using a PutFile with the "replace" conflict-resolution strategy, piped to a GetFile with a minimum file age of 5 sec, a maximum file age of 30 sec, and Keep Source File set to true. (A sketch of how that age window behaves follows below.)
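For illustration, here is a minimal sketch, in plain Java rather than NiFi code, of the age-window selection this configuration implies; a file qualifies only while its age sits between the minimum and maximum, so with a 5 sec / 30 sec window each file is pickable for roughly 25 seconds. All names here are illustrative, not NiFi source:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for the min/max file age check; not NiFi code.
    public class AgeWindowFilter {

        // A file qualifies only while (now - lastModified) is inside [minAgeMs, maxAgeMs].
        public static List<File> eligible(File dir, long minAgeMs, long maxAgeMs) {
            List<File> result = new ArrayList<>();
            long now = System.currentTimeMillis();
            File[] files = dir.listFiles();
            if (files == null) {
                return result; // directory missing or unreadable
            }
            for (File f : files) {
                long age = now - f.lastModified();
                if (f.isFile() && age >= minAgeMs && age <= maxAgeMs) {
                    result.add(f);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // 5 s minimum, 30 s maximum, mirroring the configuration above.
            for (File f : eligible(new File("/nifi/UNZ"), 5_000L, 30_000L)) {
                System.out.println(f.getName());
            }
        }
    }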

I'm also running in clustered mode, and I'm seeing the issues below:

-          The queue starts growing if there's an error.

-          Continuous errors with 'NoSuchFileException'.

-          'Penalizing StandardFlowFileRecord' errors.




ERROR  18:45:56 IST  161.91.234.248:7087

GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve files due to org.apache.nifi.processor.exception.FlowFileAccessException: Failed to import data from /nifi/UNZ/log201403230000.log for StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,claim=,offset=0,name=6908587554457536,size=0] due to java.nio.file.NoSuchFileException: /nifi/UNZ/log201403230000.log

ERROR  10:54:50 IST  161.91.234.248:6087

PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,claim=1430717088883-73580,offset=0,name=file1.log,size=29314779] and transferring to failure due to org.apache.nifi.processor.exception.ProcessException: Could not rename /nifi/UNZ/.file1.log: org.apache.nifi.processor.exception.ProcessException: Could not rename: /nifi/UNZ/.file1.log

ERROR  10:54:56 IST  161.91.234.248:7087

PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,claim=1430717094668-73059,offset=1533296,name=file2.log,size=28014262] and transferring to failure due to org.apache.nifi.processor.exception.ProcessException: Could not rename: /data/softwares/RS/nifi/OUT/.file2.log: org.apache.nifi.processor.exception.ProcessException: Could not rename: /nifi/OUT/.file2.log

Do I have to tweak the Run Schedule, or the minimum and maximum file ages, to overcome this issue?
What would be an elegant solution in NiFi?


Thanks,
anup


Re: Fetch change list

Posted by Oscar dela Pena <od...@exist.com>.
Hi Mark, 

My team and I are working on a scenario similar to Anup's, but we're using SFTP rather than HDFS as the remote file source. 
I'm wondering if there will also be processors like ListSFTP and FetchSFTP in the 0.1.0 release 
that can keep state about what has already been pulled? We are thinking of implementing a custom processor 
just to do that. 

Thanks! 
Owie 


Re: Fetch change list

Posted by Corey Flowers <cf...@onyxpoint.com>.
Wahoo! Thanks Mark for saving me on this one!

Anup, before this release, it would not have been pretty to pull that delta
off! :-)

On Tue, May 5, 2015 at 11:39 AM, Mark Payne <ma...@hotmail.com> wrote:

> Anup,
> With the 0.1.0 release that we are working on right now, there are two new
> processors, ListHDFS and FetchHDFS, that are able to keep state about what has
> been pulled from HDFS. This way you can keep the data in HDFS and still
> only pull in new data. Will this help?
> Thanks, -Mark
>
> > From: anup.sethuram@philips.com
> > To: dev@nifi.incubator.apache.org
> > Subject: RE: Fetch change list
> > Date: Tue, 5 May 2015 15:32:07 +0000
> >
> > Thanks Corey for that info. But the major problem I'm facing is that I am
> backing up a large set of data into HDFS (with a GetHDFS, source retained
> as true) and then trying to fetch the delta from it (getting only the files
> which have arrived recently, by using the min age and max age). But I'm
> unable to get the exact delta if I have 'keep source file' set to true.
> > I played around a lot with the schedule time and min & max age, but it
> didn't help.
> >
> > -----Original Message-----
> > From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> > Sent: Tuesday, May 05, 2015 5:35 PM
> > To: dev@nifi.incubator.apache.org
> > Subject: Re: Fetch change list
> >
> > OK, the GetFile that is running is basically causing a race condition
> between all of the servers in your cluster. That is why you are seeing the
> "NoSuchFile" error. If you change the scheduling strategy on that processor
> to "On primary node", then the only system that will try to pick up data
> from that mount point is the server you have designated as the primary node.
> > This should fix that issue.
> >
> > On Mon, May 4, 2015 at 11:30 PM, Sethuram, Anup <anup.sethuram@philips.com>
> > wrote:
> >
> > > Yes Corey, right now the pickup directory is on a network share
> > > mount point. The data is picked up from one location and transferred
> > > to the other. I'm using site-to-site communication.
> > >
> > > -----Original Message-----
> > > From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> > > Sent: Monday, May 04, 2015 7:57 PM
> > > To: dev@nifi.incubator.apache.org
> > > Subject: Re: Fetch change list
> > >
> > > Good morning Anup!
> > >
> > >          Is the pickup directory coming from a network share mount point?
> > >
-- 
Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699
cflowers@onyxpoint.com


RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
Owie,

I think at this point we are still in the phase of designing how the state management will work and throwing around ideas, so it'll likely be a while before that's available.

If you want to tackle the (S)FTP stuff, then by all means go for it! We'd love to have the contribution. We can worry about iterating to take advantage of those new state management features once they exist; I don't know when that will happen, so it makes total sense to do at least a first-round implementation without them.


Re: Fetch change list

Posted by Oscar dela Pena <od...@exist.com>.
Hi Mark, 
I would like to help and contribute to the project by writing an implementation of List/Fetch (S)FTP. 
I am thinking of taking a similar approach to List/Fetch HDFS, where the list of previously fetched files 
is persisted in the distributed cache service. From the cache service, the processor can filter out 
the filenames that have already been downloaded and skip them; a rough sketch of that check follows below. 
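
A minimal sketch of that check, with a simple in-memory stand-in for the distributed cache; the names here are illustrative only, not NiFi's actual cache client API:

    import java.util.HashSet;
    import java.util.Set;

    // SeenFileCache stands in for the distributed cache service; the in-memory
    // implementation below is for illustration only.
    interface SeenFileCache {
        // Returns true if the name was newly recorded, i.e. not fetched before.
        boolean markIfUnseen(String filename);
    }

    class InMemorySeenFileCache implements SeenFileCache {
        private final Set<String> seen = new HashSet<>();
        public synchronized boolean markIfUnseen(String filename) {
            return seen.add(filename);
        }
    }

    public class ListingDeduper {
        public static void main(String[] args) {
            SeenFileCache cache = new InMemorySeenFileCache();
            String[] listing = {"a.log", "b.log", "a.log"}; // "a.log" appears twice
            for (String name : listing) {
                if (cache.markIfUnseen(name)) {
                    System.out.println("fetch " + name); // route to the Fetch step
                } else {
                    System.out.println("skip  " + name); // already pulled earlier
                }
            }
        }
    }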

I understand that the "simple state management in the framework" you mentioned before might 
be the long-term solution and perhaps the more elegant implementation. Please let me know if there is already an ongoing 
effort to implement it, or a plan to implement it soon; in that case, I may just have to wait for it. 

I'd be happy to work on List/Fetch SFTP and share the implementation, since I will be using it on my own project as well. 
Thanks! 
Owie 


Re: Fetch change list

Posted by Oscar dela Pena <od...@exist.com>.
Thanks Mark for the response. We will try to work on the SFTP List/Retrieve. We will be glad to contribute it 
if time permits and our task schedule allows.
Owie

----- Original Message -----
From: "Anup Sethuram" <an...@philips.com>
To: dev@nifi.incubator.apache.org
Sent: Wednesday, May 6, 2015 11:38:52 AM
Subject: Re: Fetch change list

Thanks Mark for that one; that should be a big relief. I'll be waiting to
check it out!

Regards,
anup


RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
No problem. I've created a ticket for this: https://issues.apache.org/jira/browse/NIFI-673

If you'd like, you can also create tickets yourself by going to https://issues.apache.org/jira/browse/NIFI. You'll have to create an account, but I don't believe you'll need any special permissions added to it.

Thanks
-Mark



Re: Fetch change list

Posted by Oscar dela Pena <od...@exist.com>.
Hi Mark,

Can we also create a ticket for List and Fetch SFTP?

Thanks!


----- Original Message -----
From: "Mark Payne" <ma...@hotmail.com>
To: dev@nifi.incubator.apache.org, users@nifi.incubator.apache.org
Sent: Wednesday, June 3, 2015 5:56:46 AM
Subject: RE: Fetch change list

Anup,

I have created a ticket for creating two new processors: ListFile and FetchFile. These should provide a much nicer user experience for what you're trying to do here.

The ticket is NIFI-631:  https://issues.apache.org/jira/browse/NIFI-631

Thanks
-Mark

----------------------------------------
> Date: Tue, 2 Jun 2015 07:41:45 -0700
> From: anup.sethuram@philips.com
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
>
> Suppose I have 1 TB of data that I need to back up/sync to an HDFS location
> and then pass on to Kafka; is there a way to do that?

Re: Fetch change list

Posted by Adam Taft <ad...@adamtaft.com>.
1)  And, file permissions may not necessarily allow a write-lock on a
file.  The NiFi user might only be allowed read permission on a given file.
(A rough sketch of such a probe, and its caveats, follows.)
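
A minimal sketch of the open-for-write probe from Joe Skora's point 1 (quoted below), subject to the caveats just noted and to Joe Witt's: locks are advisory on most platforms, many writers never take them, and without write permission the probe fails outright. The names are illustrative:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Hypothetical probe: treat a file as quiescent only if we can open it
    // for write and take an exclusive lock on it.
    public class WritableProbe {
        public static boolean looksQuiescent(Path file) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE);
                 FileLock lock = ch.tryLock()) {
                return lock != null; // null: another process holds the lock
            } catch (IOException e) {
                return false; // missing file or no write access; signal unusable
            }
        }

        public static void main(String[] args) {
            Path p = Paths.get("/nifi/UNZ/log201403230000.log");
            System.out.println(p + " quiescent? " + looksQuiescent(p));
        }
    }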

2)  I really like this concept, +1 to the idea.  In this way, the "List"
operation follows the Unix design philosophy of doing exactly one
thing only.  Conceptually, sitting between the "List" and "Fetch" operations
could be a handful of standard processors designed to filter, augment, or
ignore any fetch request.  This could be a very powerful way to compose the
functionality (though possibly at the expense of simplicity for the
dataflow manager).  A sketch of that composition follows.
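
A sketch of that list / decide / fetch composition, with purely illustrative interfaces (not NiFi APIs); the decision phase is just a chain of predicates that a new processor could extend:

    import java.util.List;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    // Illustrative phases: Lister enumerates candidates, the Predicate chain
    // is the "decision" phase, and Handler plays the role of the Fetch step.
    interface Lister { List<String> list(); }
    interface Handler { void handle(String file); }

    public class ComposedFlow {
        static void run(Lister lister, Predicate<String> decision, Handler handler) {
            List<String> selected = lister.list().stream()
                    .filter(decision)
                    .collect(Collectors.toList());
            selected.forEach(handler::handle);
        }

        public static void main(String[] args) {
            run(() -> List.of("a.log", "b.tmp", "c.log"),
                name -> name.endsWith(".log"),            // ignore temp files
                name -> System.out.println("fetch " + name));
        }
    }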



On Wed, Jul 29, 2015 at 1:00 PM, Joe Witt <jo...@gmail.com> wrote:

> On 1) there are very few guarantees across OSes.  Some support locking, but
> many apps don't use it.  File I/O is the wild wild west of idioms.
>
> On 2) you certainly can tackle it that way.  This gets into the more art-
> than-science part of designing and composing processors.  The key is to
> always keep the operations person's perspective in mind as the user.
>
> Joe
> On Jul 29, 2015 9:25 AM, "Joe Skora" <js...@gmail.com> wrote:
>
> > 1. Is there any reason it wouldn't work to try to open the files for
> > write, and only begin to handle a file when it is writable?  It seems like
> > a file source would typically open for write, write everything, and then
> > close.  Cases where something re-opens and appends would obviously not
> > work, but that seems a less likely situation.
> >
> > 2. Is there any value in breaking it into 3 phases, with a "selection"
> > phase, a "decision" phase, and a "handling" phase?  The "selection" phase
> > lists ALL possible files to be considered, the "decision" phase determines
> > which files to process, and the "handling" phase manages processing the
> > selected files.  Processors in the "decision" phase provide the
> > "combination of signals" Adam mentions, using whatever variety of state
> > and other factors are necessary.  Extending the decision logic then only
> > requires a new processor.  Obviously, there's still a bit of
> > back-and-forth among the phases that would have to be worked out for
> > managing file removal, etc.
> >
> > Joe
> >
> > On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <jo...@gmail.com> wrote:
> >
> > > Turning noatime on kicks last-mod out the window.  It is for sure the
> > > case when dealing with file I/O that there really are no rules.  As
> > > Adam notes, it is about giving options/strategies.
> > >
> > > Surprisingly hard to do this well.  But good MVP options exist to get
> > > something out and get more feedback on the true need.
> > >
> > > On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <ad...@adamtaft.com> wrote:
> > > > Some additional feature requests for the sake of consideration...
> > > >
> > > > For some file systems (I can think of one), the last-modified date may
> > > > not be dependable, or possibly not of high enough precision.
> > > > Additional strategies could be considered for determining whether a
> > > > file has been previously processed: for example, the byte size of the
> > > > file, the md5 hash, or possibly other signals.
> > > >
> > > > While these additional strategies may not be coded initially, I think
> > > > they would add nice features for the proposed AbstractListFileProcessor.
> > > > In this way, the abstract processor could use one, or even a
> > > > combination, of signals to determine if a file has been modified and
> > > > needs to be pulled again.
> > > >
> > > > Additionally, it might be good to have other mechanisms in place to
> > > > mark a file as unavailable.  The "dot file" convention is pretty
> > > > common, but there might be additional ways to indicate that a file is
> > > > still being manipulated; i.e., maybe not all writers to the file
> > > > system understand the dot-file convention, and so other strategies
> > > > might be required.
> > > >
> > > > For example, in one processor I worked with, it was required to pull
> > > > the list of remote files twice in order to monitor the file sizes.  If
> > > > the file size stayed consistent between the two pulls, the file could
> > > > safely be considered ready for processing.  However, if the file size
> > > > differed between the two pulls, we could assume that a client was
> > > > still writing to the file.
> > > >
> > > > Adam
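
A rough sketch of that size-stability check, with illustrative names (each listing is a map from filename to size):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch: only files whose size is identical in two successive listings
    // are treated as ready; anything that grew, shrank, or just appeared waits.
    public class SizeStability {
        public static List<String> stableFiles(Map<String, Long> firstPull,
                                               Map<String, Long> secondPull) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Long> e : secondPull.entrySet()) {
                Long earlier = firstPull.get(e.getKey());
                if (earlier != null && earlier.equals(e.getValue())) {
                    ready.add(e.getKey());
                }
            }
            return ready;
        }

        public static void main(String[] args) {
            Map<String, Long> first = Map.of("a.log", 100L, "b.log", 50L);
            Map<String, Long> second = Map.of("a.log", 100L, "b.log", 80L, "c.log", 10L);
            System.out.println(stableFiles(first, second)); // [a.log]
        }
    }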
> > > >
> > > >
> > > > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <ma...@hotmail.com>
> > > wrote:
> > > >
> > > >> Joe S,
> > > >>
> > > >> I agree, I think the design of List/Fetch HDFS is extremely applicable
> > > >> to this. The way it saves state is by using a DistributedMapCacheServer.
> > > >> The intent is to run the List processor on the primary node only, and it
> > > >> will store its state there, so that if the primary node is changed, any
> > > >> other node can pick up where the last one left off. In order to avoid
> > > >> saving a massive amount of state in memory, it stores the timestamp of
> > > >> the latest file that it has fetched, as well as all files that have that
> > > >> same timestamp (timestamp = last modified date in this case). So the
> > > >> next time it runs, it can pull just the things whose lastModifiedDate is
> > > >> later than or equal to that timestamp, but it can still know which
> > > >> things to avoid pulling twice, because we've saved that info as well.
> > > >>
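A minimal sketch of the state model Mark describes above (illustrative, not the ListHDFS source): keep only the newest last-modified time plus the names sharing it, and emit anything strictly newer, or equal-but-unseen:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // State = newest timestamp seen + filenames carrying that exact timestamp.
    public class ListingState {
        private long latestTimestamp = Long.MIN_VALUE;
        private final Set<String> namesAtLatest = new HashSet<>();

        // Takes a listing (name -> lastModified), returns what to emit,
        // and advances the stored state.
        public List<String> update(Map<String, Long> listing) {
            List<String> emit = new ArrayList<>();
            long newest = latestTimestamp;
            Set<String> namesAtNewest = new HashSet<>();
            for (Map.Entry<String, Long> e : listing.entrySet()) {
                long t = e.getValue();
                boolean unseen = t > latestTimestamp
                        || (t == latestTimestamp && !namesAtLatest.contains(e.getKey()));
                if (!unseen) {
                    continue; // already pulled in an earlier run
                }
                emit.add(e.getKey());
                if (t > newest) {        // found a newer high-water mark
                    newest = t;
                    namesAtNewest.clear();
                }
                if (t == newest) {
                    namesAtNewest.add(e.getKey());
                }
            }
            if (newest > latestTimestamp) {
                latestTimestamp = newest;
                namesAtLatest.clear();
            }
            namesAtLatest.addAll(namesAtNewest);
            return emit;
        }
    }
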
> > > >> Now, with ListFile it will be a bit different. We tend to think of
> > > >> GetFile and List/Fetch File as pulling from a local file system.
> > > >> However, it is also certainly used to pull from a network-mounted file
> > > >> system. In that case, all nodes in the cluster need the ability to pull
> > > >> the data in unison, so we will want to save the state in such a way
> > > >> that all nodes in the cluster have access to it, in case the primary
> > > >> node changes. But if the file is local, we don't want to save state
> > > >> across the cluster, because each node needs its own state. So that
> > > >> would likely just be an extra property on the processor.
> > > >>
> > > >> If saving state locally, it's easy enough to just write to a text file
> > > >> (I recommend you allow the user to specify the state file, and default
> > > >> it to conf/ListFile-<processor id>.state or something like that).
> > > >>
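A sketch of that local state file; the line layout is purely an assumption (first line the timestamp, then the names seen at it), and the path only mirrors the suggested default:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative persistence: line 1 = latest timestamp, then one filename
    // per line for the files already pulled at that timestamp.
    public class LocalStateFile {
        public static void save(Path stateFile, long latestTimestamp,
                                List<String> namesAtLatest) throws IOException {
            List<String> lines = new ArrayList<>();
            lines.add(Long.toString(latestTimestamp));
            lines.addAll(namesAtLatest);
            Files.createDirectories(stateFile.getParent());
            Files.write(stateFile, lines); // replaces any previous state
        }

        public static void main(String[] args) throws IOException {
            Path state = Paths.get("conf", "ListFile-example.state"); // id is illustrative
            save(state, System.currentTimeMillis(), List.of("file1.log", "file2.log"));
            System.out.println(Files.readAllLines(state));
        }
    }
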
> > > >> I have not documented this pattern, specifically because we've been
> > > >> talking for a while about implementing the simple state management, but
> > > >> we just haven't gotten there yet. I expected that we would have that
> > > >> finished before writing many more of these List/Fetch processors. That
> > > >> will radically change how we handle all of this.
> > > >>
> > > >> But since it is not there... it may actually make sense to just refactor
> > > >> the ListHDFS processor into an AbstractListFileProcessor that is
> > > >> responsible for handling the state management. I am not sure how
> > > >> complicated that would get, though. Just a thought.
> > > >>
> > > >> Hopefully this helped to clear things up, rather than muddy them up. :)
> > > >> Feel free to fire back any questions.
> > > >>
> > > >> Thanks
> > > >> -Mark
> > > >>
> > > >>
> > > >> ----------------------------------------
> > > >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > > >> > Subject: Re: Fetch change list
> > > >> > From: joe.witt@gmail.com
> > > >> > To: dev@nifi.apache.org
> > > >> >
> > > >> > JoeS
> > > >> >
> > > >> > Sounds great. I'd ignore my provenance comment, as that was really
> > > >> > more about how something external could keep tabs on progress, etc.
> > > >> > Mark Payne designed/built the List/Fetch HDFS one, so I'll defer to
> > > >> > him for the good bits. But the logic to follow for saving state is
> > > >> > probably the same.
> > > >> >
> > > >> > Mark - do you have the design of that thing documented anywhere? It
> > > >> > is a good pattern to describe, because it is effectively a model for
> > > >> > taking non-scalable dataflow interfaces and making them behave as if
> > > >> > they were.
> > > >> >
> > > >> > Thanks
> > > >> > JoeW
> > > >> >
> > > >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com>
> > wrote:
> > > >> >> Joe,
> > > >> >>
> > > >> >> I'm interested in working on List/FetchFile. It seems like
> starting
> > > with
> > > >> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes
> > > sense.
> > > >> >> I'll look at List/FetchHDFS, but is there any further detail on
> how
> > > this
> > > >> >> functionality should differ from GetFile? As for keeping state,
> > > >> >> provenance was suggested, a separate state folder might work, or
> > some
> > > >> file
> > > >> >> systems support additional state that might be usable.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Joe
> > > >> >>
> > > >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com>
> > > wrote:
> > > >> >>
> > > >> >>> Anup,
> > > >> >>>
> > > >> >>> The two tickets in question appear to be:
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-631
> > > >> >>> https://issues.apache.org/jira/browse/NIFI-673
> > > >> >>>
> > > >> >>> Neither have been claimed as of yet. Anybody interested in
> taking
> > > one
> > > >> >>> or both of these on? It would be a lot like List/Fetch HDFS so
> > > you'll
> > > >> >>> have good examples to work from.
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Joe
> > > >> >>>
> > > >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> > > >> >>> <an...@philips.com> wrote:
> > > >> >>>> Can I expect this functionality in the upcoming releases of
> Nifi
> > ?
> > > >> >>>>
> > > >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <
> anup.sethuram@philips.com
> > >
> > > >> wrote:
> > > >> >>>>
> > > >> >>>>>Where is this 1TB dataset living today?
> > > >> >>>>>[anup] Resides in a filesystem
> > > >> >>>>>
> > > >> >>>>>- What is the current nature of the dataset? Is it already in
> > large
> > > >> >>>>>bundles as files or is it a series of tiny messages, etc..?
> Does
> > it
> > > >> >>>>>need to be split/merged/etc..
> > > >> >>>>>[anup] Archived files of size 3MB each collected over a period.
> > > >> Directory
> > > >> >>>>>(1TB) -> Sub-Directories -> Files
> > > >> >>>>>
> > > >> >>>>>- What is the format of the data? Is it something that can
> easily
> > > be
> > > >> >>>>>split/merged or will it require special processes to do so?
> > > >> >>>>>[anup] zip, tar formats.
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>--
> > > >> >>>>>View this message in context:
> > > >> >>>>>
> > > >> >>>
> > > >>
> > >
> >
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> > > >> >>>>>nge-list-tp1351p2126.html
> > > >> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing
> > list
> > > >> >>>>>archive at Nabble.com.
> > > >> >>>>>
> > > >> >>>>>________________________________
> > > >> >>>>>The information contained in this message may be confidential
> and
> > > >> legally
> > > >> >>>>>protected under applicable law. The message is intended solely
> > for
> > > the
> > > >> >>>>>addressee(s). If you are not the intended recipient, you are
> > hereby
> > > >> >>>>>notified that any use, forwarding, dissemination, or
> reproduction
> > > of
> > > >> this
> > > >> >>>>>message is strictly prohibited and may be unlawful. If you are
> > not
> > > the
> > > >> >>>>>intended recipient, please contact the sender by return e-mail
> > and
> > > >> >>>>>destroy all copies of the original message.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> ________________________________
> > > >> >>>> The information contained in this message may be confidential
> and
> > > >> >>> legally protected under applicable law. The message is intended
> > > solely
> > > >> for
> > > >> >>> the addressee(s). If you are not the intended recipient, you are
> > > hereby
> > > >> >>> notified that any use, forwarding, dissemination, or
> reproduction
> > of
> > > >> this
> > > >> >>> message is strictly prohibited and may be unlawful. If you are
> not
> > > the
> > > >> >>> intended recipient, please contact the sender by return e-mail
> and
> > > >> destroy
> > > >> >>> all copies of the original message.
> > > >> >>>
> > > >>
> > > >>
> > >
> >
>

Re: Fetch change list

Posted by Joe Witt <jo...@gmail.com>.
On 1) there are very few guarantees across OSes.  Some support locking, but
many apps don't use it.  File I/O is the wild wild west of idioms.

On 2) you certainly can tackle it that way.  This gets into the
more-art-than-science part of designing and composing processors.  The key
is to always keep the operations person's perspective in mind as the user.

Joe
On Jul 29, 2015 9:25 AM, "Joe Skora" <js...@gmail.com> wrote:

> 1. Is there any reason it wouldn't work to try to open the files for write
> and only begin to handle it when it is writable?  It seems like a file
> source would typically open for write, write everything, and then close.
> Cases where something re-opens and appends would obviously not work in that
> case, but that seems a less likely situation.
>
> 2. Is there any value in breaking it into 3 phases, with a "selection"
> phase, "decision" phase, and "handling" phase?  The "selection" phase that
> lists ALL possible files to be considered, the "decision" phase determines
> which files to process, and the "handling" phase manages processing the
> selected files.  Processors in the "decision" provide the "combination of
> signals" Adam mentions, using what ever variety state and other factors
> necessary.  Extending the decision logic only requires a new processor.
> Obviously, there's still a bit of back-and-forth among the phase that would
> have to be worked out for managing file removal, etc.
>
> Joe
>
> On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <jo...@gmail.com> wrote:
>
> > Turning noatime on kicks last mod out the window.  It is for sure the
> > case when dealing with file IO that there really are no rules.  As
> > Adam notes it is about giving options/strategies.
> >
> > Surprisingly hard to do this well.  But good MVP options exist to get
> > something out and get more feedback on true need.
> >
> > On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <ad...@adamtaft.com> wrote:
> > > Some additional feature requests for sake of consideration...
> > >
> > > For some file systems (I can think of one), the last modified date may
> > not
> > > be dependable or possibly not high enough precision.  Additional
> > strategies
> > > could be considered for determining whether a file has been previously
> > > processed.  For example, the byte size of the file, or the md5 hash, or
> > > possibly other signals.
> > >
> > > While these additional strategies may not be coded initially, I think
> > they
> > > would add nice features for the proposed AbstractListFileProcessor.  In
> > > this way, the abstract processor could use one or even a combination of
> > > signals to determine if a file has been modified and needs to be pulled
> > > again.
> > >
> > > Additionally, it might be good to have other mechanisms in place to
> mark
> > a
> > > file as unavailable.  The "dot file" convention is pretty common, but
> > there
> > > might be additional ways which indicates that a file is still be
> > > manipulated.  i.e. maybe not all writers to the file system understand
> > the
> > > dot file convention, and so other strategies might be required.
> > >
> > > For example, in one processor I worked with, it was required to pull
> the
> > > list of remote files twice in order to monitor the file sizes.  If the
> > file
> > > size stayed consistent between two pulls, it could safely be considered
> > > ready for processing.  However, if the file size differed in the two
> > pulls,
> > > we could assume that a client was still writing to the file.
> > >
> > > Adam
> > >
> > >
> > > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <ma...@hotmail.com>
> > wrote:
> > >
> > >> Joe S,
> > >>
> > >> I agree, i think the design of List/Fetch HDFS is extremely applicable
> > to
> > >> this. The way it saves state is by
> > >> using a DistributedMapCacheServer. The intent is to run the List
> > processor
> > >> on primary node only, and it
> > >> will store its state there so that if the primary node is changed, any
> > >> other node can pick up where the
> > >> last one left off. In order to avoid saving a massive amount of state
> in
> > >> memory, it stores the timestamp of
> > >> the latest file that it has fetched, as well as all files that have
> that
> > >> same timestamp (timestamp = last modified date
> > >> in this case). So the next time it runs, it can pull just things whose
> > >> lastModifiedDate is later than or equal to
> > >> that timestamp, but it can still know which things to avoid pulling
> > twice
> > >> because we've saved that info as well.
> > >>
> > >> Now, with ListFile it will be a bit different. We tend to think of
> > GetFile
> > >> and List/Fetch File as pulling from a local
> > >> file system. However, it is also certainly used to pull from a
> > >> network-mounted file system. In this case, all nodes
> > >> in the cluster need the ability to pull the data in unison. So in this
> > >> case, we will want to save the state in such a way
> > >> that all nodes in the cluster have access to it, in case the primary
> > node
> > >> changes. But if the file is local, we don't want
> > >> to save state across the cluster, because each node needs its own
> state.
> > >> So that would likely just be an extra property
> > >> on the processor.
> > >>
> > >> If saving state locally, it's easy enough to just write to a text file
> > >> (recommend you allow user to specify the state file
> > >> and default it to conf/ListFile-<processor id>.state or something like
> > >> that.
> > >>
> > >> I have not documented this pattern. Specifically because we've been
> > >> talking for a while about implementing the Simple
> > >> State Management but we just haven't gotten there yet. I expected that
> > we
> > >> would have that finished before writing many
> > >> more of these List/Fetch processors. That will radically change how we
> > >> handle all of this.
> > >>
> > >> But since it is not there... it may actually make sense to just
> refactor
> > >> the ListHDFS processor into an AbstractListFileProcessor
> > >> that is responsible for handling the state management. I am not sure
> how
> > >> complicated that would get, though. Just a
> > >> thought.
> > >>
> > >> Hopefully this helped to clear things up, rather than muddy them up :)
> > >> Feel free to fire back any questions.
> > >>
> > >> Thanks
> > >> -Mark
> > >>
> > >>
> > >> ----------------------------------------
> > >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > >> > Subject: Re: Fetch change list
> > >> > From: joe.witt@gmail.com
> > >> > To: dev@nifi.apache.org
> > >> >
> > >> > JoeS
> > >> >
> > >> > Sounds great. I'd ignore my provenance comment as that was really
> > >> > more about how something external could keep tabs on progress, etc..
> > >> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to
> him
> > >> > for the good bits. But the logic to follow for saving state you'll
> > >> > want is probably the same.
> > >> >
> > >> > Mark - do you have the design of that thing documented anywhere? It
> > >> > is a good pattern to describe because it is effectively a model for
> > >> > taking non-scaleable dataflow interfaces and making them behave as
> if
> > >> > they were.
> > >> >
> > >> > Thanks
> > >> > JoeW
> > >> >
> > >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com>
> wrote:
> > >> >> Joe,
> > >> >>
> > >> >> I'm interested in working on List/FetchFile. It seems like starting
> > with
> > >> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes
> > sense.
> > >> >> I'll look at List/FetchHDFS, but is there any further detail on how
> > this
> > >> >> functionality should differ from GetFile? As for keeping state,
> > >> >> provenance was suggested, a separate state folder might work, or
> some
> > >> file
> > >> >> systems support additional state that might be usable.
> > >> >>
> > >> >> Regards,
> > >> >> Joe
> > >> >>
> > >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com>
> > wrote:
> > >> >>
> > >> >>> Anup,
> > >> >>>
> > >> >>> The two tickets in question appear to be:
> > >> >>> https://issues.apache.org/jira/browse/NIFI-631
> > >> >>> https://issues.apache.org/jira/browse/NIFI-673
> > >> >>>
> > >> >>> Neither have been claimed as of yet. Anybody interested in taking
> > one
> > >> >>> or both of these on? It would be a lot like List/Fetch HDFS so
> > you'll
> > >> >>> have good examples to work from.
> > >> >>>
> > >> >>> Thanks
> > >> >>> Joe
> > >> >>>
> > >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> > >> >>> <an...@philips.com> wrote:
> > >> >>>> Can I expect this functionality in the upcoming releases of Nifi
> ?
> > >> >>>>
> > >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <anup.sethuram@philips.com
> >
> > >> wrote:
> > >> >>>>
> > >> >>>>>Where is this 1TB dataset living today?
> > >> >>>>>[anup] Resides in a filesystem
> > >> >>>>>
> > >> >>>>>- What is the current nature of the dataset? Is it already in
> large
> > >> >>>>>bundles as files or is it a series of tiny messages, etc..? Does
> it
> > >> >>>>>need to be split/merged/etc..
> > >> >>>>>[anup] Archived files of size 3MB each collected over a period.
> > >> Directory
> > >> >>>>>(1TB) -> Sub-Directories -> Files
> > >> >>>>>
> > >> >>>>>- What is the format of the data? Is it something that can easily
> > be
> > >> >>>>>split/merged or will it require special processes to do so?
> > >> >>>>>[anup] zip, tar formats.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>--
> > >> >>>>>View this message in context:
> > >> >>>>>
> > >> >>>
> > >>
> >
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> > >> >>>>>nge-list-tp1351p2126.html
> > >> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing
> list
> > >> >>>>>archive at Nabble.com.
> > >> >>>>>
> > >> >>>>>________________________________
> > >> >>>>>The information contained in this message may be confidential and
> > >> legally
> > >> >>>>>protected under applicable law. The message is intended solely
> for
> > the
> > >> >>>>>addressee(s). If you are not the intended recipient, you are
> hereby
> > >> >>>>>notified that any use, forwarding, dissemination, or reproduction
> > of
> > >> this
> > >> >>>>>message is strictly prohibited and may be unlawful. If you are
> not
> > the
> > >> >>>>>intended recipient, please contact the sender by return e-mail
> and
> > >> >>>>>destroy all copies of the original message.
> > >> >>>>
> > >> >>>>
> > >> >>>> ________________________________
> > >> >>>> The information contained in this message may be confidential and
> > >> >>> legally protected under applicable law. The message is intended
> > solely
> > >> for
> > >> >>> the addressee(s). If you are not the intended recipient, you are
> > hereby
> > >> >>> notified that any use, forwarding, dissemination, or reproduction
> of
> > >> this
> > >> >>> message is strictly prohibited and may be unlawful. If you are not
> > the
> > >> >>> intended recipient, please contact the sender by return e-mail and
> > >> destroy
> > >> >>> all copies of the original message.
> > >> >>>
> > >>
> > >>
> >
>

Re: Fetch change list

Posted by Joe Skora <js...@gmail.com>.
1. Is there any reason it wouldn't work to try to open the files for write
and only begin to handle them once they are writable?  It seems like a file
source would typically open for write, write everything, and then close.
Cases where something re-opens and appends would obviously not work in that
case, but that seems a less likely situation.
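
A minimal sketch of what I mean, using plain java.nio (names are
illustrative, not from any NiFi processor) -- the file is treated as ready
only if an exclusive lock can be taken:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.channels.OverlappingFileLockException;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class WritabilityCheck {

        /** True only if an exclusive lock can be taken, i.e. nothing else is writing. */
        public static boolean readyForPickup(final Path file) {
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE);
                 FileLock lock = channel.tryLock()) {
                return lock != null;   // null: another process holds the lock
            } catch (IOException | OverlappingFileLockException e) {
                return false;          // can't open or lock: assume still in use
            }
        }
    }

Of course that only helps when the writing application actually takes a
lock, which is not guaranteed across platforms.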

2. Is there any value in breaking it into 3 phases, with a "selection"
phase, a "decision" phase, and a "handling" phase?  The "selection" phase
lists ALL possible files to be considered, the "decision" phase determines
which files to process, and the "handling" phase manages processing the
selected files.  Processors in the "decision" phase provide the
"combination of signals" Adam mentions, using whatever variety of state and
other factors is necessary.  Extending the decision logic then only
requires a new processor.  Obviously, there's still a bit of back-and-forth
among the phases that would have to be worked out for managing file
removal, etc.

Joe

On Wed, Jul 29, 2015 at 10:31 AM, Joe Witt <jo...@gmail.com> wrote:

> Turning noatime on kicks last mod out the window.  It is for sure the
> case when dealing with file IO that there really are no rules.  As
> Adam notes it is about giving options/strategies.
>
> Surprisingly hard to do this well.  But good MVP options exist to get
> something out and get more feedback on true need.
>
> On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <ad...@adamtaft.com> wrote:
> > Some additional feature requests for sake of consideration...
> >
> > For some file systems (I can think of one), the last modified date may
> not
> > be dependable or possibly not high enough precision.  Additional
> strategies
> > could be considered for determining whether a file has been previously
> > processed.  For example, the byte size of the file, or the md5 hash, or
> > possibly other signals.
> >
> > While these additional strategies may not be coded initially, I think
> they
> > would add nice features for the proposed AbstractListFileProcessor.  In
> > this way, the abstract processor could use one or even a combination of
> > signals to determine if a file has been modified and needs to be pulled
> > again.
> >
> > Additionally, it might be good to have other mechanisms in place to mark
> a
> > file as unavailable.  The "dot file" convention is pretty common, but
> there
> > might be additional ways which indicates that a file is still be
> > manipulated.  i.e. maybe not all writers to the file system understand
> the
> > dot file convention, and so other strategies might be required.
> >
> > For example, in one processor I worked with, it was required to pull the
> > list of remote files twice in order to monitor the file sizes.  If the
> file
> > size stayed consistent between two pulls, it could safely be considered
> > ready for processing.  However, if the file size differed in the two
> pulls,
> > we could assume that a client was still writing to the file.
> >
> > Adam
> >
> >
> > On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <ma...@hotmail.com>
> wrote:
> >
> >> Joe S,
> >>
> >> I agree, i think the design of List/Fetch HDFS is extremely applicable
> to
> >> this. The way it saves state is by
> >> using a DistributedMapCacheServer. The intent is to run the List
> processor
> >> on primary node only, and it
> >> will store its state there so that if the primary node is changed, any
> >> other node can pick up where the
> >> last one left off. In order to avoid saving a massive amount of state in
> >> memory, it stores the timestamp of
> >> the latest file that it has fetched, as well as all files that have that
> >> same timestamp (timestamp = last modified date
> >> in this case). So the next time it runs, it can pull just things whose
> >> lastModifiedDate is later than or equal to
> >> that timestamp, but it can still know which things to avoid pulling
> twice
> >> because we've saved that info as well.
> >>
> >> Now, with ListFile it will be a bit different. We tend to think of
> GetFile
> >> and List/Fetch File as pulling from a local
> >> file system. However, it is also certainly used to pull from a
> >> network-mounted file system. In this case, all nodes
> >> in the cluster need the ability to pull the data in unison. So in this
> >> case, we will want to save the state in such a way
> >> that all nodes in the cluster have access to it, in case the primary
> node
> >> changes. But if the file is local, we don't want
> >> to save state across the cluster, because each node needs its own state.
> >> So that would likely just be an extra property
> >> on the processor.
> >>
> >> If saving state locally, it's easy enough to just write to a text file
> >> (recommend you allow user to specify the state file
> >> and default it to conf/ListFile-<processor id>.state or something like
> >> that.
> >>
> >> I have not documented this pattern. Specifically because we've been
> >> talking for a while about implementing the Simple
> >> State Management but we just haven't gotten there yet. I expected that
> we
> >> would have that finished before writing many
> >> more of these List/Fetch processors. That will radically change how we
> >> handle all of this.
> >>
> >> But since it is not there... it may actually make sense to just refactor
> >> the ListHDFS processor into an AbstractListFileProcessor
> >> that is responsible for handling the state management. I am not sure how
> >> complicated that would get, though. Just a
> >> thought.
> >>
> >> Hopefully this helped to clear things up, rather than muddy them up :)
> >> Feel free to fire back any questions.
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> ----------------------------------------
> >> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> >> > Subject: Re: Fetch change list
> >> > From: joe.witt@gmail.com
> >> > To: dev@nifi.apache.org
> >> >
> >> > JoeS
> >> >
> >> > Sounds great. I'd ignore my provenance comment as that was really
> >> > more about how something external could keep tabs on progress, etc..
> >> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> >> > for the good bits. But the logic to follow for saving state you'll
> >> > want is probably the same.
> >> >
> >> > Mark - do you have the design of that thing documented anywhere? It
> >> > is a good pattern to describe because it is effectively a model for
> >> > taking non-scaleable dataflow interfaces and making them behave as if
> >> > they were.
> >> >
> >> > Thanks
> >> > JoeW
> >> >
> >> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com> wrote:
> >> >> Joe,
> >> >>
> >> >> I'm interested in working on List/FetchFile. It seems like starting
> with
> >> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes
> sense.
> >> >> I'll look at List/FetchHDFS, but is there any further detail on how
> this
> >> >> functionality should differ from GetFile? As for keeping state,
> >> >> provenance was suggested, a separate state folder might work, or some
> >> file
> >> >> systems support additional state that might be usable.
> >> >>
> >> >> Regards,
> >> >> Joe
> >> >>
> >> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com>
> wrote:
> >> >>
> >> >>> Anup,
> >> >>>
> >> >>> The two tickets in question appear to be:
> >> >>> https://issues.apache.org/jira/browse/NIFI-631
> >> >>> https://issues.apache.org/jira/browse/NIFI-673
> >> >>>
> >> >>> Neither have been claimed as of yet. Anybody interested in taking
> one
> >> >>> or both of these on? It would be a lot like List/Fetch HDFS so
> you'll
> >> >>> have good examples to work from.
> >> >>>
> >> >>> Thanks
> >> >>> Joe
> >> >>>
> >> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> >> >>> <an...@philips.com> wrote:
> >> >>>> Can I expect this functionality in the upcoming releases of Nifi ?
> >> >>>>
> >> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com>
> >> wrote:
> >> >>>>
> >> >>>>>Where is this 1TB dataset living today?
> >> >>>>>[anup] Resides in a filesystem
> >> >>>>>
> >> >>>>>- What is the current nature of the dataset? Is it already in large
> >> >>>>>bundles as files or is it a series of tiny messages, etc..? Does it
> >> >>>>>need to be split/merged/etc..
> >> >>>>>[anup] Archived files of size 3MB each collected over a period.
> >> Directory
> >> >>>>>(1TB) -> Sub-Directories -> Files
> >> >>>>>
> >> >>>>>- What is the format of the data? Is it something that can easily
> be
> >> >>>>>split/merged or will it require special processes to do so?
> >> >>>>>[anup] zip, tar formats.
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>--
> >> >>>>>View this message in context:
> >> >>>>>
> >> >>>
> >>
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> >> >>>>>nge-list-tp1351p2126.html
> >> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
> >> >>>>>archive at Nabble.com.
> >> >>>>>
> >> >>>>>________________________________
> >> >>>>>The information contained in this message may be confidential and
> >> legally
> >> >>>>>protected under applicable law. The message is intended solely for
> the
> >> >>>>>addressee(s). If you are not the intended recipient, you are hereby
> >> >>>>>notified that any use, forwarding, dissemination, or reproduction
> of
> >> this
> >> >>>>>message is strictly prohibited and may be unlawful. If you are not
> the
> >> >>>>>intended recipient, please contact the sender by return e-mail and
> >> >>>>>destroy all copies of the original message.
> >> >>>>
> >> >>>>
> >> >>>> ________________________________
> >> >>>> The information contained in this message may be confidential and
> >> >>> legally protected under applicable law. The message is intended
> solely
> >> for
> >> >>> the addressee(s). If you are not the intended recipient, you are
> hereby
> >> >>> notified that any use, forwarding, dissemination, or reproduction of
> >> this
> >> >>> message is strictly prohibited and may be unlawful. If you are not
> the
> >> >>> intended recipient, please contact the sender by return e-mail and
> >> destroy
> >> >>> all copies of the original message.
> >> >>>
> >>
> >>
>

Re: Fetch change list

Posted by Joe Witt <jo...@gmail.com>.
Turning noatime on kicks last mod out the window.  It is for sure the
case when dealing with file I/O that there really are no rules.  As
Adam notes, it is about giving options/strategies.

Surprisingly hard to do this well.  But good MVP options exist to get
something out and get more feedback on true need.

On Wed, Jul 29, 2015 at 10:26 AM, Adam Taft <ad...@adamtaft.com> wrote:
> Some additional feature requests for sake of consideration...
>
> For some file systems (I can think of one), the last modified date may not
> be dependable or possibly not high enough precision.  Additional strategies
> could be considered for determining whether a file has been previously
> processed.  For example, the byte size of the file, or the md5 hash, or
> possibly other signals.
>
> While these additional strategies may not be coded initially, I think they
> would add nice features for the proposed AbstractListFileProcessor.  In
> this way, the abstract processor could use one or even a combination of
> signals to determine if a file has been modified and needs to be pulled
> again.
>
> Additionally, it might be good to have other mechanisms in place to mark a
> file as unavailable.  The "dot file" convention is pretty common, but there
> might be additional ways which indicates that a file is still be
> manipulated.  i.e. maybe not all writers to the file system understand the
> dot file convention, and so other strategies might be required.
>
> For example, in one processor I worked with, it was required to pull the
> list of remote files twice in order to monitor the file sizes.  If the file
> size stayed consistent between two pulls, it could safely be considered
> ready for processing.  However, if the file size differed in the two pulls,
> we could assume that a client was still writing to the file.
>
> Adam
>
>
> On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <ma...@hotmail.com> wrote:
>
>> Joe S,
>>
>> I agree, i think the design of List/Fetch HDFS is extremely applicable to
>> this. The way it saves state is by
>> using a DistributedMapCacheServer. The intent is to run the List processor
>> on primary node only, and it
>> will store its state there so that if the primary node is changed, any
>> other node can pick up where the
>> last one left off. In order to avoid saving a massive amount of state in
>> memory, it stores the timestamp of
>> the latest file that it has fetched, as well as all files that have that
>> same timestamp (timestamp = last modified date
>> in this case). So the next time it runs, it can pull just things whose
>> lastModifiedDate is later than or equal to
>> that timestamp, but it can still know which things to avoid pulling twice
>> because we've saved that info as well.
>>
>> Now, with ListFile it will be a bit different. We tend to think of GetFile
>> and List/Fetch File as pulling from a local
>> file system. However, it is also certainly used to pull from a
>> network-mounted file system. In this case, all nodes
>> in the cluster need the ability to pull the data in unison. So in this
>> case, we will want to save the state in such a way
>> that all nodes in the cluster have access to it, in case the primary node
>> changes. But if the file is local, we don't want
>> to save state across the cluster, because each node needs its own state.
>> So that would likely just be an extra property
>> on the processor.
>>
>> If saving state locally, it's easy enough to just write to a text file
>> (recommend you allow user to specify the state file
>> and default it to conf/ListFile-<processor id>.state or something like
>> that.
>>
>> I have not documented this pattern. Specifically because we've been
>> talking for a while about implementing the Simple
>> State Management but we just haven't gotten there yet. I expected that we
>> would have that finished before writing many
>> more of these List/Fetch processors. That will radically change how we
>> handle all of this.
>>
>> But since it is not there... it may actually make sense to just refactor
>> the ListHDFS processor into an AbstractListFileProcessor
>> that is responsible for handling the state management. I am not sure how
>> complicated that would get, though. Just a
>> thought.
>>
>> Hopefully this helped to clear things up, rather than muddy them up :)
>> Feel free to fire back any questions.
>>
>> Thanks
>> -Mark
>>
>>
>> ----------------------------------------
>> > Date: Wed, 29 Jul 2015 06:42:39 -0400
>> > Subject: Re: Fetch change list
>> > From: joe.witt@gmail.com
>> > To: dev@nifi.apache.org
>> >
>> > JoeS
>> >
>> > Sounds great. I'd ignore my provenance comment as that was really
>> > more about how something external could keep tabs on progress, etc..
>> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
>> > for the good bits. But the logic to follow for saving state you'll
>> > want is probably the same.
>> >
>> > Mark - do you have the design of that thing documented anywhere? It
>> > is a good pattern to describe because it is effectively a model for
>> > taking non-scaleable dataflow interfaces and making them behave as if
>> > they were.
>> >
>> > Thanks
>> > JoeW
>> >
>> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com> wrote:
>> >> Joe,
>> >>
>> >> I'm interested in working on List/FetchFile. It seems like starting with
>> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
>> >> I'll look at List/FetchHDFS, but is there any further detail on how this
>> >> functionality should differ from GetFile? As for keeping state,
>> >> provenance was suggested, a separate state folder might work, or some
>> file
>> >> systems support additional state that might be usable.
>> >>
>> >> Regards,
>> >> Joe
>> >>
>> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com> wrote:
>> >>
>> >>> Anup,
>> >>>
>> >>> The two tickets in question appear to be:
>> >>> https://issues.apache.org/jira/browse/NIFI-631
>> >>> https://issues.apache.org/jira/browse/NIFI-673
>> >>>
>> >>> Neither have been claimed as of yet. Anybody interested in taking one
>> >>> or both of these on? It would be a lot like List/Fetch HDFS so you'll
>> >>> have good examples to work from.
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
>> >>> <an...@philips.com> wrote:
>> >>>> Can I expect this functionality in the upcoming releases of Nifi ?
>> >>>>
>> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com>
>> wrote:
>> >>>>
>> >>>>>Where is this 1TB dataset living today?
>> >>>>>[anup] Resides in a filesystem
>> >>>>>
>> >>>>>- What is the current nature of the dataset? Is it already in large
>> >>>>>bundles as files or is it a series of tiny messages, etc..? Does it
>> >>>>>need to be split/merged/etc..
>> >>>>>[anup] Archived files of size 3MB each collected over a period.
>> Directory
>> >>>>>(1TB) -> Sub-Directories -> Files
>> >>>>>
>> >>>>>- What is the format of the data? Is it something that can easily be
>> >>>>>split/merged or will it require special processes to do so?
>> >>>>>[anup] zip, tar formats.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>--
>> >>>>>View this message in context:
>> >>>>>
>> >>>
>> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
>> >>>>>nge-list-tp1351p2126.html
>> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
>> >>>>>archive at Nabble.com.
>> >>>>>
>> >>>>>________________________________
>> >>>>>The information contained in this message may be confidential and
>> legally
>> >>>>>protected under applicable law. The message is intended solely for the
>> >>>>>addressee(s). If you are not the intended recipient, you are hereby
>> >>>>>notified that any use, forwarding, dissemination, or reproduction of
>> this
>> >>>>>message is strictly prohibited and may be unlawful. If you are not the
>> >>>>>intended recipient, please contact the sender by return e-mail and
>> >>>>>destroy all copies of the original message.
>> >>>>
>> >>>>
>> >>>> ________________________________
>> >>>> The information contained in this message may be confidential and
>> >>> legally protected under applicable law. The message is intended solely
>> for
>> >>> the addressee(s). If you are not the intended recipient, you are hereby
>> >>> notified that any use, forwarding, dissemination, or reproduction of
>> this
>> >>> message is strictly prohibited and may be unlawful. If you are not the
>> >>> intended recipient, please contact the sender by return e-mail and
>> destroy
>> >>> all copies of the original message.
>> >>>
>>
>>

Re: Fetch change list

Posted by Adam Taft <ad...@adamtaft.com>.
Some additional feature requests for sake of consideration...

For some file systems (I can think of one), the last modified date may not
be dependable or may not have high enough precision.  Additional strategies
could be considered for determining whether a file has been previously
processed: for example, the byte size of the file, the md5 hash, or
possibly other signals.

While these additional strategies may not be coded initially, I think they
would add nice features for the proposed AbstractListFileProcessor.  In
this way, the abstract processor could use one or even a combination of
signals to determine if a file has been modified and needs to be pulled
again.
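
As a hedged sketch of what one such composite signal could look like (a
hypothetical helper, not part of any proposed API), combining size,
last-modified time, and an MD5 of the content into a single fingerprint
that is re-checked on each listing:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class FileFingerprint {

        /** Size + mtime + MD5; if any one signal changes, the fingerprint changes. */
        public static String of(final Path file) throws IOException, NoSuchAlgorithmException {
            final MessageDigest md5 = MessageDigest.getInstance("MD5");
            final byte[] buffer = new byte[8192];
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md5.update(buffer, 0, read);
                }
            }
            final StringBuilder hex = new StringBuilder();
            for (final byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return Files.size(file) + ":" + Files.getLastModifiedTime(file).toMillis() + ":" + hex;
        }
    }

A file would be pulled again only when its stored fingerprint differs from
the freshly computed one; hashing costs a full read, so it would need to be
an opt-in strategy.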

Additionally, it might be good to have other mechanisms in place to mark a
file as unavailable.  The "dot file" convention is pretty common, but there
might be additional ways to indicate that a file is still being
manipulated; i.e. maybe not all writers to the file system understand the
dot file convention, and so other strategies might be required.

For example, in one processor I worked with, it was required to pull the
list of remote files twice in order to monitor the file sizes.  If the file
size stayed consistent between two pulls, it could safely be considered
ready for processing.  However, if the file size differed in the two pulls,
we could assume that a client was still writing to the file.
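
A rough sketch of that two-pass check, assuming a simple in-memory map
from path to the size seen on the previous listing (illustrative only):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;

    public class SizeStabilityCheck {

        private final Map<Path, Long> lastSeenSizes = new HashMap<>();

        /** Stable once two consecutive listings report the same size. */
        public boolean isStable(final Path file) throws IOException {
            final long current = Files.size(file);
            final Long previous = lastSeenSizes.put(file, current);
            return previous != null && previous == current;
        }
    }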

Adam


On Wed, Jul 29, 2015 at 7:34 AM, Mark Payne <ma...@hotmail.com> wrote:

> Joe S,
>
> I agree, i think the design of List/Fetch HDFS is extremely applicable to
> this. The way it saves state is by
> using a DistributedMapCacheServer. The intent is to run the List processor
> on primary node only, and it
> will store its state there so that if the primary node is changed, any
> other node can pick up where the
> last one left off. In order to avoid saving a massive amount of state in
> memory, it stores the timestamp of
> the latest file that it has fetched, as well as all files that have that
> same timestamp (timestamp = last modified date
> in this case). So the next time it runs, it can pull just things whose
> lastModifiedDate is later than or equal to
> that timestamp, but it can still know which things to avoid pulling twice
> because we've saved that info as well.
>
> Now, with ListFile it will be a bit different. We tend to think of GetFile
> and List/Fetch File as pulling from a local
> file system. However, it is also certainly used to pull from a
> network-mounted file system. In this case, all nodes
> in the cluster need the ability to pull the data in unison. So in this
> case, we will want to save the state in such a way
> that all nodes in the cluster have access to it, in case the primary node
> changes. But if the file is local, we don't want
> to save state across the cluster, because each node needs its own state.
> So that would likely just be an extra property
> on the processor.
>
> If saving state locally, it's easy enough to just write to a text file
> (recommend you allow user to specify the state file
> and default it to conf/ListFile-<processor id>.state or something like
> that.
>
> I have not documented this pattern. Specifically because we've been
> talking for a while about implementing the Simple
> State Management but we just haven't gotten there yet. I expected that we
> would have that finished before writing many
> more of these List/Fetch processors. That will radically change how we
> handle all of this.
>
> But since it is not there... it may actually make sense to just refactor
> the ListHDFS processor into an AbstractListFileProcessor
> that is responsible for handling the state management. I am not sure how
> complicated that would get, though. Just a
> thought.
>
> Hopefully this helped to clear things up, rather than muddy them up :)
> Feel free to fire back any questions.
>
> Thanks
> -Mark
>
>
> ----------------------------------------
> > Date: Wed, 29 Jul 2015 06:42:39 -0400
> > Subject: Re: Fetch change list
> > From: joe.witt@gmail.com
> > To: dev@nifi.apache.org
> >
> > JoeS
> >
> > Sounds great. I'd ignore my provenance comment as that was really
> > more about how something external could keep tabs on progress, etc..
> > Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> > for the good bits. But the logic to follow for saving state you'll
> > want is probably the same.
> >
> > Mark - do you have the design of that thing documented anywhere? It
> > is a good pattern to describe because it is effectively a model for
> > taking non-scaleable dataflow interfaces and making them behave as if
> > they were.
> >
> > Thanks
> > JoeW
> >
> > On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com> wrote:
> >> Joe,
> >>
> >> I'm interested in working on List/FetchFile. It seems like starting with
> >> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
> >> I'll look at List/FetchHDFS, but is there any further detail on how this
> >> functionality should differ from GetFile? As for keeping state,
> >> provenance was suggested, a separate state folder might work, or some
> file
> >> systems support additional state that might be usable.
> >>
> >> Regards,
> >> Joe
> >>
> >> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com> wrote:
> >>
> >>> Anup,
> >>>
> >>> The two tickets in question appear to be:
> >>> https://issues.apache.org/jira/browse/NIFI-631
> >>> https://issues.apache.org/jira/browse/NIFI-673
> >>>
> >>> Neither have been claimed as of yet. Anybody interested in taking one
> >>> or both of these on? It would be a lot like List/Fetch HDFS so you'll
> >>> have good examples to work from.
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> >>> <an...@philips.com> wrote:
> >>>> Can I expect this functionality in the upcoming releases of Nifi ?
> >>>>
> >>>> On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com>
> wrote:
> >>>>
> >>>>>Where is this 1TB dataset living today?
> >>>>>[anup] Resides in a filesystem
> >>>>>
> >>>>>- What is the current nature of the dataset? Is it already in large
> >>>>>bundles as files or is it a series of tiny messages, etc..? Does it
> >>>>>need to be split/merged/etc..
> >>>>>[anup] Archived files of size 3MB each collected over a period.
> Directory
> >>>>>(1TB) -> Sub-Directories -> Files
> >>>>>
> >>>>>- What is the format of the data? Is it something that can easily be
> >>>>>split/merged or will it require special processes to do so?
> >>>>>[anup] zip, tar formats.
> >>>>>
> >>>>>
> >>>>>
> >>>>>--
> >>>>>View this message in context:
> >>>>>
> >>>
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> >>>>>nge-list-tp1351p2126.html
> >>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
> >>>>>archive at Nabble.com.
> >>>>>
> >>>>>________________________________
> >>>>>The information contained in this message may be confidential and
> legally
> >>>>>protected under applicable law. The message is intended solely for the
> >>>>>addressee(s). If you are not the intended recipient, you are hereby
> >>>>>notified that any use, forwarding, dissemination, or reproduction of
> this
> >>>>>message is strictly prohibited and may be unlawful. If you are not the
> >>>>>intended recipient, please contact the sender by return e-mail and
> >>>>>destroy all copies of the original message.
> >>>>
> >>>>
> >>>> ________________________________
> >>>> The information contained in this message may be confidential and
> >>> legally protected under applicable law. The message is intended solely
> for
> >>> the addressee(s). If you are not the intended recipient, you are hereby
> >>> notified that any use, forwarding, dissemination, or reproduction of
> this
> >>> message is strictly prohibited and may be unlawful. If you are not the
> >>> intended recipient, please contact the sender by return e-mail and
> destroy
> >>> all copies of the original message.
> >>>
>
>

RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
Joe S,

I agree, I think the design of List/Fetch HDFS is extremely applicable to this. The way it saves state is by
using a DistributedMapCacheServer. The intent is to run the List processor on primary node only, and it
will store its state there so that if the primary node is changed, any other node can pick up where the
last one left off. In order to avoid saving a massive amount of state in memory, it stores the timestamp of
the latest file that it has fetched, as well as all files that have that same timestamp (timestamp = last modified date
in this case). So the next time it runs, it can pull just things whose lastModifiedDate is later than or equal to
that timestamp, but it can still know which things to avoid pulling twice because we've saved that info as well.
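
As an illustration only (simplified, and not the actual ListHDFS source --
in ListHDFS this state would live in the DistributedMapCacheServer rather
than in memory), the saved state can be as small as one timestamp plus the
file names seen at that timestamp:

    import java.util.HashSet;
    import java.util.Set;

    public class ListingState {

        private long latestTimestamp = -1L;                     // newest last-modified time seen
        private final Set<String> filesAtLatestTimestamp = new HashSet<>();

        /** True if the file has not already been listed on a previous run. */
        public boolean shouldList(final String path, final long lastModified) {
            if (lastModified > latestTimestamp) {
                return true;                                    // strictly newer than saved state
            }
            if (lastModified == latestTimestamp) {
                return !filesAtLatestTimestamp.contains(path);  // same timestamp: dedupe by name
            }
            return false;                                       // older than anything we kept
        }

        /** Record a listed file so the next run (or another node) can skip it. */
        public void markListed(final String path, final long lastModified) {
            if (lastModified > latestTimestamp) {
                latestTimestamp = lastModified;
                filesAtLatestTimestamp.clear();
            }
            filesAtLatestTimestamp.add(path);
        }
    }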

Now, with ListFile it will be a bit different. We tend to think of GetFile and List/Fetch File as pulling from a local
file system. However, it is also certainly used to pull from a network-mounted file system. In this case, all nodes
in the cluster need the ability to pull the data in unison. So in this case, we will want to save the state in such a way
that all nodes in the cluster have access to it, in case the primary node changes. But if the file is local, we don't want
to save state across the cluster, because each node needs its own state. So that would likely just be an extra property
on the processor.
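
For the local-vs-cluster choice, that extra property might look roughly
like the following (the property name and values are illustrative, not
from any released processor):

    import org.apache.nifi.components.PropertyDescriptor;

    public class ListFileProperties {

        // Hypothetical property: lets the user say whether listing state is
        // per-node (local disk) or shared across the cluster (network-mounted disk).
        static final PropertyDescriptor STATE_SCOPE = new PropertyDescriptor.Builder()
                .name("State Scope")
                .description("Whether listing state is kept per node or shared across the cluster.")
                .allowableValues("local", "cluster")
                .defaultValue("local")
                .required(true)
                .build();
    }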

If saving state locally, it's easy enough to just write to a text file (recommend you allow the user to specify the state file
and default it to conf/ListFile-<processor id>.state or something like that).
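
Something like the following sketch, assuming a hypothetical state file
and java.util.Properties as the on-disk format:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Properties;

    public class LocalStateFile {

        /** Persist the latest-listed timestamp to a small properties file. */
        public static void save(final Path stateFile, final long timestamp) throws IOException {
            final Properties props = new Properties();
            props.setProperty("latest.timestamp", Long.toString(timestamp));
            try (OutputStream out = Files.newOutputStream(stateFile)) {
                props.store(out, "ListFile processor state");
            }
        }

        /** Load it back; -1 means no state yet, i.e. list everything. */
        public static long load(final Path stateFile) throws IOException {
            if (!Files.exists(stateFile)) {
                return -1L;
            }
            final Properties props = new Properties();
            try (InputStream in = Files.newInputStream(stateFile)) {
                props.load(in);
            }
            return Long.parseLong(props.getProperty("latest.timestamp", "-1"));
        }
    }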

I have not documented this pattern, specifically because we've been talking for a while about implementing the Simple
State Management feature, but we just haven't gotten there yet. I expected that we would have that finished before writing many
more of these List/Fetch processors. That will radically change how we handle all of this.

But since it is not there... it may actually make sense to just refactor the ListHDFS processor into an AbstractListFileProcessor
that is responsible for handling the state management. I am not sure how complicated that would get, though. Just a
thought.

Hopefully this helped to clear things up, rather than muddy them up :) Feel free to fire back any questions.

Thanks
-Mark


----------------------------------------
> Date: Wed, 29 Jul 2015 06:42:39 -0400
> Subject: Re: Fetch change list
> From: joe.witt@gmail.com
> To: dev@nifi.apache.org
>
> JoeS
>
> Sounds great. I'd ignore my provenance comment as that was really
> more about how something external could keep tabs on progress, etc..
> Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
> for the good bits. But the logic to follow for saving state you'll
> want is probably the same.
>
> Mark - do you have the design of that thing documented anywhere? It
> is a good pattern to describe because it is effectively a model for
> taking non-scaleable dataflow interfaces and making them behave as if
> they were.
>
> Thanks
> JoeW
>
> On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com> wrote:
>> Joe,
>>
>> I'm interested in working on List/FetchFile. It seems like starting with
>> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
>> I'll look at List/FetchHDFS, but is there any further detail on how this
>> functionality should differ from GetFile? As for keeping state,
>> provenance was suggested, a separate state folder might work, or some file
>> systems support additional state that might be usable.
>>
>> Regards,
>> Joe
>>
>> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com> wrote:
>>
>>> Anup,
>>>
>>> The two tickets in question appear to be:
>>> https://issues.apache.org/jira/browse/NIFI-631
>>> https://issues.apache.org/jira/browse/NIFI-673
>>>
>>> Neither have been claimed as of yet. Anybody interested in taking one
>>> or both of these on? It would be a lot like List/Fetch HDFS so you'll
>>> have good examples to work from.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
>>> <an...@philips.com> wrote:
>>>> Can I expect this functionality in the upcoming releases of Nifi ?
>>>>
>>>> On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com> wrote:
>>>>
>>>>>Where is this 1TB dataset living today?
>>>>>[anup] Resides in a filesystem
>>>>>
>>>>>- What is the current nature of the dataset? Is it already in large
>>>>>bundles as files or is it a series of tiny messages, etc..? Does it
>>>>>need to be split/merged/etc..
>>>>>[anup] Archived files of size 3MB each collected over a period. Directory
>>>>>(1TB) -> Sub-Directories -> Files
>>>>>
>>>>>- What is the format of the data? Is it something that can easily be
>>>>>split/merged or will it require special processes to do so?
>>>>>[anup] zip, tar formats.
>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>View this message in context:
>>>>>
>>> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
>>>>>nge-list-tp1351p2126.html
>>>>>Sent from the Apache NiFi (incubating) Developer List mailing list
>>>>>archive at Nabble.com.
>>>>>
>>>>>________________________________
>>>>>The information contained in this message may be confidential and legally
>>>>>protected under applicable law. The message is intended solely for the
>>>>>addressee(s). If you are not the intended recipient, you are hereby
>>>>>notified that any use, forwarding, dissemination, or reproduction of this
>>>>>message is strictly prohibited and may be unlawful. If you are not the
>>>>>intended recipient, please contact the sender by return e-mail and
>>>>>destroy all copies of the original message.
>>>>
>>>>
>>>> ________________________________
>>>> The information contained in this message may be confidential and
>>> legally protected under applicable law. The message is intended solely for
>>> the addressee(s). If you are not the intended recipient, you are hereby
>>> notified that any use, forwarding, dissemination, or reproduction of this
>>> message is strictly prohibited and may be unlawful. If you are not the
>>> intended recipient, please contact the sender by return e-mail and destroy
>>> all copies of the original message.
>>>

Re: Fetch change list

Posted by Joe Witt <jo...@gmail.com>.
JoeS

Sounds great.  I'd ignore my provenance comment as that was really
more about how something external could keep tabs on progress, etc..
Mark Payne designed/built the List/Fetch HDFS one so I'll defer to him
for the good bits.  But the logic to follow for saving state you'll
want is probably the same.

Mark - do you have the design of that thing documented anywhere?  It
is a good pattern to describe because it is effectively a model for
taking non-scaleable dataflow interfaces and making them behave as if
they were.

Thanks
JoeW

On Wed, Jul 29, 2015 at 6:07 AM, Joe Skora <js...@gmail.com> wrote:
> Joe,
>
> I'm interested in working on List/FetchFile.  It seems like starting with
> [NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
> I'll look at List/FetchHDFS, but is there any further detail on how this
> functionality should differ from GetFile?   As for keeping state,
> provenance was suggested, a separate state folder might work, or some file
> systems support additional state that might be usable.
>
> Regards,
> Joe
>
> On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com> wrote:
>
>> Anup,
>>
>> The two tickets in question appear to be:
>> https://issues.apache.org/jira/browse/NIFI-631
>> https://issues.apache.org/jira/browse/NIFI-673
>>
>> Neither have been claimed as of yet.  Anybody interested in taking one
>> or both of these on?  It would be a lot like List/Fetch HDFS so you'll
>> have good examples to work from.
>>
>> Thanks
>> Joe
>>
>> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
>> <an...@philips.com> wrote:
>> > Can I expect this functionality in the upcoming releases of Nifi ?
>> >
>> > On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com> wrote:
>> >
>> >>Where is this 1TB dataset living today?
>> >>[anup] Resides in a filesystem
>> >>
>> >>- What is the current nature of the dataset?  Is it already in large
>> >>bundles as files or is it a series of tiny messages, etc..?  Does it
>> >>need to be split/merged/etc..
>> >>[anup] Archived files of size 3MB each collected over a period. Directory
>> >>(1TB) -> Sub-Directories  -> Files
>> >>
>> >>- What is the format of the data?  Is it something that can easily be
>> >>split/merged or will it require special processes to do so?
>> >>[anup] zip, tar formats.
>> >>
>> >>
>> >>
>> >>--
>> >>View this message in context:
>> >>
>> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
>> >>nge-list-tp1351p2126.html
>> >>Sent from the Apache NiFi (incubating) Developer List mailing list
>> >>archive at Nabble.com.
>> >>
>> >>________________________________
>> >>The information contained in this message may be confidential and legally
>> >>protected under applicable law. The message is intended solely for the
>> >>addressee(s). If you are not the intended recipient, you are hereby
>> >>notified that any use, forwarding, dissemination, or reproduction of this
>> >>message is strictly prohibited and may be unlawful. If you are not the
>> >>intended recipient, please contact the sender by return e-mail and
>> >>destroy all copies of the original message.
>> >
>> >
>> > ________________________________
>> > The information contained in this message may be confidential and
>> legally protected under applicable law. The message is intended solely for
>> the addressee(s). If you are not the intended recipient, you are hereby
>> notified that any use, forwarding, dissemination, or reproduction of this
>> message is strictly prohibited and may be unlawful. If you are not the
>> intended recipient, please contact the sender by return e-mail and destroy
>> all copies of the original message.
>>

Re: Fetch change list

Posted by Joe Skora <js...@gmail.com>.
Joe,

I'm interested in working on List/FetchFile.  It seems like starting with
[NIFI-631|https://issues.apache.org/jira/browse/NIFI-631] makes sense.
I'll look at List/FetchHDFS, but is there any further detail on how this
functionality should differ from GetFile?  As for keeping state,
provenance was suggested, a separate state folder might work, and some file
systems support additional state that might be usable.

Regards,
Joe

On Tue, Jul 28, 2015 at 12:42 AM, Joe Witt <jo...@gmail.com> wrote:

> Anup,
>
> The two tickets in question appear to be:
> https://issues.apache.org/jira/browse/NIFI-631
> https://issues.apache.org/jira/browse/NIFI-673
>
> Neither have been claimed as of yet.  Anybody interested in taking one
> or both of these on?  It would be a lot like List/Fetch HDFS so you'll
> have good examples to work from.
>
> Thanks
> Joe
>
> On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
> <an...@philips.com> wrote:
> > Can I expect this functionality in the upcoming releases of Nifi ?
> >
> > On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com> wrote:
> >
> >>Where is this 1TB dataset living today?
> >>[anup] Resides in a filesystem
> >>
> >>- What is the current nature of the dataset?  Is it already in large
> >>bundles as files or is it a series of tiny messages, etc..?  Does it
> >>need to be split/merged/etc..
> >>[anup] Archived files of size 3MB each collected over a period. Directory
> >>(1TB) -> Sub-Directories  -> Files
> >>
> >>- What is the format of the data?  Is it something that can easily be
> >>split/merged or will it require special processes to do so?
> >>[anup] zip, tar formats.
> >>
> >>
> >>
> >>--
> >>View this message in context:
> >>
> http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
> >>nge-list-tp1351p2126.html
> >>Sent from the Apache NiFi (incubating) Developer List mailing list
> >>archive at Nabble.com.
> >>
> >>________________________________
> >>The information contained in this message may be confidential and legally
> >>protected under applicable law. The message is intended solely for the
> >>addressee(s). If you are not the intended recipient, you are hereby
> >>notified that any use, forwarding, dissemination, or reproduction of this
> >>message is strictly prohibited and may be unlawful. If you are not the
> >>intended recipient, please contact the sender by return e-mail and
> >>destroy all copies of the original message.
> >
> >
> > ________________________________
> > The information contained in this message may be confidential and
> legally protected under applicable law. The message is intended solely for
> the addressee(s). If you are not the intended recipient, you are hereby
> notified that any use, forwarding, dissemination, or reproduction of this
> message is strictly prohibited and may be unlawful. If you are not the
> intended recipient, please contact the sender by return e-mail and destroy
> all copies of the original message.
>

Re: Fetch change list

Posted by Joe Witt <jo...@gmail.com>.
Anup,

The two tickets in question appear to be:
https://issues.apache.org/jira/browse/NIFI-631
https://issues.apache.org/jira/browse/NIFI-673

Neither have been claimed as of yet.  Anybody interested in taking one
or both of these on?  It would be a lot like List/Fetch HDFS so you'll
have good examples to work from.

Thanks
Joe

On Tue, Jul 28, 2015 at 12:37 AM, Sethuram, Anup
<an...@philips.com> wrote:
> Can I expect this functionality in the upcoming releases of Nifi ?
>
> On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com> wrote:
>
>>Where is this 1TB dataset living today?
>>[anup] Resides in a filesystem
>>
>>- What is the current nature of the dataset?  Is it already in large
>>bundles as files or is it a series of tiny messages, etc..?  Does it
>>need to be split/merged/etc..
>>[anup] Archived files of size 3MB each collected over a period. Directory
>>(1TB) -> Sub-Directories  -> Files
>>
>>- What is the format of the data?  Is it something that can easily be
>>split/merged or will it require special processes to do so?
>>[anup] zip, tar formats.
>>
>>
>>
>>--
>>View this message in context:
>>http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
>>nge-list-tp1351p2126.html
>>Sent from the Apache NiFi (incubating) Developer List mailing list
>>archive at Nabble.com.
>>

Re: Fetch change list

Posted by "Sethuram, Anup" <an...@philips.com>.
Can I expect this functionality in an upcoming release of NiFi?

On 13/07/15 9:13 am, "Sethuram, Anup" <an...@philips.com> wrote:

>Where is this 1TB dataset living today?
>[anup] Resides in a filesystem
>
>- What is the current nature of the dataset?  Is it already in large
>bundles as files or is it a series of tiny messages, etc..?  Does it
>need to be split/merged/etc..
>[anup] Archived files of size 3MB each collected over a period. Directory
>(1TB) -> Sub-Directories  -> Files
>
>- What is the format of the data?  Is it something that can easily be
>split/merged or will it require special processes to do so?
>[anup] zip, tar formats.
>
>
>
>--
>View this message in context:
>http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-cha
>nge-list-tp1351p2126.html
>Sent from the Apache NiFi (incubating) Developer List mailing list
>archive at Nabble.com.
>

Re: Fetch change list

Posted by anup s <an...@philips.com>.
Where is this 1TB dataset living today?  
[anup] Resides in a filesystem

- What is the current nature of the dataset?  Is it already in large 
bundles as files or is it a series of tiny messages, etc..?  Does it 
need to be split/merged/etc.. 
[anup] Archived files of size 3MB each collected over a period. Directory
(1TB) -> Sub-Directories  -> Files

- What is the format of the data?  Is it something that can easily be 
split/merged or will it require special processes to do so? 
[anup] zip, tar formats. 



--
View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p2126.html
Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.

RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
Anup,

I have created a ticket for creating two new Processors: ListFile, FetchFile. These should provide a much nicer user experience for what you're trying to do here.

The ticket is NIFI-631:  https://issues.apache.org/jira/browse/NIFI-631

Thanks
-Mark

----------------------------------------
> Date: Tue, 2 Jun 2015 07:41:45 -0700
> From: anup.sethuram@philips.com
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
>
> Suppose, I have 1 TB of data that I need to backup/sync to a HDFS location
> and then be passed onto a Kafka, is there a way out to do that?
>
>
>
>
> --
> View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1706.html
> Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.
 		 	   		  


Re: Fetch change list

Posted by Joe Witt <jo...@gmail.com>.
Anup,

Cross posting this to users since it is a great user question.

That answer is: Absolutely.

So couple of details to iron out to get started.  I'll ask the
question and explain why.  First some background:
- Kafka wants the small events themselves ideally.
- HDFS wants those events bundled together typically along whatever
block size you have in HDFS.
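
In flow terms that background usually translates into something like
this (an untested outline using the stock 0.x processors; the right
shape depends on the answers to the questions below):

  source (GetFile today; the proposed ListFile/FetchFile later)
    -> MergeContent (bundle small files up toward the HDFS block size) -> PutHDFS
    -> SplitText (break bundles into individual events) -> PutKafka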

The questions:
- Where is this 1TB dataset living today?  This will help determine
best way to pull the dataset in.

- What is the current nature of the dataset?  Is it already in large
bundles as files or is it a series of tiny messages, etc..?  Does it
need to be split/merged/etc..

- What is the format of the data?  Is it something that can easily be
split/merged or will it require special processes to do so?

These are good to start with.

Thanks
Joe


On Tue, Jun 2, 2015 at 10:41 AM, anup s <an...@philips.com> wrote:
> Suppose, I have 1 TB of data that I need to backup/sync to a HDFS location
> and then be passed onto a Kafka, is there a way out to do that?
>
>
>
>
> --
> View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1706.html
> Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.


Re: Fetch change list

Posted by anup s <an...@philips.com>.
Suppose I have 1 TB of data that I need to back up/sync to an HDFS location
and then pass on to Kafka. Is there a way to do that?




--
View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1706.html
Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.

Re: Fetch change list

Posted by "Sethuram, Anup" <an...@philips.com>.
Thanks Mark for those tips. I mostly tried the third option, but it
didn't work well because the data we are playing with is huge.

The problem with options 1 and 2 is that we cannot move or update that
directory.



On 22/05/15 7:25 pm, "Mark Payne" <ma...@hotmail.com> wrote:

>The List/Fetch HDFS would allow you to pull new data from HDFS without
>destroying it.
>



RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
Anup,

The List/Fetch HDFS would allow you to pull new data from HDFS without destroying it.

But it sounds like what you want here is to also pull from disk without removing it. The GetFile processor does
not currently keep any state about what it's pulled in. It would likely be a fairly easy modification to GetFile, if it is
reading from a local filesystem. If reading from a network-mounted file system like NFS, then it gets much more complex, as
the state would have to be shared across the cluster, as with ListHDFS.
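
To illustrate the kind of state such a modified GetFile would have to keep, here is a minimal, untested sketch in
plain Java (the class is hypothetical and not part of the NiFi API; a real processor would also need to persist this
state across restarts and, for NFS, share it across the cluster):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of delta listing: emit only files whose
// last-modified time is newer than the previous scan, and never
// move or delete the source files.
public class DeltaLister {

    private long lastScanTime = 0L;

    public List<File> listNewFiles(final File directory) {
        final List<File> changed = new ArrayList<>();
        final long scanStart = System.currentTimeMillis();
        final File[] files = directory.listFiles();
        if (files != null) {
            for (final File f : files) {
                if (f.isFile() && f.lastModified() > lastScanTime) {
                    changed.add(f);
                }
            }
        }
        // Remember where this scan left off so the next scan only
        // picks up files that changed afterwards.
        lastScanTime = scanStart;
        return changed;
    }
}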

A few possible solutions that I could offer in the meantime (I realize none is great but should work):

1. If you can move the data, you could use GetFile and then immediately route to PutFile. PutFile would then put the data to a different directory.

2. Similar to #1, you could use GetFile -> UpdateAttribute -> PutFile, and put the data back to the same directory but use UpdateAttribute to change
the filename, perhaps to "${filename}.pulled", and then configure GetFile to ignore files that end with ".pulled" (a sketch of such a filter regex follows this list)

3. Use GetFile and configure it with a "Maximum File Age" of, say, 10 minutes, and only run every 5 minutes. Then, use DetectDuplicate
and throw away any duplicate. The downside here is that you would potentially pull in the data a couple of times, which means that you're
not being super efficient. If there is a huge amount of data coming in, this may be less than ideal. But if the data is coming in slowly, like
10 MB/sec, then maybe this is fine.
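
For option 2, GetFile's "File Filter" property takes a Java regular expression, so the filter could look something
like this (untested sketch; the ".pulled" suffix is just the example name from above):

import java.util.regex.Pattern;

// Sketch: match any filename that does NOT end in ".pulled", using a
// negative lookbehind anchored at the end of the name.
public class PulledFileFilter {

    private static final Pattern NOT_PULLED = Pattern.compile("^.*(?<!\\.pulled)$");

    public static void main(final String[] args) {
        System.out.println(NOT_PULLED.matcher("file1.log").matches());        // true
        System.out.println(NOT_PULLED.matcher("file1.log.pulled").matches()); // false
    }
}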

Does any of this help?

Thanks
-Mark

----------------------------------------
> Date: Thu, 21 May 2015 20:01:30 -0700
> From: anup.sethuram@philips.com
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
>
> Hi Mark,
> I downloaded the latest version and I see that the FetchHDFS processor
> could be used for my delta files that have arrived to the HDFS. But how do I
> maintain a *sync * from a local file system to my HDFS. I cannot move files
> from the local filesystem. It needs to be copied.
>
> I'm facing issues with queueing trying to maintain a sync.
>
> Any thoughts on how I could tackle this issue?
>
> Regards,
> anup
>
>
>
> --
> View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1615.html
> Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.
 		 	   		  

Re: Fetch change list

Posted by anup s <an...@philips.com>.
Hi Mark,
   I downloaded the latest version and I see that the FetchHDFS processor
could be used for my delta files that have arrived in HDFS. But how do I
maintain a *sync* from a local file system to my HDFS? I cannot move files
from the local filesystem; they need to be copied.

I'm facing queueing issues while trying to maintain a sync.

Any thoughts on how I could tackle this issue?

Regards,
anup



--
View this message in context: http://apache-nifi-incubating-developer-list.39713.n7.nabble.com/Fetch-change-list-tp1351p1615.html
Sent from the Apache NiFi (incubating) Developer List mailing list archive at Nabble.com.

Re: Fetch change list

Posted by "Sethuram, Anup" <an...@philips.com>.
Thanks Mark for that one; that should be a big relief. I'd be waiting to
check that out!

Regards,
anup

On 05/05/15 9:09 pm, "Mark Payne" <ma...@hotmail.com> wrote:

>Anup,
>With the 0.1.0 release that we are working on right now, there are two
>new processors: ListHDFS, FetchHDFS, that are able to keep state about
>what has been pulled from HDFS. This way you can keep the data in HDFS
>and still only pull in new data. Will this help?
>Thanks-Mark
>
>> From: anup.sethuram@philips.com
>> To: dev@nifi.incubator.apache.org
>> Subject: RE: Fetch change list
>> Date: Tue, 5 May 2015 15:32:07 +0000
>>
>> Thanks Corey for that info. But the major problem I'm facing is I am
>>backing up a large set of data into HDFS (with a GetHDFS , source
>>retained as true) and then trying to fetch the delta from it. (get only
>>the files which have arrived recently by using the min Age and max Age).
>>But I'm unable to get the exact delta if I have 'keep source file' as
>>true..
>> I played around a lot with schedule time and min & max age but didn't
>>help.
>>
>> -----Original Message-----
>> From: Corey Flowers [mailto:cflowers@onyxpoint.com]
>> Sent: Tuesday, May 05, 2015 5:35 PM
>> To: dev@nifi.incubator.apache.org
>> Subject: Re: Fetch change list
>>
>> Ok, the get file that is running, is basically causing a race condition
>>between all of the servers in your cluster. That is why you are seeing
>>the "NoSuchFile" error. If you change the scheduling strategy on that
>>processor to "On Primary node" Then the only system that will try to
>>pick up data from that mount point, is the server you have designated
>>"primary node".
>> This should fix that issue.
>>
>> On Mon, May 4, 2015 at 11:30 PM, Sethuram, Anup
>><an...@philips.com>
>> wrote:
>>
>> > Yes Corey, Right now the pickup directory is from a network share
>> > mount point. The data is picked up from one location and transferred
>> > to the other. I'm using site-to-site communication.
>> >
>> > -----Original Message-----
>> > From: Corey Flowers [mailto:cflowers@onyxpoint.com]
>> > Sent: Monday, May 04, 2015 7:57 PM
>> > To: dev@nifi.incubator.apache.org
>> > Subject: Re: Fetch change list
>> >
>> > Good morning Anup!
>> >
>> >          Is the pickup directory coming from a network share mount
>>point?
>> >
>> > On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup
>> > <anup.sethuram@philips.com
>> > >
>> > wrote:
>> >
>> > > Hi ,
>> > >                 I'm trying to fetch a set of files which have
>> > > recently changed in a "filesystem". Also I'm supposed to keep the
>> > > original copy as it is.
>> > > For obtaining the latest files that have changed, I'm using a
>> > > PutFile with "replace" strategy piped to a GetFile with a minimum
>> > > age of 5 sec,  max file age of 30 sec, Keep source file as true,
>> > >
>> > > Also, running it in clustered mode. I'm seeing the below issues
>> > >
>> > > -          The queue starts growing if there's an error.
>> > >
>> > > -          Continuous errors with 'NoSuchFileException'
>> > >
>> > > -          Penalizing StandardFlowFileErrors
>> > >
>> > >
>> > >
>> > >
>> > > ERROR
>> > >
>> > > 0ab3b920-1f05-4f24-b861-4fded3d5d826
>> > >
>> > > 161.91.234.248:7087
>> > >
>> > > GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve
>> > > files due to
>> > > org.apache.nifi.processor.exception.FlowFileAccessException: Failed
>> > > to import data from /nifi/UNZ/log201403230000.log for
>> > > StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,cla
>> > > im =,offset=0,name=6908587554457536,size=0]
>> > > due to java.nio.file.NoSuchFileException:
>> > > /nifi/UNZ/log201403230000.log
>> > >
>> > > 18:45:56 IST
>> > >
>> > >
>> > >
>> > > 10:54:50 IST
>> > >
>> > > ERROR
>> > >
>> > > c552b5bc-f627-3cc3-b3d0-545c519eafd9
>> > >
>> > > 161.91.234.248:6087
>> > >
>> > > PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
>> > > StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,cla
>> > > im =1430717088883-73580,offset=0,name=file1.log,size=29314779]
>> > > and transferring to failure due to
>> > > org.apache.nifi.processor.exception.ProcessException: Could not
>> > > rename
>> > > /nifi/UNZ/.file1.log:
>> > org.apache.nifi.processor.exception.ProcessException:
>> > > Could not rename: /nifi/UNZ/.file1.log
>> > >
>> > > 10:54:56 IST
>> > >
>> > > ERROR
>> > >
>> > > 60662bb3-490a-3b47-9371-e11c12cdfa1a
>> > >
>> > > 161.91.234.248:7087
>> > >
>> > > PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
>> > > StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,cla
>> > > im =1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
>> > > and transferring to failure due to
>> > > org.apache.nifi.processor.exception.ProcessException: Could not
>>rename:
>> > > /data/softwares/RS/nifi/OUT/.file2.log:
>> > > org.apache.nifi.processor.exception.ProcessException: Could not
>>rename:
>> > > /nifi/OUT/.file2.log
>> > >
>> > >
>> > >
>> > > Do I have to tweak the Run schedule or keep the same minimum file
>> > > age and maximum file age to overcome this issue?
>> > > What might be an elegant solution in NiFi?
>> > >
>> > >
>> > > Thanks,
>> > > anup
>> > >
>> > >
>> >
>> >
>> >
>> > --
>> > Corey Flowers
>> > Vice President, Onyx Point, Inc
>> > (410) 541-6699
>> > cflowers@onyxpoint.com
>> >
>> > -- This account not approved for unencrypted proprietary information
>> > --
>> >
>> >
>>
>>
>>
>> --
>> Corey Flowers
>> Vice President, Onyx Point, Inc
>> (410) 541-6699
>> cflowers@onyxpoint.com
>>
>> -- This account not approved for unencrypted proprietary information --
>>
>



RE: Fetch change list

Posted by Mark Payne <ma...@hotmail.com>.
Anup,
With the 0.1.0 release that we are working on right now, there are two new processors, ListHDFS and FetchHDFS, that are able to keep state about what has been pulled from HDFS. This way you can keep the data in HDFS and still only pull in new data. Will this help?
Thanks,
-Mark

> From: anup.sethuram@philips.com
> To: dev@nifi.incubator.apache.org
> Subject: RE: Fetch change list
> Date: Tue, 5 May 2015 15:32:07 +0000
> 
> Thanks Corey for that info. But the major problem I'm facing is I am backing up a large set of data into HDFS (with a GetHDFS , source retained as true) and then trying to fetch the delta from it. (get only the files which have arrived recently by using the min Age and max Age). But I'm unable to get the exact delta if I have 'keep source file' as true..
> I played around a lot with schedule time and min & max age but didn't help.
> 
> -----Original Message-----
> From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> Sent: Tuesday, May 05, 2015 5:35 PM
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
> 
> Ok, the get file that is running, is basically causing a race condition between all of the servers in your cluster. That is why you are seeing the "NoSuchFile" error. If you change the scheduling strategy on that processor to "On Primary node" Then the only system that will try to pick up data from that mount point, is the server you have designated "primary node".
> This should fix that issue.
> 
> On Mon, May 4, 2015 at 11:30 PM, Sethuram, Anup <an...@philips.com>
> wrote:
> 
> > Yes Corey, Right now the pickup directory is from a network share
> > mount point. The data is picked up from one location and transferred
> > to the other. I'm using site-to-site communication.
> >
> > -----Original Message-----
> > From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> > Sent: Monday, May 04, 2015 7:57 PM
> > To: dev@nifi.incubator.apache.org
> > Subject: Re: Fetch change list
> >
> > Good morning Anup!
> >
> >          Is the pickup directory coming from a network share mount point?
> >
> > On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup
> > <anup.sethuram@philips.com
> > >
> > wrote:
> >
> > > Hi ,
> > >                 I'm trying to fetch a set of files which have
> > > recently changed in a "filesystem". Also I'm supposed to keep the
> > > original copy as it is.
> > > For obtaining the latest files that have changed, I'm using a
> > > PutFile with "replace" strategy piped to a GetFile with a minimum
> > > age of 5 sec,  max file age of 30 sec, Keep source file as true,
> > >
> > > Also, running it in clustered mode. I'm seeing the below issues
> > >
> > > -          The queue starts growing if there's an error.
> > >
> > > -          Continuous errors with 'NoSuchFileException'
> > >
> > > -          Penalizing StandardFlowFileErrors
> > >
> > >
> > >
> > >
> > > ERROR
> > >
> > > 0ab3b920-1f05-4f24-b861-4fded3d5d826
> > >
> > > 161.91.234.248:7087
> > >
> > > GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve
> > > files due to
> > > org.apache.nifi.processor.exception.FlowFileAccessException: Failed
> > > to import data from /nifi/UNZ/log201403230000.log for
> > > StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,cla
> > > im =,offset=0,name=6908587554457536,size=0]
> > > due to java.nio.file.NoSuchFileException:
> > > /nifi/UNZ/log201403230000.log
> > >
> > > 18:45:56 IST
> > >
> > >
> > >
> > > 10:54:50 IST
> > >
> > > ERROR
> > >
> > > c552b5bc-f627-3cc3-b3d0-545c519eafd9
> > >
> > > 161.91.234.248:6087
> > >
> > > PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
> > > StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,cla
> > > im =1430717088883-73580,offset=0,name=file1.log,size=29314779]
> > > and transferring to failure due to
> > > org.apache.nifi.processor.exception.ProcessException: Could not
> > > rename
> > > /nifi/UNZ/.file1.log:
> > org.apache.nifi.processor.exception.ProcessException:
> > > Could not rename: /nifi/UNZ/.file1.log
> > >
> > > 10:54:56 IST
> > >
> > > ERROR
> > >
> > > 60662bb3-490a-3b47-9371-e11c12cdfa1a
> > >
> > > 161.91.234.248:7087
> > >
> > > PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
> > > StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,cla
> > > im =1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
> > > and transferring to failure due to
> > > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > > /data/softwares/RS/nifi/OUT/.file2.log:
> > > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > > /nifi/OUT/.file2.log
> > >
> > >
> > >
> > > Do I have to tweak the Run schedule or keep the same minimum file
> > > age and maximum file age to overcome this issue?
> > > What might be an elegant solution in NiFi?
> > >
> > >
> > > Thanks,
> > > anup
> > >
> > >
> >
> >
> >
> > --
> > Corey Flowers
> > Vice President, Onyx Point, Inc
> > (410) 541-6699
> > cflowers@onyxpoint.com
> >
> > -- This account not approved for unencrypted proprietary information
> > --
> >
> >
> 
> 
> 
> --
> Corey Flowers
> Vice President, Onyx Point, Inc
> (410) 541-6699
> cflowers@onyxpoint.com
> 
> -- This account not approved for unencrypted proprietary information --
> 
 		 	   		  

RE: Fetch change list

Posted by "Sethuram, Anup" <an...@philips.com>.
Thanks Corey for that info. But the major problem I'm facing is that I am backing up a large set of data into HDFS (with a GetHDFS, source retained as true) and then trying to fetch the delta from it (that is, get only the files which have arrived recently, by using the minimum and maximum file age). But I'm unable to get the exact delta if I have 'keep source file' set to true.
I played around a lot with the run schedule and the min & max age, but it didn't help.
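For example (illustrative numbers, not my real settings): with a minimum file age of 5 sec, a maximum file age of 30 sec, and a run schedule of 10 sec, each file stays eligible for about 25 seconds and so gets picked up by two or three consecutive runs (duplicates); with a run schedule of 60 sec, a file can age past the 30-second maximum between runs and never get picked up at all (gaps). Either way, the age window alone doesn't give an exact delta.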

-----Original Message-----
From: Corey Flowers [mailto:cflowers@onyxpoint.com]
Sent: Tuesday, May 05, 2015 5:35 PM
To: dev@nifi.incubator.apache.org
Subject: Re: Fetch change list

Ok, the get file that is running, is basically causing a race condition between all of the servers in your cluster. That is why you are seeing the "NoSuchFile" error. If you change the scheduling strategy on that processor to "On Primary node" Then the only system that will try to pick up data from that mount point, is the server you have designated "primary node".
This should fix that issue.

On Mon, May 4, 2015 at 11:30 PM, Sethuram, Anup <an...@philips.com>
wrote:

> Yes Corey, Right now the pickup directory is from a network share
> mount point. The data is picked up from one location and transferred
> to the other. I'm using site-to-site communication.
>
> -----Original Message-----
> From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> Sent: Monday, May 04, 2015 7:57 PM
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
>
> Good morning Anup!
>
>          Is the pickup directory coming from a network share mount point?
>
> On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup
> <anup.sethuram@philips.com
> >
> wrote:
>
> > Hi ,
> >                 I'm trying to fetch a set of files which have
> > recently changed in a "filesystem". Also I'm supposed to keep the
> > original copy as it is.
> > For obtaining the latest files that have changed, I'm using a
> > PutFile with "replace" strategy piped to a GetFile with a minimum
> > age of 5 sec,  max file age of 30 sec, Keep source file as true,
> >
> > Also, running it in clustered mode. I'm seeing the below issues
> >
> > -          The queue starts growing if there's an error.
> >
> > -          Continuous errors with 'NoSuchFileException'
> >
> > -          Penalizing StandardFlowFileErrors
> >
> >
> >
> >
> > ERROR
> >
> > 0ab3b920-1f05-4f24-b861-4fded3d5d826
> >
> > 161.91.234.248:7087
> >
> > GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve
> > files due to
> > org.apache.nifi.processor.exception.FlowFileAccessException: Failed
> > to import data from /nifi/UNZ/log201403230000.log for
> > StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,cla
> > im =,offset=0,name=6908587554457536,size=0]
> > due to java.nio.file.NoSuchFileException:
> > /nifi/UNZ/log201403230000.log
> >
> > 18:45:56 IST
> >
> >
> >
> > 10:54:50 IST
> >
> > ERROR
> >
> > c552b5bc-f627-3cc3-b3d0-545c519eafd9
> >
> > 161.91.234.248:6087
> >
> > PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
> > StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,cla
> > im =1430717088883-73580,offset=0,name=file1.log,size=29314779]
> > and transferring to failure due to
> > org.apache.nifi.processor.exception.ProcessException: Could not
> > rename
> > /nifi/UNZ/.file1.log:
> org.apache.nifi.processor.exception.ProcessException:
> > Could not rename: /nifi/UNZ/.file1.log
> >
> > 10:54:56 IST
> >
> > ERROR
> >
> > 60662bb3-490a-3b47-9371-e11c12cdfa1a
> >
> > 161.91.234.248:7087
> >
> > PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
> > StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,cla
> > im =1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
> > and transferring to failure due to
> > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > /data/softwares/RS/nifi/OUT/.file2.log:
> > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > /nifi/OUT/.file2.log
> >
> >
> >
> > Do I have to tweak the Run schedule or keep the same minimum file
> > age and maximum file age to overcome this issue?
> > What might be an elegant solution in NiFi?
> >
> >
> > Thanks,
> > anup
> >
> >
>
>
>
> --
> Corey Flowers
> Vice President, Onyx Point, Inc
> (410) 541-6699
> cflowers@onyxpoint.com
>
> -- This account not approved for unencrypted proprietary information
> --
>
>



--
Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699
cflowers@onyxpoint.com

-- This account not approved for unencrypted proprietary information --


Re: Fetch change list

Posted by Corey Flowers <cf...@onyxpoint.com>.
Ok, the GetFile that is running is basically causing a race condition
between all of the servers in your cluster. That is why you are seeing the
"NoSuchFile" error. If you change the scheduling strategy on that processor
to "On Primary Node", then the only system that will try to pick up data
from that mount point is the server you have designated as the "primary node".
This should fix that issue.
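
If you want to see the failure outside of NiFi, a few lines of plain Java
reproduce it (untested sketch; the two calls stand in for two cluster nodes
competing for one file on the share):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the race: "node A" renames the file away while "node B"
// still holds the old path, so node B's read fails with
// java.nio.file.NoSuchFileException -- the same error GetFile reports.
public class PickupRace {

    public static void main(final String[] args) throws IOException {
        final Path file = Files.createTempFile("log", ".tmp"); // stand-in for the shared file
        final Path moved = Paths.get(file.toString() + ".claimed");

        Files.move(file, moved);   // node A claims the file first
        Files.readAllBytes(file);  // node B reads the stale path -> NoSuchFileException
    }
}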

On Mon, May 4, 2015 at 11:30 PM, Sethuram, Anup <an...@philips.com>
wrote:

> Yes Corey, Right now the pickup directory is from a network share mount
> point. The data is picked up from one location and transferred to the
> other. I'm using site-to-site communication.
>
> -----Original Message-----
> From: Corey Flowers [mailto:cflowers@onyxpoint.com]
> Sent: Monday, May 04, 2015 7:57 PM
> To: dev@nifi.incubator.apache.org
> Subject: Re: Fetch change list
>
> Good morning Anup!
>
>          Is the pickup directory coming from a network share mount point?
>
> On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup <anup.sethuram@philips.com
> >
> wrote:
>
> > Hi ,
> >                 I'm trying to fetch a set of files which have recently
> > changed in a "filesystem". Also I'm supposed to keep the original copy
> > as it is.
> > For obtaining the latest files that have changed, I'm using a PutFile
> > with "replace" strategy piped to a GetFile with a minimum age of 5
> > sec,  max file age of 30 sec, Keep source file as true,
> >
> > Also, running it in clustered mode. I'm seeing the below issues
> >
> > -          The queue starts growing if there's an error.
> >
> > -          Continuous errors with 'NoSuchFileException'
> >
> > -          Penalizing StandardFlowFileErrors
> >
> >
> >
> >
> > ERROR
> >
> > 0ab3b920-1f05-4f24-b861-4fded3d5d826
> >
> > 161.91.234.248:7087
> >
> > GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve
> > files due to
> > org.apache.nifi.processor.exception.FlowFileAccessException: Failed to
> > import data from /nifi/UNZ/log201403230000.log for
> > StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,claim
> > =,offset=0,name=6908587554457536,size=0]
> > due to java.nio.file.NoSuchFileException:
> > /nifi/UNZ/log201403230000.log
> >
> > 18:45:56 IST
> >
> >
> >
> > 10:54:50 IST
> >
> > ERROR
> >
> > c552b5bc-f627-3cc3-b3d0-545c519eafd9
> >
> > 161.91.234.248:6087
> >
> > PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
> > StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,claim
> > =1430717088883-73580,offset=0,name=file1.log,size=29314779]
> > and transferring to failure due to
> > org.apache.nifi.processor.exception.ProcessException: Could not rename
> > /nifi/UNZ/.file1.log:
> org.apache.nifi.processor.exception.ProcessException:
> > Could not rename: /nifi/UNZ/.file1.log
> >
> > 10:54:56 IST
> >
> > ERROR
> >
> > 60662bb3-490a-3b47-9371-e11c12cdfa1a
> >
> > 161.91.234.248:7087
> >
> > PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
> > StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,claim
> > =1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
> > and transferring to failure due to
> > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > /data/softwares/RS/nifi/OUT/.file2.log:
> > org.apache.nifi.processor.exception.ProcessException: Could not rename:
> > /nifi/OUT/.file2.log
> >
> >
> >
> > Do I have to tweak the Run schedule or keep the same minimum file age
> > and maximum file age to overcome this issue?
> > What might be an elegant solution in NiFi?
> >
> >
> > Thanks,
> > anup
> >
> >
>
>
>
> --
> Corey Flowers
> Vice President, Onyx Point, Inc
> (410) 541-6699
> cflowers@onyxpoint.com
>
> -- This account not approved for unencrypted proprietary information --
>
>



-- 
Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699
cflowers@onyxpoint.com

-- This account not approved for unencrypted proprietary information --

RE: Fetch change list

Posted by "Sethuram, Anup" <an...@philips.com>.
Yes, Corey. Right now the pickup directory is on a network share mount point. The data is picked up from one location and transferred to the other. I'm using site-to-site communication.

-----Original Message-----
From: Corey Flowers [mailto:cflowers@onyxpoint.com]
Sent: Monday, May 04, 2015 7:57 PM
To: dev@nifi.incubator.apache.org
Subject: Re: Fetch change list

Good morning Anup!

         Is the pickup directory coming from a network share mount point?

On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup <an...@philips.com>
wrote:

> Hi ,
>                 I'm trying to fetch a set of files which have recently
> changed in a "filesystem". Also I'm supposed to keep the original copy
> as it is.
> For obtaining the latest files that have changed, I'm using a PutFile
> with "replace" strategy piped to a GetFile with a minimum age of 5
> sec,  max file age of 30 sec, Keep source file as true,
>
> Also, running it in clustered mode. I'm seeing the below issues
>
> -          The queue starts growing if there's an error.
>
> -          Continuous errors with 'NoSuchFileException'
>
> -          Penalizing StandardFlowFileErrors
>
>
>
>
> ERROR
>
> 0ab3b920-1f05-4f24-b861-4fded3d5d826
>
> 161.91.234.248:7087
>
> GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve
> files due to
> org.apache.nifi.processor.exception.FlowFileAccessException: Failed to
> import data from /nifi/UNZ/log201403230000.log for
> StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,claim
> =,offset=0,name=6908587554457536,size=0]
> due to java.nio.file.NoSuchFileException:
> /nifi/UNZ/log201403230000.log
>
> 18:45:56 IST
>
>
>
> 10:54:50 IST
>
> ERROR
>
> c552b5bc-f627-3cc3-b3d0-545c519eafd9
>
> 161.91.234.248:6087
>
> PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
> StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,claim
> =1430717088883-73580,offset=0,name=file1.log,size=29314779]
> and transferring to failure due to
> org.apache.nifi.processor.exception.ProcessException: Could not rename
> /nifi/UNZ/.file1.log: org.apache.nifi.processor.exception.ProcessException:
> Could not rename: /nifi/UNZ/.file1.log
>
> 10:54:56 IST
>
> ERROR
>
> 60662bb3-490a-3b47-9371-e11c12cdfa1a
>
> 161.91.234.248:7087
>
> PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
> StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,claim
> =1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
> and transferring to failure due to
> org.apache.nifi.processor.exception.ProcessException: Could not rename:
> /data/softwares/RS/nifi/OUT/.file2.log:
> org.apache.nifi.processor.exception.ProcessException: Could not rename:
> /nifi/OUT/.file2.log
>
>
>
> Do I have to tweak the Run schedule or keep the same minimum file age
> and maximum file age to overcome this issue?
> What might be an elegant solution in NiFi?
>
>
> Thanks,
> anup
>
>



--
Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699
cflowers@onyxpoint.com

-- This account not approved for unencrypted proprietary information --


Re: Fetch change list

Posted by Corey Flowers <cf...@onyxpoint.com>.
Good morning Anup!

         Is the pickup directory coming from a network share mount point?

On Mon, May 4, 2015 at 10:11 AM, Sethuram, Anup <an...@philips.com>
wrote:

> Hi ,
>                 I'm trying to fetch a set of files which have recently
> changed in a "filesystem". Also I'm supposed to keep the original copy as
> it is.
> For obtaining the latest files that have changed, I'm using a PutFile with
> "replace" strategy piped to a GetFile with a minimum age of 5 sec,  max
> file age of 30 sec, Keep source file as true,
>
> Also, running it in clustered mode. I'm seeing the below issues
>
> -          The queue starts growing if there's an error.
>
> -          Continuous errors with 'NoSuchFileException'
>
> -          Penalizing StandardFlowFileErrors
>
>
>
>
> ERROR
>
> 0ab3b920-1f05-4f24-b861-4fded3d5d826
>
> 161.91.234.248:7087
>
> GetFile[id=0ab3b920-1f05-4f24-b861-4fded3d5d826] Failed to retrieve files
> due to org.apache.nifi.processor.exception.FlowFileAccessException: Failed
> to import data from /nifi/UNZ/log201403230000.log for
> StandardFlowFileRecord[uuid=f29bda59-8611-427c-b4d7-c921ee5e74b8,claim=,offset=0,name=6908587554457536,size=0]
> due to java.nio.file.NoSuchFileException: /nifi/UNZ/log201403230000.log
>
> 18:45:56 IST
>
>
>
> 10:54:50 IST
>
> ERROR
>
> c552b5bc-f627-3cc3-b3d0-545c519eafd9
>
> 161.91.234.248:6087
>
> PutFile[id=c552b5bc-f627-3cc3-b3d0-545c519eafd9] Penalizing
> StandardFlowFileRecord[uuid=876e51f7-9a3d-4bf9-9d11-9073a5c950ad,claim=1430717088883-73580,offset=0,name=file1.log,size=29314779]
> and transferring to failure due to
> org.apache.nifi.processor.exception.ProcessException: Could not rename
> /nifi/UNZ/.file1.log: org.apache.nifi.processor.exception.ProcessException:
> Could not rename: /nifi/UNZ/.file1.log
>
> 10:54:56 IST
>
> ERROR
>
> 60662bb3-490a-3b47-9371-e11c12cdfa1a
>
> 161.91.234.248:7087
>
> PutFile[id=60662bb3-490a-3b47-9371-e11c12cdfa1a] Penalizing
> StandardFlowFileRecord[uuid=522a2401-8269-4f0f-aff5-152d25cdcefa,claim=1430717094668-73059,offset=1533296,name=file2.log,size=28014262]
> and transferring to failure due to
> org.apache.nifi.processor.exception.ProcessException: Could not rename:
> /data/softwares/RS/nifi/OUT/.file2.log:
> org.apache.nifi.processor.exception.ProcessException: Could not rename:
> /nifi/OUT/.file2.log
>
>
>
> Do I have to tweak the Run schedule or keep the same minimum file age and
> maximum file age to overcome this issue?
> What might be an elegant solution in NiFi?
>
>
> Thanks,
> anup
>
>



-- 
Corey Flowers
Vice President, Onyx Point, Inc
(410) 541-6699
cflowers@onyxpoint.com

-- This account not approved for unencrypted proprietary information --