Posted to dev@nifi.apache.org by Michael Moser <mo...@gmail.com> on 2016/04/27 17:31:26 UTC

NiFi 0.5.1 "too many open files"

Devs,

We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
half of our cluster nodes getting "too many open files" errors that require
a NiFi restart, while the other half works without this problem.  Using
'lsof -p <pid>' to identify the open file descriptors at the time of the
problem, we see most of the file descriptors reference deleted files in the
content repository like this:

java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)

An 'ls /content_repository/81/123456789-123456' confirms that the file has
been deleted.
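
[Editorial note: a minimal Java sketch of the same check, assuming a
Linux-style /proc filesystem, where each entry under /proc/<pid>/fd is a
symlink whose target gains a " (deleted)" suffix once the file is unlinked.
The class name and argument handling are illustrative; it must run as the
nifi user or root, and 'lsof -p <pid>' itself remains the simpler tool.]

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeletedFdCheck {
    public static void main(String[] args) throws IOException {
        // Rough equivalent of 'lsof -p <pid>' filtered to deleted files:
        // walk /proc/<pid>/fd and report descriptors whose symlink target
        // has been unlinked from the filesystem.
        Path fdDir = Paths.get("/proc", args[0], "fd");
        int leaked = 0;
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                Path target = Files.readSymbolicLink(fd);
                if (target.toString().endsWith(" (deleted)")) {
                    System.out.println(fd.getFileName() + " -> " + target);
                    leaked++;
                }
            }
        }
        System.out.println(leaked + " descriptors reference deleted files");
    }
}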

We are continuing our investigation into why some of our nodes have a
problem but others don't.  Has anyone else seen this?  Did anything change
between 0.4.1 and 0.5.1 related to deleting files from the content
repository?

Regards,
-- Mike

Re: NiFi 0.5.1 "too many open files"

Posted by Michael Moser <mo...@gmail.com>.
That's great, Mark.  Thanks for looking into it.  No, I haven't been able to
trace through the logs to identify the cause of the
ContentNotFoundException yet.  If I find anything, I will share it.

-- Mike


On Thu, Apr 28, 2016 at 11:16 AM, Mark Payne <ma...@hotmail.com> wrote:

> Mike,
>
> I am looking into this, and I was able to figure out how we could
> potentially archive (and eventually
> age off from the archive) a piece of data for which there is still an open
> file handle. Specifically,
> I am able to understand how this could happen only when we have a problem
> reading a FlowFile
> (such as ContentNotFoundException) during a call to ProcessSession.write()
> with a StreamCallback.
>
> So exactly the situation that you laid out here. I will be looking into
> how to address this.
>
> Do you have any idea what may have caused your ContentNotFoundException to
> occur in the first
> place?
>
> Thanks
> -Mark
>
>
> > On Apr 27, 2016, at 5:24 PM, Michael Moser <mo...@gmail.com> wrote:
> >
> > I found something in the logs on the nodes where I had a problem.  A
> > ContentNotFoundException begins occurring on these nodes and after many
> > thousands of times we eventually get "too many open files".  Once I do
> > surgery on the content repository so that ContentNotFoundException stops
> > happening, then lsof tells me that I'm not 'leaking' any more (deleted)
> > file descriptors.  I have no idea why the ContentNotFoundException starts
> > happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
> > back in March.
> >
> > Here's a stack trace of the ContentNotFoundException
> >
> > 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> > MyCustomProcessor
> > org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> > find content for FlowFile
> >  at
> >
> o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
> >  at
> >
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
> >  at MyCustomProcessor.onTrigger()
> >  at
> o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> > Caused by: o.a.n.controller.repository.ContentNotFoundException: Could
> not
> > find content for StandardContentClaim
> > [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> > section=1], offset=351144, length=30720]
> >  at
> >
> o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
> >  at
> >
> o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
> >  at
> >
> o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
> >  at
> >
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
> >  ... common frames omitted
> >
> > MyCustomProcessor just calls session.write() with a StreamCallback
> > implementation that reads from an InputStream and writes to an
> OutputStream.
> >
> > [1] -
> >
> http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> >
> > -- Mike
> >
> >
> >
> >
> > On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <jo...@gmail.com> wrote:
> >
> >> Mike
> >>
> >> Ok that is a good data point.  In my case they are all in archive but
> >> I do agree that isn't super meaningful because in reality nothing
> >> should ever be open for writing in the archive.
> >>
> >> If you can and have enough logging on try searching for that first
> >> part of the filename in your logs.  Let's see if we can narrow this to
> >> a particular type of flow or something.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <mo...@gmail.com>
> >> wrote:
> >>> Another data point ... we had archiving turned on at first, and then
> most
> >>> (but not all) files that lsof reported were
> >>>
> >>> /content_repository/0/archive/123456789-123456 (deleted).
> >>>
> >>> We turned archiving off, hoping that was related in some way, but it
> was
> >>> not.
> >>>
> >>> -- Mike
> >>>
> >>>
> >>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <jo...@gmail.com> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> Definitely does not sound familiar.  However, just looked up what you
> >>>> describe and I do see it.  In my case there are only three files but
> >>>> they are sitting there open for writing by the nifi process and yet
> >>>> have been deleted.  So I do believe there is an issue...will dig in a
> >>>> bit but obviously if you make more discoveries here please share.
> >>>>
> >>>> Thanks
> >>>> Joe
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com>
> >>>> wrote:
> >>>>> Devs,
> >>>>>
> >>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We
> >> noticed
> >>>>> half of our cluster nodes getting "too many open files" errors that
> >>>> require
> >>>>> a NiFi restart, while the other half works without this problem.
> >> Using
> >>>>> 'lsof -p <pid>' to identify the open file descriptors at the time of
> >> the
> >>>>> problem, we see most of the file descriptors reference deleted files
> >> in
> >>>> the
> >>>>> content repository like this:
> >>>>>
> >>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456
> >>>> (deleted)
> >>>>>
> >>>>> A 'ls /content_repository/81/123456789-123456' confirms that the file
> >> has
> >>>>> been deleted.
> >>>>>
> >>>>> We are continuing our investigation into why some of our nodes have a
> >>>>> problem but others don't.  Has anyone else seen this?  Did anything
> >>>> change
> >>>>> between 0.4.1 and 0.5.1 related to deleting files from the content
> >>>>> repository?
> >>>>>
> >>>>> Regards,
> >>>>> -- Mike
> >>>>
> >>
>
>

Re: NiFi 0.5.1 "too many open files"

Posted by Mark Payne <ma...@hotmail.com>.
Mike,

I am looking into this, and I was able to figure out how we could potentially archive (and eventually
age off from the archive) a piece of data for which there is still an open file handle. Specifically,
I am able to understand how this could happen only when we have a problem reading a FlowFile
(such as ContentNotFoundException) during a call to ProcessSession.write() with a StreamCallback.

So exactly the situation that you laid out here. I will be looking into how to address this.
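
[Editorial note: a minimal sketch of this failure mode in plain Java,
assuming the write path opens the destination before the source is read.
This is illustrative only, not NiFi's actual repository code.]

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class HandleLeakSketch {

    // Leaky pattern: the destination handle is opened first, and the copy
    // can throw while reading the source (analogous to a
    // ContentNotFoundException inside ProcessSession.write()). The close()
    // is then skipped, so if the file is later archived and aged off
    // (unlinked), the process still holds an open descriptor that lsof
    // shows as "(deleted)".
    static void copyLeaky(InputStream source, Path destination) throws IOException {
        OutputStream out = Files.newOutputStream(destination);
        byte[] buffer = new byte[8192];
        int len;
        while ((len = source.read(buffer)) != -1) {  // may throw mid-copy
            out.write(buffer, 0, len);
        }
        out.close();  // never reached on exception -> leaked descriptor
    }

    // Safe pattern: try-with-resources closes the handle on every path.
    static void copySafe(InputStream source, Path destination) throws IOException {
        try (OutputStream out = Files.newOutputStream(destination)) {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = source.read(buffer)) != -1) {
                out.write(buffer, 0, len);
            }
        }
    }
}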

Do you have any idea what may have caused your ContentNotFoundException to occur in the first
place?

Thanks
-Mark


> On Apr 27, 2016, at 5:24 PM, Michael Moser <mo...@gmail.com> wrote:
> 
> I found something in the logs on the nodes where I had a problem.  A
> ContentNotFoundException begins occurring on these nodes and after many
> thousands of times we eventually get "too many open files".  Once I do
> surgery on the content repository so that ContentNotFoundException stops
> happening, then lsof tells me that I'm not 'leaking' any more (deleted)
> file descriptors.  I have no idea why the ContentNotFoundException starts
> happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
> back in March.
> 
> Here's a stack trace of the ContentNotFoundException
> 
> 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> MyCustomProcessor
> org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> find content for FlowFile
>  at
> o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
>  at MyCustomProcessor.onTrigger()
>  at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not
> find content for StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> section=1], offset=351144, length=30720]
>  at
> o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
>  at
> o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
>  at
> o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
>  ... common frames omitted
> 
> MyCustomProcessor just calls session.write() with a StreamCallback
> implementation that reads from an InputStream and writes to an OutputStream.
> 
> [1] -
> http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> 
> -- Mike
> 
> 
> 
> 
> On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <jo...@gmail.com> wrote:
> 
>> Mike
>> 
>> Ok that is a good data point.  In my case they are all in archive but
>> I do agree that isn't super meaningful because in reality nothing
>> should ever be open for writing in the archive.
>> 
>> If you can and have enough logging on try searching for that first
>> part of the filename in your logs.  Let's see if we can narrow this to
>> a particular type of flow or something.
>> 
>> Thanks
>> Joe
>> 
>> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <mo...@gmail.com>
>> wrote:
>>> Another data point ... we had archiving turned on at first, and then most
>>> (but not all) files that lsof reported were
>>> 
>>> /content_repository/0/archive/123456789-123456 (deleted).
>>> 
>>> We turned archiving off, hoping that was related in some way, but it was
>>> not.
>>> 
>>> -- Mike
>>> 
>>> 
>>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <jo...@gmail.com> wrote:
>>> 
>>>> Mike,
>>>> 
>>>> Definitely does not sound familiar.  However, just looked up what you
>>>> describe and I do see it.  In my case there are only three files but
>>>> they are sitting there open for writing by the nifi process and yet
>>>> have been deleted.  So I do believe there is an issue...will dig in a
>>>> bit but obviously if you make more discoveries here please share.
>>>> 
>>>> Thanks
>>>> Joe
>>>> 
>>>> 
>>>> 
>>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com>
>>>> wrote:
>>>>> Devs,
>>>>> 
>>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We
>> noticed
>>>>> half of our cluster nodes getting "too many open files" errors that
>>>> require
>>>>> a NiFi restart, while the other half works without this problem.
>> Using
>>>>> 'lsof -p <pid>' to identify the open file descriptors at the time of
>> the
>>>>> problem, we see most of the file descriptors reference deleted files
>> in
>>>> the
>>>>> content repository like this:
>>>>> 
>>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456
>>>> (deleted)
>>>>> 
>>>>> A 'ls /content_repository/81/123456789-123456' confirms that the file
>> has
>>>>> been deleted.
>>>>> 
>>>>> We are continuing our investigation into why some of our nodes have a
>>>>> problem but others don't.  Has anyone else seen this?  Did anything
>>>> change
>>>>> between 0.4.1 and 0.5.1 related to deleting files from the content
>>>>> repository?
>>>>> 
>>>>> Regards,
>>>>> -- Mike
>>>> 
>> 


Re: NiFi 0.5.1 "too many open files"

Posted by Michael Moser <mo...@gmail.com>.
I found something in the logs on the nodes where I had a problem.  A
ContentNotFoundException begins occurring on these nodes and after many
thousands of times we eventually get "too many open files".  Once I do
surgery on the content repository so that ContentNotFoundException stops
happening, then lsof tells me that I'm not 'leaking' any more (deleted)
file descriptors.  I have no idea why the ContentNotFoundException starts
happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
back in March.

Here's a stack trace of the ContentNotFoundException:

2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
MyCustomProcessor
org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
find content for FlowFile
  at
o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
  at
o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
  at MyCustomProcessor.onTrigger()
  at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not
find content for StandardContentClaim
[resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
section=1], offset=351144, length=30720]
  at
o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
  at
o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
  at
o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
  at
o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
  ... common frames omitted

MyCustomProcessor just calls session.write() with a StreamCallback
implementation that reads from an InputStream and writes to an OutputStream.
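
[Editorial note: for context, a minimal sketch of that pattern against the
public NiFi processor API. The processor and relationship names are
illustrative, and the callback is a straight pass-through copy rather than
Mike's actual logic.]

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class MyCustomProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // session.write() with a StreamCallback supplies the FlowFile's
        // current content as an InputStream and a new content claim as an
        // OutputStream. If the existing content cannot be located, the read
        // side fails with ContentNotFoundException, which surfaces as the
        // MissingFlowFileException in the stack trace above.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
            }
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}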

[1] -
http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html

-- Mike




On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <jo...@gmail.com> wrote:

> Mike
>
> Ok that is a good data point.  In my case they are all in archive but
> I do agree that isn't super meaningful because in reality nothing
> should ever be open for writing in the archive.
>
> If you can and have enough logging on try searching for that first
> part of the filename in your logs.  Let's see if we can narrow this to
> a particular type of flow or something.
>
> Thanks
> Joe
>
> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <mo...@gmail.com>
> wrote:
> > Another data point ... we had archiving turned on at first, and then most
> > (but not all) files that lsof reported were
> >
> > /content_repository/0/archive/123456789-123456 (deleted).
> >
> > We turned archiving off, hoping that was related in some way, but it was
> > not.
> >
> > -- Mike
> >
> >
> > On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <jo...@gmail.com> wrote:
> >
> >> Mike,
> >>
> >> Definitely does not sound familiar.  However, just looked up what you
> >> describe and I do see it.  In my case there are only three files but
> >> they are sitting there open for writing by the nifi process and yet
> >> have been deleted.  So I do believe there is an issue...will dig in a
> >> bit but obviously if you make more discoveries here please share.
> >>
> >> Thanks
> >> Joe
> >>
> >>
> >>
> >> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com>
> >> wrote:
> >> > Devs,
> >> >
> >> > We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We
> noticed
> >> > half of our cluster nodes getting "too many open files" errors that
> >> require
> >> > a NiFi restart, while the other half works without this problem.
> Using
> >> > 'lsof -p <pid>' to identify the open file descriptors at the time of
> the
> >> > problem, we see most of the file descriptors reference deleted files
> in
> >> the
> >> > content repository like this:
> >> >
> >> > java <pid> <user> <fd> ... /content_repository/81/123456789-123456
> >> (deleted)
> >> >
> >> > A 'ls /content_repository/81/123456789-123456' confirms that the file
> has
> >> > been deleted.
> >> >
> >> > We are continuing our investigation into why some of our nodes have a
> >> > problem but others don't.  Has anyone else seen this?  Did anything
> >> change
> >> > between 0.4.1 and 0.5.1 related to deleting files from the content
> >> > repository?
> >> >
> >> > Regards,
> >> > -- Mike
> >>
>

Re: NiFi 0.5.1 "too many open files"

Posted by Joe Witt <jo...@gmail.com>.
Mike

Ok that is a good data point.  In my case they are all in archive but
I do agree that isn't super meaningful because in reality nothing
should ever be open for writing in the archive.

If you can, and you have enough logging on, try searching for the first
part of the filename in your logs.  Let's see if we can narrow this down
to a particular type of flow or something.

Thanks
Joe

On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <mo...@gmail.com> wrote:
> Another data point ... we had archiving turned on at first, and then most
> (but not all) files that lsof reported were
>
> /content_repository/0/archive/123456789-123456 (deleted).
>
> We turned archiving off, hoping that was related in some way, but it was
> not.
>
> -- Mike
>
>
> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <jo...@gmail.com> wrote:
>
>> Mike,
>>
>> Definitely does not sound familiar.  However, just looked up what you
>> describe and I do see it.  In my case there are only three files but
>> they are sitting there open for writing by the nifi process and yet
>> have been deleted.  So I do believe there is an issue...will dig in a
>> bit but obviously if you make more discoveries here please share.
>>
>> Thanks
>> Joe
>>
>>
>>
>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com>
>> wrote:
>> > Devs,
>> >
>> > We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
>> > half of our cluster nodes getting "too many open files" errors that
>> require
>> > a NiFi restart, while the other half works without this problem.  Using
>> > 'lsof -p <pid>' to identify the open file descriptors at the time of the
>> > problem, we see most of the file descriptors reference deleted files in
>> the
>> > content repository like this:
>> >
>> > java <pid> <user> <fd> ... /content_repository/81/123456789-123456
>> (deleted)
>> >
>> > A 'ls /content_repository/81/123456789-123456' confirms that the file has
>> > been deleted.
>> >
>> > We are continuing our investigation into why some of our nodes have a
>> > problem but others don't.  Has anyone else seen this?  Did anything
>> change
>> > between 0.4.1 and 0.5.1 related to deleting files from the content
>> > repository?
>> >
>> > Regards,
>> > -- Mike
>>

Re: NiFi 0.5.1 "too many open files"

Posted by Michael Moser <mo...@gmail.com>.
Another data point ... we had archiving turned on at first, and then most
(but not all) files that lsof reported were

/content_repository/0/archive/123456789-123456 (deleted).

We turned archiving off, hoping that was related in some way, but it was
not.

-- Mike


On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <jo...@gmail.com> wrote:

> Mike,
>
> Definitely does not sound familiar.  However, just looked up what you
> describe and I do see it.  In my case there are only three files but
> they are sitting there open for writing by the nifi process and yet
> have been deleted.  So I do believe there is an issue...will dig in a
> bit but obviously if you make more discoveries here please share.
>
> Thanks
> Joe
>
>
>
> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com>
> wrote:
> > Devs,
> >
> > We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
> > half of our cluster nodes getting "too many open files" errors that
> require
> > a NiFi restart, while the other half works without this problem.  Using
> > 'lsof -p <pid>' to identify the open file descriptors at the time of the
> > problem, we see most of the file descriptors reference deleted files in
> the
> > content repository like this:
> >
> > java <pid> <user> <fd> ... /content_repository/81/123456789-123456
> (deleted)
> >
> > A 'ls /content_repository/81/123456789-123456' confirms that the file has
> > been deleted.
> >
> > We are continuing our investigation into why some of our nodes have a
> > problem but others don't.  Has anyone else seen this?  Did anything
> change
> > between 0.4.1 and 0.5.1 related to deleting files from the content
> > repository?
> >
> > Regards,
> > -- Mike
>

Re: NiFi 0.5.1 "too many open files"

Posted by Joe Witt <jo...@gmail.com>.
Mike,

Definitely does not sound familiar.  However, I just looked up what you
describe and I do see it.  In my case there are only three files, but
they are sitting there open for writing by the nifi process and yet
have been deleted.  So I do believe there is an issue... I will dig in a
bit, but obviously if you make more discoveries here, please share.

Thanks
Joe



On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <mo...@gmail.com> wrote:
> Devs,
>
> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
> half of our cluster nodes getting "too many open files" errors that require
> a NiFi restart, while the other half works without this problem.  Using
> 'lsof -p <pid>' to identify the open file descriptors at the time of the
> problem, we see most of the file descriptors reference deleted files in the
> content repository like this:
>
> java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
>
> A 'ls /content_repository/81/123456789-123456' confirms that the file has
> been deleted.
>
> We are continuing our investigation into why some of our nodes have a
> problem but others don't.  Has anyone else seen this?  Did anything change
> between 0.4.1 and 0.5.1 related to deleting files from the content
> repository?
>
> Regards,
> -- Mike