Posted to user@manifoldcf.apache.org by Erlend Garåsen <e....@usit.uio.no> on 2014/09/16 10:48:18 UTC

Zookeeper configured MCF not working in production mode

I'm not able to run MCF 1.7 properly with Zookeeper-based 
synchronization. After some hours, it just stops fetching documents. 
Until now I have been using FileLockManager to get around this problem.

A thread dump and my manifoldcf.log file can be found here:
http://folk.uio.no/erlendfg/manifoldcf/

Erlend

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Setting processID=A matters mainly because the process ID is what
distinguishes the information that needs to be cleaned up for each instance
of the agents process.  So if you have only one agents process, leaving it
out is OK.
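
A rough sketch of how the process ID could be passed in a startup script like
the one quoted below. The property name -Dorg.apache.manifoldcf.processid is
the one mentioned in this thread; the default path and the agent_command
helper are made up for illustration, and the sketch assumes that
executecommand.sh hands its arguments to the java command line unchanged,
which is worth checking in your copy of the script.

```shell
#!/bin/sh
# Hypothetical default for illustration; use the path from your own setup.
MCF_HOME=${MCF_HOME:-/path/to/mcf}

# agent_command is an illustrative helper (not part of MCF): it prints the
# command line for one agents process with the given process ID.
# Assumption: executecommand.sh forwards its arguments to java unchanged,
# so the -D option ends up ahead of the main class.
agent_command() {
    printf '%s %s %s\n' \
        "$MCF_HOME/processes/executecommand.sh" \
        "-Dorg.apache.manifoldcf.processid=$1" \
        "org.apache.manifoldcf.agents.AgentRun"
}

# A single agents process can use "A"; any additional agents process
# would need a different ID so that its state can be told apart.
agent_command A
```

The printed command can then be launched in the background with the same
redirections the existing script already uses.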

Karl


On Tue, Sep 16, 2014 at 5:06 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

> On 16.09.14 10:53, lalit jangra wrote:
>
>> Hi Erlend,
>>
>> Can you please elaborate on how you have configured zookeeper based
>> synchronization, is it in stand alone mode or clustered mode? How many
>> zookeeper nodes are you running for each of node and how many agents are
>> you running?
>>
>
> I'm not very familiar with Zookeeper, so I have just followed the examples
> inside the multiprocess-zk-example folder, i.e.:
> # Starting ZooKeeper:
> $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
> # Reading global properties:
> $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
> # Starting Agent process:
> $MCF_HOME/processes/executecommand.sh org.apache.manifoldcf.agents.AgentRun \
>         1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log &
> # PID of the agents process just started:
> pid=$!
>
> The above lines are from my startup script. I see now that I haven't
> specified "-Dorg.apache.manifoldcf.processid=A". I'm not sure this is
> important, but I can of course add it to my script and restart
> everything.
>
> So to the question about how many zookeeper nodes I'm using, the answer is
> one. The same applies to the number of running agents.
>
> Erlend
>

Re: Zookeeper configured MCF not working in production mode

Posted by Erlend Garåsen <e....@usit.uio.no>.

I guess the issue affects version 1.6.x as well. We had exactly the same 
problem with that version, but unfortunately I have no thread dump from 
that time to investigate.

Erlend

On 17.09.14 12:01, Aeham Abushwashi wrote:
> Thanks for finding and fixing the issue. Could you confirm whether it
> affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
> 1.6.1 shows the same pattern identified in CONNECTORS-1031 -
> https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
>
> On 16 September 2014 22:19, Karl Wright <daddywri@gmail.com> wrote:
>
>     I believe I've fixed the problem for real.  There's a patch attached
>     to the CONNECTORS-1031 ticket, which should be applicable to 1.7.
>     The fix is already checked into the dev_1x branch, as well as trunk
>     (which is MCF 2.0, so don't use that yet).
>
>     I also believe that we're going to need to make a 1.7.1 release that
>     contains this fix, and others of similar importance.
>
>     Karl
>
>
>     On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>         After some research, I found that increasing the zookeeper.cfg
>         tick time count from 2000 to 5000 makes this problem go away for me.
>
>         Clearly we have an issue, still, with resetting zookeeper
>         connections after tick timeout failures.  The connections are
>         reset but the state of the connections is somehow incorrect.
>         I'll need to do more research to figure out how this can be
>         addressed.
>
>         For the interim, increasing the tick time seems to be a
>         reasonable workaround.
>
>         Thanks,
>         Karl
>
>
>         On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>             Believe it or not, I was able to reproduce this here with a
>             crawl of 100000 documents.  I get this in the Zookeeper
>             server-side log, hundreds of times:
>
>              >>>>>>
>             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>             java.nio.channels.CancelledKeyException
>                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>             java.nio.channels.CancelledKeyException
>                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>             <<<<<<
>
>             ... and then everything locks up.  I have no idea what is
>             happening; seems to be an NIO exception ZooKeeper is not
>             expecting.
>
>             Karl
>
>
>             On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen
>             <e.f.garasen@usit.uio.no> wrote:
>
>
>                 Ouch, I forgot to place the Zookeeper logs on the web. Since
>                 they do not include timestamps and I have restarted MCF
>                 after a few changes, I guess it will be difficult to get
>                 the relevant lines. I'll do that next time it hangs,
>                 probably at the end of the day.
>
>                 I will add the new Zookeeper configuration settings as
>                 Lalit suggested next time I'm restarting MCF.
>
>                     How many worker threads are you using?  How many
>                     documents (about) do
>                     you crawl before things hang?
>
>
>                 Throttling -> max connections: 30
>                 Throttling -> Max fetches/min: 100
>                 Bandwidth -> max connections: 25
>                 Bandwidth -> max kbytes/sec: 8000
>                 Bandwidth -> max fetches/min: 20
>
>                 I have four jobs configured. The one I'm running now has
>                 100,000 documents configured. In total, around 110,000
>                 documents for all four jobs.
>
>                 I guess there are more documents involved since the
>                 largest job excludes a lot of documents based on
>                 sophisticated and complex filtering rules. Maybe 50%
>                 more even though they are not added to Solr (but they
>                 are of course fetched).
>
>                 Erlend
>
>
>                     You may also want to try to increase the parameter:
>                     maxClientCnxns in
>                     zookeeper.cfg to something bigger, if you have a lot
>                     of worker threads.
>                     I'm thinking 1000 or some such.  See if it makes a
>                     difference for you.
>
>
>                 I'll try that at next restart.
>
>                 Erlend
>
>
>
>
>

Re: Zookeeper configured MCF not working in production mode

Posted by Aeham Abushwashi <ae...@exonar.com>.

Thanks for finding and fixing the issue. Could you confirm whether it
affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
1.6.1 shows the same pattern identified in CONNECTORS-1031 -
https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978

On 16 September 2014 22:19, Karl Wright <da...@gmail.com> wrote:

> I believe I've fixed the problem for real.  There's a patch attached to
> the CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is
> already checked into the dev_1x branch, as well as trunk (which is MCF 2.0,
> so don't use that yet).
>
> I also believe that we're going to need to make a 1.7.1 release that
> contains this fix, and others of similar importance.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <da...@gmail.com> wrote:
>
>> After some research, I found that increasing the zookeeper.cfg tick time
>> count from 2000 to 5000 makes this problem go away for me.
>>
>> Clearly we have an issue, still, with resetting zookeeper connections
>> after tick timeout failures.  The connections are reset but the state of
>> the connections is somehow incorrect.  I'll need to do more research to
>> figure out how this can be addressed.
>>
>> For the interim, increasing the tick time seems to be a reasonable
>> workaround.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <da...@gmail.com> wrote:
>>
>>> Believe it or not, I was able to reproduce this here with a crawl of
>>> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
>>> times:
>>>
>>> >>>>>>
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> <<<<<<
>>>
>>> ... and then everything locks up.  I have no idea what is happening;
>>> seems to be an NIO exception ZooKeeper is not expecting.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
>>> wrote:
>>>
>>>>
>>>> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
>>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>>> will be difficult to get the relevant lines. I'll do that next time it
>>>> hangs, probably at the end of the day.
>>>>
>>>> I will add the new Zookeeper configuration settings as Lalit suggested
>>>> next time I'm restarting MCF.
>>>>
>>>>> How many worker threads are you using?  How many documents (about) do
>>>>> you crawl before things hang?
>>>>>
>>>>
>>>> Throttling -> max connections: 30
>>>> Throttling -> Max fetches/min: 100
>>>> Bandwidth -> max connections: 25
>>>> Bandwidth -> max kbytes/sec: 8000
>>>> Bandwidth -> max fetches/min: 20
>>>>
>>>> I have four jobs configured. The one I'm running now has 100,000
>>>> documents configured. In total, around 110,000 documents for all four jobs.
>>>>
>>>> I guess there are more documents involved since the largest job
>>>> excludes a lot of documents based on sophisticated and complex filtering
>>>> rules. Maybe 50% more even though they are not added to Solr (but they are
>>>> of course fetched).
>>>>
>>>> Erlend
>>>>
>>>>
>>>>> You may also want to try to increase the parameter: maxClientCnxns in
>>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>>>>
>>>>
>>>> I'll try that at next restart.
>>>>
>>>> Erlend
>>>>
>>>
>>>
>>
>

RE: Zookeeper configured MCF not working in production mode

Posted by Adrian Conlon <Ad...@arup.com>.

Hi Karl,

+1 on a 1.7.1 update to fix this.

Because of various issues with file based synchronisation, we’ve been looking at using zookeeper for synchronisation and have been hitting the CancelledKeyException problem all the time.  Up until this thread, I’d assumed we’d missed something obvious in our zookeeper setup, and hadn’t reported it.  That’ll teach me!

Thanks,

Adrian

p.s.
I *always* feel better when you get one of my problems, Karl…

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: 16 September 2014 22:20
To: user@manifoldcf.apache.org
Subject: Re: Zookeeper configured MCF not working in production mode

I believe I've fixed the problem for real.  There's a patch attached to the CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is already checked into the dev_1x branch, as well as trunk (which is MCF 2.0, so don't use that yet).
I also believe that we're going to need to make a 1.7.1 release that contains this fix, and others of similar importance.

Karl

On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <da...@gmail.com> wrote:
After some research, I found that increasing the zookeeper.cfg tick time count from 2000 to 5000 makes this problem go away for me.
Clearly we have an issue, still, with resetting zookeeper connections after tick timeout failures.  The connections are reset but the state of the connections is somehow incorrect.  I'll need to do more research to figure out how this can be addressed.

For the interim, increasing the tick time seems to be a reasonable workaround.

Thanks,
Karl

On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <da...@gmail.com> wrote:
Believe it or not, I was able to reproduce this here with a crawl of 100000 documents.  I get this in the Zookeeper server-side log, hundreds of times:

>>>>>>
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
<<<<<<
... and then everything locks up.  I have no idea what is happening; seems to be an NIO exception ZooKeeper is not expecting.

Karl

On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <e....@usit.uio.no> wrote:

Ouch, I forgot to place the Zookeeper logs on the web. Since they do not include timestamps and I have restarted MCF after a few changes, I guess it will be difficult to get the relevant lines. I'll do that next time it hangs, probably at the end of the day.

I will add the new Zookeeper configuration settings as Lalit suggested next time I'm restarting MCF.

> How many worker threads are you using?  How many documents (about) do
> you crawl before things hang?

Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20

I have four jobs configured. The one I'm running now has 100,000 documents configured. In total, around 110,000 documents for all four jobs.

I guess there are more documents involved since the largest job excludes a lot of documents based on sophisticated and complex filtering rules. Maybe 50% more even though they are not added to Solr (but they are of course fetched).

Erlend

> You may also want to try to increase the parameter: maxClientCnxns in
> zookeeper.cfg to something bigger, if you have a lot of worker threads.
> I'm thinking 1000 or some such.  See if it makes a difference for you.

I'll try that at next restart.

Erlend




Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

I believe I've fixed the problem for real.  There's a patch attached to the
CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is
already checked into the dev_1x branch, as well as trunk (which is MCF 2.0,
so don't use that yet).

I also believe that we're going to need to make a 1.7.1 release that
contains this fix, and others of similar importance.

Karl


On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <da...@gmail.com> wrote:

> After some research, I found that increasing the zookeeper.cfg tick time
> count from 2000 to 5000 makes this problem go away for me.
>
> Clearly we have an issue, still, with resetting zookeeper connections
> after tick timeout failures.  The connections are reset but the state of
> the connections is somehow incorrect.  I'll need to do more research to
> figure out how this can be addressed.
>
> For the interim, increasing the tick time seems to be a reasonable
> workaround.
>
> Thanks,
> Karl
>
>
> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <da...@gmail.com> wrote:
>
>> Believe it or not, I was able to reproduce this here with a crawl of
>> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
>> times:
>>
>> >>>>>>
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> <<<<<<
>>
>> ... and then everything locks up.  I have no idea what is happening;
>> seems to be an NIO exception ZooKeeper is not expecting.
>>
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <e....@usit.uio.no>
>> wrote:
>>
>>>
>>> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>> will be difficult to get the relevant lines. I'll do that next time it
>>> hangs, probably at the end of the day.
>>>
>>> I will add the new Zookeeper configuration settings as Lalit suggested
>>> next time I'm restarting MCF.
>>>
>>>> How many worker threads are you using?  How many documents (about) do
>>>> you crawl before things hang?
>>>>
>>>
>>> Throttling -> max connections: 30
>>> Throttling -> Max fetches/min: 100
>>> Bandwidth -> max connections: 25
>>> Bandwidth -> max kbytes/sec: 8000
>>> Bandwidth -> max fetches/min: 20
>>>
>>> I have four jobs configured. The one I'm running now has 100,000
>>> documents configured. In total, around 110,000 documents for all four jobs.
>>>
>>> I guess there are more documents involved since the largest job excludes
>>> a lot of documents based on sophisticated and complex filtering rules.
>>> Maybe 50% more even though they are not added to Solr (but they are of
>>> course fetched).
>>>
>>> Erlend
>>>
>>>
>>>> You may also want to try to increase the parameter: maxClientCnxns in
>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>>>
>>>
>>> I'll try that at next restart.
>>>
>>> Erlend
>>>
>>
>>
>

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

After some research, I found that increasing the zookeeper.cfg tick time
(tickTime) from 2000 to 5000 makes this problem go away for me.

Clearly we have an issue, still, with resetting zookeeper connections after
tick timeout failures.  The connections are reset but the state of the
connections is somehow incorrect.  I'll need to do more research to figure
out how this can be addressed.

For the interim, increasing the tick time seems to be a reasonable
workaround.
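
A minimal sketch of applying this workaround from a shell script, assuming
the setting in question is the standard ZooKeeper tickTime property (in
milliseconds) in zookeeper.cfg. The set_zk_property helper and the scratch
file are illustrative only and not part of ZooKeeper or MCF.

```shell
#!/bin/sh
# set_zk_property FILE KEY VALUE
# Illustrative helper: replace an existing KEY=... line, or append one.
set_zk_property() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite it via a temp file (portable, no sed -i).
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" \
            && mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demo on a scratch file; set ZK_CFG to the real zookeeper.cfg to apply it.
ZK_CFG=${ZK_CFG:-$(mktemp)}
[ -s "$ZK_CFG" ] || printf 'tickTime=2000\nclientPort=2181\n' > "$ZK_CFG"

set_zk_property "$ZK_CFG" tickTime 5000
grep '^tickTime=' "$ZK_CFG"
```

ZooKeeper reads this file at startup, so the server has to be restarted for
the new value to take effect.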

Thanks,
Karl


On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <da...@gmail.com> wrote:

> Believe it or not, I was able to reproduce this here with a crawl of
> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
> times:
>
> >>>>>>
> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> <<<<<<
>
> ... and then everything locks up.  I have no idea what is happening; seems
> to be an NIO exception ZooKeeper is not expecting.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <e....@usit.uio.no>
> wrote:
>
>>
>> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
>> include timestamps and I have restarted MCF after a few changes, I guess it
>> will be difficult to get the relevant lines. I'll do that next time it
>> hangs, probably at the end of the day.
>>
>> I will add the new Zookeeper configuration settings as Lalit suggested
>> next time I'm restarting MCF.
>>
>>> How many worker threads are you using?  How many documents (about) do
>>> you crawl before things hang?
>>>
>>
>> Throttling -> max connections: 30
>> Throttling -> Max fetches/min: 100
>> Bandwidth -> max connections: 25
>> Bandwidth -> max kbytes/sec: 8000
>> Bandwidth -> max fetches/min: 20
>>
>> I have four jobs configured. The one I'm running now has 100,000
>> documents configured. In total, around 110,000 documents for all four jobs.
>>
>> I guess there are more documents involved since the largest job excludes
>> a lot of documents based on sophisticated and complex filtering rules.
>> Maybe 50% more even though they are not added to Solr (but they are of
>> course fetched).
>>
>> Erlend
>>
>>
>>> You may also want to try to increase the parameter: maxClientCnxns in
>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>>
>>
>> I'll try that at next restart.
>>
>> Erlend
>>
>
>

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Believe it or not, I was able to reproduce this here with a crawl of 100000
documents.  I get this in the Zookeeper server-side log, hundreds of times:

>>>>>>
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
<<<<<<

... and then everything locks up.  I have no idea what is happening; seems
to be an NIO exception ZooKeeper is not expecting.

Karl


On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

>
> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
> include timestamps and I have restarted MCF after a few changes, I guess it
> will be difficult to get the relevant lines. I'll do that next time it
> hangs, probably at the end of the day.
>
> I will add the new Zookeeper configuration settings as Lalit suggested
> next time I'm restarting MCF.
>
>> How many worker threads are you using?  How many documents (about) do
>> you crawl before things hang?
>>
>
> Throttling -> max connections: 30
> Throttling -> Max fetches/min: 100
> Bandwidth -> max connections: 25
> Bandwidth -> max kbytes/sec: 8000
> Bandwidth -> max fetches/min: 20
>
> I have four jobs configured. The one I'm running now has 100,000 documents
> configured. In total, around 110,000 documents for all four jobs.
>
> I guess there are more documents involved since the largest job excludes a
> lot of documents based on sophisticated and complex filtering rules. Maybe
> 50% more even though they are not added to Solr (but they are of course
> fetched).
>
> Erlend
>
>
>> You may also want to try to increase the parameter: maxClientCnxns in
>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>
>
> I'll try that at next restart.
>
> Erlend
>

Re: Zookeeper configured MCF not working in production mode

Posted by Erlend Garåsen <e....@usit.uio.no>.

Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
include timestamps and I have restarted MCF after a few changes, I guess
it will be difficult to get the relevant lines. I'll do that next time
it hangs, probably at the end of the day.

I will add the new Zookeeper configuration settings as Lalit suggested 
next time I'm restarting MCF.

> How many worker threads are you using?  How many documents (about) do
> you crawl before things hang?

Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20

I have four jobs configured. The one I'm running now has 100,000 
documents configured. In total, around 110,000 documents for all four jobs.

I guess there are more documents involved since the largest job excludes 
a lot of documents based on sophisticated and complex filtering rules. 
Maybe 50% more even though they are not added to Solr (but they are of 
course fetched).

Erlend

>
> You may also want to try to increase the parameter: maxClientCnxns in
> zookeeper.cfg to something bigger, if you have a lot of worker threads.
> I'm thinking 1000 or some such.  See if it makes a difference for you.

I'll try that at the next restart.

Erlend

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Hi Erlend,

How many worker threads are you using?  Roughly how many documents do you
crawl before things hang?

You may also want to try increasing the maxClientCnxns parameter in
zookeeper.cfg if you have a lot of worker threads.  I'm thinking 1000 or
some such.  See if it makes a difference for you.
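
A minimal sketch of that change, assuming zookeeper.cfg is a plain
key=value file. The set_cfg helper and the ZK_CFG path are illustrative
names, not part of MCF or ZooKeeper:

```shell
#!/bin/sh
# Hypothetical helper (not part of MCF or ZooKeeper): set KEY=VALUE in a
# key=value config file, replacing the existing line or appending a new one.
set_cfg() {
  file=$1; key=$2; value=$3
  if grep -q "^${key}=" "$file" 2>/dev/null; then
    # Key already present: rewrite that line via a temporary file.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  else
    # Key absent (or file missing): append it.
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Assumed location; use the zookeeper.cfg that your runzookeeper.sh reads.
ZK_CFG=${ZK_CFG:-zookeeper.cfg}
set_cfg "$ZK_CFG" maxClientCnxns 1000
```

ZooKeeper reads this file at startup, so the new limit only applies
after the ZooKeeper server has been restarted. The limit is counted per
client IP address, which is why a single agents process with many
worker threads can reach it.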

I'll try a large crawl here using Zookeeper also, but it would be good to
know your parameters before I begin.

Karl


On Tue, Sep 16, 2014 at 7:21 AM, lalit jangra <la...@gmail.com>
wrote:

> Hello,
>
> To restrain zookeeper from taking too much disk space, use below
> parameters. These will help to purge extra data one may not need.
>
> autopurge.snapRetainCount=3 : default value
> autopurge.purgeInterval=1: default value
>
> Feel free to update as per needs.
>
> Regards.
>
> On Tue, Sep 16, 2014 at 3:46 PM, Karl Wright <da...@gmail.com> wrote:
>
>> Hi Erlend,
>>
>> The zookeeper configuration supplied will likely fill up your disk with
>> zookeeper synch data, because the parameters that control the cleanup of
>> that data are not properly set up for long-term execution.
>>
>> Graeme Seaton would be the best resource for using Zookeeper properly;
>> he's on this list and I've cc'd him directly as well.
>>
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 5:06 AM, Erlend Garåsen <e....@usit.uio.no>
>> wrote:
>>
>>> On 16.09.14 10:53, lalit jangra wrote:
>>>
>>>> Hi Erlend,
>>>>
>>>> Can you please elaborate on how you have configured zookeeper based
>>>> synchronization, is it in stand alone mode or clustered mode? How many
>>>> zookeeper nodes are you running for each of node and how many agents are
>>>> you running?
>>>>
>>>
>>> I'm not very familiar with Zookeeper, so I have just followed the
>>> examples inside the multiprocess-zk-example folder, i.e.:
>>> $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
>>>         # Reading global properties:
>>>         $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
>>>         # Starting Agent process:
>>>         $MCF_HOME/processes/executecommand.sh
>>> org.apache.manifoldcf.agents.AgentRun \
>>>         1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log
>>> & pid=$!
>>>
>>> The above lines are from my startup script. I see now that I haven't
>>> specified "-Dorg.apache.manifoldcf.processid=A", I'm not sure this is
>>> important, but I can of course try to include that into my script and
>>> restart everything.
>>>
>>> So to the question about how many zookeeper nodes I'm using, the answer
>>> is one. The same applies to the number of running agents.
>>>
>>> Erlend
>>>
>>
>>
>
>
> --
> Regards,
> Lalit.
>

Re: Zookeeper configured MCF not working in production mode

Posted by lalit jangra <la...@gmail.com>.

Hello,

To keep Zookeeper from taking too much disk space, use the parameters
below. They purge old snapshots and transaction logs that are no longer
needed.

autopurge.snapRetainCount=3 (the default)
autopurge.purgeInterval=1 (in hours; the default of 0 disables purging)

Feel free to adjust these as needed.
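
As a rough sketch, the two settings could be appended to the ZooKeeper
configuration like this. The ZK_CFG path is an assumption; point it at
the zookeeper.cfg that your runzookeeper.sh actually reads:

```shell
#!/bin/sh
# Assumed location of the ZooKeeper configuration file.
ZK_CFG=${ZK_CFG:-zookeeper.cfg}

# Append the autopurge settings only if none are present yet, so that
# running this twice does not duplicate the lines.
if ! grep -q '^autopurge\.' "$ZK_CFG" 2>/dev/null; then
  cat >> "$ZK_CFG" <<'EOF'
# Keep the 3 most recent snapshots and their transaction logs.
autopurge.snapRetainCount=3
# Run the purge task every hour (0 disables purging).
autopurge.purgeInterval=1
EOF
fi
```

The settings are read at startup, so ZooKeeper has to be restarted
before purging begins.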

Regards.

On Tue, Sep 16, 2014 at 3:46 PM, Karl Wright <da...@gmail.com> wrote:

> Hi Erlend,
>
> The zookeeper configuration supplied will likely fill up your disk with
> zookeeper synch data, because the parameters that control the cleanup of
> that data are not properly set up for long-term execution.
>
> Graeme Seaton would be the best resource for using Zookeeper properly;
> he's on this list and I've cc'd him directly as well.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 5:06 AM, Erlend Garåsen <e....@usit.uio.no>
> wrote:
>
>> On 16.09.14 10:53, lalit jangra wrote:
>>
>>> Hi Erlend,
>>>
>>> Can you please elaborate on how you have configured zookeeper based
>>> synchronization, is it in stand alone mode or clustered mode? How many
>>> zookeeper nodes are you running for each of node and how many agents are
>>> you running?
>>>
>>
>> I'm not very familiar with Zookeeper, so I have just followed the
>> examples inside the multiprocess-zk-example folder, i.e.:
>> $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
>>         # Reading global properties:
>>         $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
>>         # Starting Agent process:
>>         $MCF_HOME/processes/executecommand.sh
>> org.apache.manifoldcf.agents.AgentRun \
>>         1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log
>> & pid=$!
>>
>> The above lines are from my startup script. I see now that I haven't
>> specified "-Dorg.apache.manifoldcf.processid=A", I'm not sure this is
>> important, but I can of course try to include that into my script and
>> restart everything.
>>
>> So to the question about how many zookeeper nodes I'm using, the answer
>> is one. The same applies to the number of running agents.
>>
>> Erlend
>>
>
>


-- 
Regards,
Lalit.

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Hi Erlend,

The zookeeper configuration supplied will likely fill up your disk with
zookeeper synch data, because the parameters that control the cleanup of
that data are not properly set up for long-term execution.

Graeme Seaton would be the best resource for using Zookeeper properly; he's
on this list and I've cc'd him directly as well.

Karl


On Tue, Sep 16, 2014 at 5:06 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

> On 16.09.14 10:53, lalit jangra wrote:
>
>> Hi Erlend,
>>
>> Can you please elaborate on how you have configured zookeeper based
>> synchronization, is it in stand alone mode or clustered mode? How many
>> zookeeper nodes are you running for each of node and how many agents are
>> you running?
>>
>
> I'm not very familiar with Zookeeper, so I have just followed the examples
> inside the multiprocess-zk-example folder, i.e.:
> $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
>         # Reading global properties:
>         $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
>         # Starting Agent process:
>         $MCF_HOME/processes/executecommand.sh
> org.apache.manifoldcf.agents.AgentRun \
>         1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log &
> pid=$!
>
> The above lines are from my startup script. I see now that I haven't
> specified "-Dorg.apache.manifoldcf.processid=A", I'm not sure this is
> important, but I can of course try to include that into my script and
> restart everything.
>
> So to the question about how many zookeeper nodes I'm using, the answer is
> one. The same applies to the number of running agents.
>
> Erlend
>

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Sorry, Erlend, I missed that.  The thread dump indicates that it is waiting
for the Zookeeper server to respond.  Do you have corresponding zookeeper
server logs?
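
If the server logs are hard to correlate, one way to check whether the
ZooKeeper server is still responding while the crawl hangs is its
built-in four-letter commands. This is only a sketch: zk_cmd is a
hypothetical helper, it assumes nc (netcat) is installed, and the host
and port defaults are assumptions about your setup:

```shell
#!/bin/sh
# Hypothetical helper: send a ZooKeeper four-letter command and print the
# reply. localhost:2181 is an assumption (2181 is the usual client port).
zk_cmd() {
  printf '%s' "$1" | nc "${ZK_HOST:-localhost}" "${ZK_PORT:-2181}"
}

# Example calls, to be run while the crawl appears stuck:
#   zk_cmd ruok   # a healthy server answers "imok"
#   zk_cmd stat   # server statistics and the list of connected clients
#   zk_cmd cons   # per-connection details
```

A server that does not answer ruok, or a connection count in stat that
sits at the maxClientCnxns limit, would both fit a thread dump that
shows threads waiting on ZooKeeper.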

Thanks,
Karl

On Tue, Sep 16, 2014 at 7:08 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

> On 16.09.14 12:27, Karl Wright wrote:
>
>  If you could obtain a thread dump from the agents process when MCF hangs
>> that would also be very helpful.
>>
>
> Hmm, I thought I did that:
> http://folk.uio.no/erlendfg/manifoldcf/
>
> Erlend
>

Re: Zookeeper configured MCF not working in production mode

Posted by Erlend Garåsen <e....@usit.uio.no>.

On 16.09.14 12:27, Karl Wright wrote:

> If you could obtain a thread dump from the agents process when MCF hangs
> that would also be very helpful.

Hmm, I thought I did that:
http://folk.uio.no/erlendfg/manifoldcf/

Erlend

Re: Zookeeper configured MCF not working in production mode

Posted by Karl Wright <da...@gmail.com>.

Hi Erlend,

If you could obtain a thread dump from the agents process when MCF hangs
that would also be very helpful.


IN GENERAL, when something hangs in Java, it's essential to get a thread
dump in order to diagnose the problem.
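
For reference, a thread dump can be captured roughly as follows. The
dump_threads helper is hypothetical; it assumes either the JDK's jstack
is on the PATH or that you can find the stdout of the Java process:

```shell
#!/bin/sh
# Hypothetical helper: save a thread dump of the JVM with the given pid.
dump_threads() {
  pid=$1
  out=${2:-threaddump-$(date +%Y%m%d-%H%M%S).txt}
  if command -v jstack >/dev/null 2>&1; then
    # jstack ships with the JDK; -l adds lock information to the dump.
    jstack -l "$pid" > "$out" && echo "thread dump written to $out"
  else
    # Without a JDK, SIGQUIT makes the JVM print the dump to its own
    # stdout (wherever that was redirected when the process started).
    kill -QUIT "$pid" && echo "thread dump sent to the stdout of process $pid"
  fi
}

# Usage, where 12345 stands in for the agents process id:
#   dump_threads 12345
```

Taking two or three dumps a few seconds apart makes it easier to tell a
thread that is truly stuck from one that is merely busy.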

Thanks,
Karl


On Tue, Sep 16, 2014 at 5:06 AM, Erlend Garåsen <e....@usit.uio.no>
wrote:

> On 16.09.14 10:53, lalit jangra wrote:
>
>> Hi Erlend,
>>
>> Can you please elaborate on how you have configured zookeeper based
>> synchronization, is it in stand alone mode or clustered mode? How many
>> zookeeper nodes are you running for each of node and how many agents are
>> you running?
>>
>
> I'm not very familiar with Zookeeper, so I have just followed the examples
> inside the multiprocess-zk-example folder, i.e.:
> $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
>         # Reading global properties:
>         $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
>         # Starting Agent process:
>         $MCF_HOME/processes/executecommand.sh
> org.apache.manifoldcf.agents.AgentRun \
>         1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log &
> pid=$!
>
> The above lines are from my startup script. I see now that I haven't
> specified "-Dorg.apache.manifoldcf.processid=A", I'm not sure this is
> important, but I can of course try to include that into my script and
> restart everything.
>
> So to the question about how many zookeeper nodes I'm using, the answer is
> one. The same applies to the number of running agents.
>
> Erlend
>

Re: Zookeeper configured MCF not working in production mode

Posted by Erlend Garåsen <e....@usit.uio.no>.

On 16.09.14 10:53, lalit jangra wrote:
> Hi Erlend,
>
> Can you please elaborate on how you have configured zookeeper based
> synchronization, is it in stand alone mode or clustered mode? How many
> zookeeper nodes are you running for each of node and how many agents are
> you running?

I'm not very familiar with Zookeeper, so I have just followed the 
examples inside the multiprocess-zk-example folder, i.e.:
    $MCF_HOME/../runzookeeper.sh > /dev/null 2>&1 &
    # Reading global properties:
    $MCF_HOME/../setglobalproperties.sh > /dev/null 2>&1 &
    # Starting Agent process:
    $MCF_HOME/processes/executecommand.sh org.apache.manifoldcf.agents.AgentRun \
        1>>$LOGDIR/mcf_agent.stdout.log 2>>$LOGDIR/mcf_agent.stderr.log &
    pid=$!

The above lines are from my startup script. I see now that I haven't 
specified "-Dorg.apache.manifoldcf.processid=A". I'm not sure this is 
important, but I can of course add it to my script and restart 
everything.

So to the question about how many zookeeper nodes I'm using, the answer 
is one. The same applies to the number of running agents.

Erlend

Re: Zookeeper configured MCF not working in production mode

Posted by lalit jangra <la...@gmail.com>.

Hi Erlend,

Can you please elaborate on how you have configured Zookeeper-based
synchronization? Is it in standalone mode or clustered mode? How many
Zookeeper nodes are you running on each node, and how many agents are
you running?

Regards.

On Tue, Sep 16, 2014 at 2:18 PM, Erlend Garåsen <e....@usit.uio.no>
wrote:

> I'm not able to run MCF 1.7 properly with Zookeeper-based synchronization.
> After some hours, it just stops fetching documents. Until now I have been
> using FileLockManager to get around this problem.
>
> A thread dump and my manifoldcf.log file can be found here:
> http://folk.uio.no/erlendfg/manifoldcf/
>
> Erlend
>

-- 
Regards,
Lalit.