Posted to dev@manifoldcf.apache.org by Cedric Ulmer <ce...@francelabs.com> on 2022/01/06 15:35:55 UTC

RE: Zookeeper locks issue

Hi Karl,


To follow up on this issue: since you asked how we initiate the connections, here is our implementation of the connect() method. This is for a SharePoint data source:


  public void connect(final ConfigParams configParams) {
    super.connect(configParams);

    spVersion = configParams.getParameter(SharePointConfiguration.Server.VERSION);
    connectionType = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TYPE);
    sts = configParams.getParameter(SharePointConfiguration.Server.STS);
    protocol = configParams.getParameter(SharePointConfiguration.Server.PROTOCOL);
    host = configParams.getParameter(SharePointConfiguration.Server.HOST);
    port = configParams.getParameter(SharePointConfiguration.Server.PORT);
    path = configParams.getParameter(SharePointConfiguration.Server.PATH);
    username = configParams.getParameter(SharePointConfiguration.Server.USERNAME);
    password = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PASSWORD);
    socketTimeoutString = configParams.getParameter(SharePointConfiguration.Server.SOCKET_TIMEOUT);
    connectionTimeoutString = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TIMEOUT);

    proxyHost = configParams.getParameter(SharePointConfiguration.Server.PROXY_HOST);
    proxyPortStr = configParams.getParameter(SharePointConfiguration.Server.PROXY_PORT);
    proxyUser = configParams.getParameter(SharePointConfiguration.Server.PROXY_USER);
    proxyPassword = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PROXY_PASSWORD);
    proxyDomain = configParams.getParameter(SharePointConfiguration.Server.PROXY_DOMAIN);

    sharePointUrl = protocol + "://" + host;
    if (port != null && !port.isEmpty() && !port.equals("80") && !port.equals("443")) {
      sharePointUrl += ":" + port;
    }
    sharePointUrl += path;
    if (sharePointUrl.endsWith("/")) {
      sharePointUrl = sharePointUrl.substring(0, sharePointUrl.length() - 1);
    }

    final String fullDelta = configParams.getParameter(SharePointConfiguration.Delta.FULL_DELTA);
    if (fullDelta != null && !fullDelta.isEmpty()) {
      final boolean isFullDelta = Boolean.parseBoolean(fullDelta);
      if (isFullDelta) {
        connectorModel = MODEL_ADD_CHANGE;
      } else {
        connectorModel = MODEL_ADD_CHANGE_DELETE;
      }
    } else {
      connectorModel = MODEL_ADD_CHANGE_DELETE;
    }

  }
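
The URL assembly in the method above can be tried in isolation. The following is only a sketch, not the connector's code: the class and method names are hypothetical, and it assumes the default port should depend on the protocol (443 for https, 80 otherwise) and that a missing path should be skipped, whereas the code above treats both 80 and 443 as defaults and appends the path unconditionally.

```java
import java.util.Locale;

public final class SharePointUrlSketch {

  // Hypothetical helper, not part of the connector shown above.
  static String buildUrl(String protocol, String host, String port, String path) {
    final StringBuilder url = new StringBuilder(protocol).append("://").append(host);
    // Assumption: only the protocol's own default port may be omitted.
    final String defaultPort = "https".equals(protocol.toLowerCase(Locale.ROOT)) ? "443" : "80";
    if (port != null && !port.isEmpty() && !port.equals(defaultPort)) {
      url.append(':').append(port);
    }
    // Assumption: a null path is skipped instead of being appended as "null".
    if (path != null) {
      url.append(path);
    }
    // Strip a single trailing slash, as the code above does.
    if (url.charAt(url.length() - 1) == '/') {
      url.setLength(url.length() - 1);
    }
    return url.toString();
  }

  public static void main(String[] args) {
    // prints https://sp.example.com/sites/docs
    System.out.println(buildUrl("https", "sp.example.com", "443", "/sites/docs/"));
    // prints https://sp.example.com:80/sites/docs
    System.out.println(buildUrl("https", "sp.example.com", "80", "/sites/docs"));
  }
}
```

Note that nothing here performs network I/O; like the connect() method above, it only turns configuration values into a string.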

Regards,


Cédric


France Labs – Search Experts
Datafari – Discover our Version 5 released in 2021
www.datafari.com<http://www.datafari.com/>


________________________________
From: Karl Wright <da...@gmail.com>
Sent: Friday, December 10, 2021 15:32
To: dev
Subject: Re: Zookeeper locks issue

You haven't told me anything about the connectors involved in this job.
For every connector, there are connection pools of limited sizes,
partitioned by configuration parameters.  Throttling is done also on these
pools, and those require locks.

So if a connector has code for establishing a connection that can hang, it
can cause carnage.  Connection establishment in the connector contract
involves several methods, which I am sure you know.  If any one of those
can block, that is bad.  Instead they should just throw an exception.
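
As an illustration of "throw an exception instead of blocking", here is a minimal sketch that bounds a connection attempt with a timeout. It is not ManifoldCF code: the class, the callWithTimeout helper and ConnectFailedException are hypothetical names, and a real connector would more likely rely on the socket and connection timeouts of its HTTP client.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class BoundedConnectSketch {

  /** Hypothetical exception type used only in this sketch. */
  static final class ConnectFailedException extends Exception {
    ConnectFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Runs the attempt on a helper thread and gives up after timeoutMillis. */
  static <T> T callWithTimeout(Callable<T> attempt, long timeoutMillis) throws ConnectFailedException {
    final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-connect");
      t.setDaemon(true); // a stuck attempt must not keep the JVM alive
      return t;
    });
    final Future<T> future = executor.submit(attempt);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the helper thread
      throw new ConnectFailedException("Connection attempt exceeded " + timeoutMillis + " ms", e);
    } catch (ExecutionException e) {
      throw new ConnectFailedException("Connection attempt failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      throw new ConnectFailedException("Interrupted while connecting", e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    try {
      // Simulated hang: the attempt sleeps far longer than the allowed 200 ms.
      callWithTimeout(() -> { Thread.sleep(5_000); return "connected"; }, 200);
    } catch (ConnectFailedException e) {
      System.out.println(e.getMessage()); // prints Connection attempt exceeded 200 ms
    }
  }
}
```

The caller gets control back after the timeout even if the underlying call ignores interruption, which is the property that matters for anything holding a pool slot or a lock.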

Karl


On Fri, Dec 10, 2021 at 8:15 AM <ju...@francelabs.com> wrote:

> Hi,
>
> I'm not sure I understand what you mean by establishing a connection. What
> kind of connection? If you can explain when the agent needs to set
> and release locks, I'll be better able to investigate.
>
> Julien
>
> -----Original Message-----
> From: Karl Wright <da...@gmail.com>
> Sent: Thursday, December 9, 2021 15:37
> To: dev <de...@manifoldcf.apache.org>
> Subject: Re: Zookeeper locks issue
>
> The fact that you only see this on one job is pretty clear evidence that
> we are seeing a hang of some kind due to something a specific connector or
> connection is doing.
>
> I'm going to have to guess wildly here to focus us on a productive path.
> What I want to rule out is a case where the connector hangs while
> establishing a connection.  If this can happen then I could well believe
> there would be a train wreck.  Is this something you can confirm or
> disprove?
>
> Karl
>
>
> On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <
> julien.massiera@francelabs.com> wrote:
>
> > Actually, I have several jobs, but only one job runs at a time,
> > and currently the error always happens on the same one. The problem is
> > that I can't access the environment in debug mode, and I can't
> > activate debug logging because I am limited in log size, so the only
> > thing I can do is add specific log statements in specific places in the
> > code to try to understand what is happening. Where would you suggest I
> > add log entries to maximise our chances of spotting the issue?
> >
> > Julien
> >
> > On 09/12/2021 at 13:27, Karl Wright wrote:
> > > The large number of connections can happen, but usually that means
> > > something is stuck somewhere and there is a "train wreck" of other
> > > locks getting backed up.
> > >
> > > If this is completely repeatable then I think we have an opportunity
> > > to figure out why this is happening.  One thing that is clear is
> > > that this doesn't happen in other situations or in our integration
> > > tests, so that makes it necessary to ask what you may be doing
> > > differently here?
> > >
> > > I was operating on the assumption that the session just expires from
> > > lack of use, but in this case it may well be the other way around:
> > > something hangs elsewhere and a lock is held open for a very long
> > > time, long enough to exceed the timeout.  If you have dozens of jobs
> > > running it might be a challenge to do this but if you can winnow it
> > > down to a small number the logs may give us a good picture of what is
> > > happening.
> > >
> > > Karl
> > >
> > >
> > >
> > >
> > > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <
> > > julien.massiera@francelabs.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> after having increased the session lifetime by 3, the lock error
> > >> still happens and the MCF agent hangs, so all my jobs also hang.
> > >>
> > >> Also, as I said in the other thread today, I notice a very large
> > >> number of simultaneous connections from the agent to Zookeeper
> > >> (more than 1000) and I cannot tell whether that is normal.
> > >>
> > >> Can we ignore that particular error and avoid blocking an entire
> > >> MCF node?
> > >>
> > >> Julien
> > >>
> > >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> > >>> OK, that makes sense. But still, I don't understand how the "Can't
> > >>> release lock we don't hold" exception can happen, knowing for sure
> > >>> that neither the Zookeeper process nor the MCF agent process has
> > >>> been down and/or restarted. I am not sure that increasing the
> > >>> session lifetime would solve that particular issue, and since I
> > >>> have no use case that easily reproduces it, it is very complicated
> > >>> to debug.
> > >>>
> > >>> Julien
> > >>>
> > >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> > >>>> What this code is doing is interpreting exceptions back from
> > >>>> Zookeeper.  There are some kinds of exceptions it interprets as
> > >>>> "session has expired", so it rebuilds the session.
> > >>>>
> > >>>> The code is written in such a way that the locks are presumed to
> > >>>> persist beyond the session.  In fact, if they do not persist beyond
> > >>>> the session, there is a risk that proper locks won't be enforced.
> > >>>>
> > >>>> If I recall correctly, we have a number of integration tests that
> > >>>> exercise Zookeeper integration that are meant to allow sessions
> > >>>> to expire and be re-established.  If what you say is true and
> > >>>> information is attached solely to a session, Zookeeper cannot
> > >>>> possibly work as the cross-process lock mechanism we use it for.
> > >>>> And yet it is used not just by us in this way, but by many other
> > >>>> projects as well.
> > >>>>
> > >>>> So I think that the diagnosis that nodes in Zookeeper have
> > >>>> session affinity is not absolutely correct.  It may be the case
> > >>>> that only one session *owns* a node, and if that session expires
> > >>>> then the node goes away.  In that case I think the right approach
> > >>>> is to modify the zookeeper parameters to increase the session
> > >>>> lifetime; I don't see any other way to prevent bad things from
> > >>>> happening.  Presumably, if a session is created within a process,
> > >>>> and the process dies, the session does too.
> > >>>>
> > >>>> Karl
> > >>>>
> > >>>>
> > >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
> > >>>> julien.massiera@francelabs.com> wrote:
> > >>>>
> > >>>>> Karl,
> > >>>>>
> > >>>>> I tried to understand the Zookeeper lock logic in the code, and
> > >>>>> the only thing I don't understand is the
> > >>>>> 'handleEphemeralNodeKeeperException' method that is called in
> > >>>>> the catch(KeeperException e) of every obtain/release lock method
> > >>>>> of the ZooKeeperConnection class.
> > >>>>>
> > >>>>> This method sets the lockNode param to 'null', recreates a
> > >>>>> session and recreates nodes, but does not reset the lockNode
> > >>>>> param at the end.  So, as I understand it, if this happens it may
> > >>>>> result in the lock release error that I mentioned, because this
> > >>>>> error is triggered when the lockNode param is 'null'.
> > >>>>>
> > >>>>> The method is in the class
> > >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If
> > >>>>> you can take a look and tell me what you think about it, that
> > >>>>> would be great!
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Julien
> > >>>>>
> > >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> > >>>>>> Yes, I will then try the patch and see if it is working
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>>
> > >>>>>> Julien
> > >>>>>>
> > >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> > >>>>>>> Yes, this is plausible.  But I'm not sure what the solution is.
> > >>>>>>> If a zookeeper session disappears, according to the documentation
> > >>>>>>> everything associated with that session should also disappear.
> > >>>>>>>
> > >>>>>>> So I guess we could catch this error and just ignore it,
> > >>>>>>> assuming that the session must be gone anyway?
> > >>>>>>>
> > >>>>>>> Karl
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> > >>>>>>> julien.massiera@francelabs.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> the Zookeeper lock error mentioned in the second-to-last
> > >>>>>>>> comment of this issue
> > >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1447:
> > >>>>>>>>
> > >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> > >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> > >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> > >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> > >>>>>>>>
> > >>>>>>>> is still happening in 2021 with the 2.20 version of MCF.
> > >>>>>>>>
> > >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper
> > >>>>>>>> being restarted while the MCF agent is still running, but
> > >>>>>>>> after some investigation, my theory is that it is related to
> > >>>>>>>> re-established sessions. Locks are associated not with a
> > >>>>>>>> process but with a session, and it could happen that when a
> > >>>>>>>> session is closed accidentally (interrupted by exceptions,
> > >>>>>>>> etc.), it does not correctly release the locks it set.
> > >>>>>>>> When a new session is created by Zookeeper for the same
> > >>>>>>>> client, the locks cannot be released because they belong to
> > >>>>>>>> an old session, and the exception is thrown!
> > >>>>>>>>
> > >>>>>>>> Does that seem plausible to you? I have no knowledge of
> > >>>>>>>> Zookeeper, but if it is plausible, then it is worth looking
> > >>>>>>>> into the code to check that all locks are released when a
> > >>>>>>>> session is closed or interrupted by a problem.
> > >>>>>>>>
> > >>>>>>>> Julien
> > >>>>>>>>
> >
>
>
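
Julien's reading of handleEphemeralNodeKeeperException earlier in this thread (the lockNode value is set to null when the session is rebuilt, and a later release then fails) can be illustrated with a toy model. This is a deliberately simplified sketch and does not reproduce ZooKeeperConnection: only the lockNode name and the exception message come from the thread, while the class, the other fields and the methods are hypothetical, and no ZooKeeper client is involved.

```java
public final class LockNodeSketch {

  /** Simplified stand-in for a connection that tracks one lock node. */
  static final class ToyConnection {
    private String lockNode;   // node we believe we hold, or null
    private int sessionId = 1; // hypothetical session counter

    void obtainLock(String path) {
      lockNode = path + "#session-" + sessionId;
    }

    /** Models a session rebuild: the old node is forgotten and nothing is re-acquired. */
    void rebuildSession() {
      lockNode = null;
      sessionId++;
    }

    void releaseLock() {
      if (lockNode == null) {
        throw new IllegalStateException("Can't release lock we don't hold");
      }
      lockNode = null;
    }
  }

  public static void main(String[] args) {
    ToyConnection connection = new ToyConnection();
    connection.obtainLock("/locks/pool");
    connection.rebuildSession(); // session expired while the lock was held
    try {
      connection.releaseLock();
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints Can't release lock we don't hold
    }
  }
}
```

In this model the caller still believes it holds the lock after the rebuild, so its matching release is what fails. Whether the real code reaches that state is exactly the open question in the thread.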

RE: Zookeeper locks issue

Posted by Julien Massiera <ju...@francelabs.com>.
Hi, 

Here is the pipeline of connectors that we use for the job:

1. SharePoint (MCF connector customized to fit client needs; the connect() method code is the one provided by Cédric)
2. Tika rmeta (MCF original)
3. ForcedMetadataConnector (MCF original)
4. (Output) Solr (MCF original)

Regards,
Julien

-----Original Message-----
From: Karl Wright <da...@gmail.com>
Sent: Wednesday, January 12, 2022 02:35
To: dev <de...@manifoldcf.apache.org>
Subject: Re: Zookeeper locks issue

If you are using existing connectors we are shipping, then all you need to do is tell us which ones are involved in your entire job pipeline.

If you have any CUSTOM connectors involved, then I need to know what they are doing, especially if the connect() method may hang.

It sounds like your repository connector is Sharepoint.  That is fine, but it is only one connector.  What is your output connector?  Do you have any transformation connections? etc.

Karl


On Tue, Jan 11, 2022 at 4:57 PM Cedric Ulmer <ce...@francelabs.com>
wrote:

> Hi,
>
>
> would you also like us to share the disconnect() method?
>
>
> Cedric


Re: Zookeeper locks issue

Posted by Karl Wright <da...@gmail.com>.
If you are using existing connectors we are shipping, then all you need to
do is tell us which ones are involved in your entire job pipeline.

If you have any CUSTOM connectors involved, then I need to know what they
are doing, especially if the connect() method may hang.

It sounds like your repository connector is SharePoint.  That is fine, but
it is only one connector.  What is your output connector?  Do you have any
transformation connections, etc.?

Karl


On Tue, Jan 11, 2022 at 4:57 PM Cedric Ulmer <ce...@francelabs.com>
wrote:

> Hi,
>
>
> Would you also like to have a look at the disconnect() method?
>
>
> Cedric
>
>
> ________________________________
> From: Cedric Ulmer <ce...@francelabs.com>
> Sent: Thursday, 6 January 2022 16:35:55
> To: dev
> Subject: RE: Zookeeper locks issue
>
> Hi Karl,
>
>
> To follow up on this issue: since you asked how we initiate the
> connections, here is how we implement the connect() method. This is for
> a SharePoint data source:
>
>
>   public void connect(final ConfigParams configParams) {
>     super.connect(configParams);
>
>     spVersion = configParams.getParameter(SharePointConfiguration.Server.VERSION);
>     connectionType = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TYPE);
>     sts = configParams.getParameter(SharePointConfiguration.Server.STS);
>     protocol = configParams.getParameter(SharePointConfiguration.Server.PROTOCOL);
>     host = configParams.getParameter(SharePointConfiguration.Server.HOST);
>     port = configParams.getParameter(SharePointConfiguration.Server.PORT);
>     path = configParams.getParameter(SharePointConfiguration.Server.PATH);
>     username = configParams.getParameter(SharePointConfiguration.Server.USERNAME);
>     password = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PASSWORD);
>     socketTimeoutString = configParams.getParameter(SharePointConfiguration.Server.SOCKET_TIMEOUT);
>     connectionTimeoutString = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TIMEOUT);
>
>     proxyHost = configParams.getParameter(SharePointConfiguration.Server.PROXY_HOST);
>     proxyPortStr = configParams.getParameter(SharePointConfiguration.Server.PROXY_PORT);
>     proxyUser = configParams.getParameter(SharePointConfiguration.Server.PROXY_USER);
>     proxyPassword = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PROXY_PASSWORD);
>     proxyDomain = configParams.getParameter(SharePointConfiguration.Server.PROXY_DOMAIN);
>
>     sharePointUrl = protocol + "://" + host;
>     if (port != null && !port.isEmpty() && !port.equals("80") && !port.equals("443")) {
>       sharePointUrl += ":" + port;
>     }
>     sharePointUrl += path;
>     if (sharePointUrl.endsWith("/")) {
>       sharePointUrl = sharePointUrl.substring(0, sharePointUrl.length() - 1);
>     }
>
>     final String fullDelta = configParams.getParameter(SharePointConfiguration.Delta.FULL_DELTA);
>     if (fullDelta != null && !fullDelta.isEmpty()) {
>       final boolean isFullDelta = Boolean.parseBoolean(fullDelta);
>       if (isFullDelta) {
>         connectorModel = MODEL_ADD_CHANGE;
>       } else {
>         connectorModel = MODEL_ADD_CHANGE_DELETE;
>       }
>     } else {
>       connectorModel = MODEL_ADD_CHANGE_DELETE;
>     }
>
>   }
>
> Regards,
>
>
> Cédric
>
>
> France Labs – Search Experts
> Datafari – Discover our Version 5 released in 2021
> www.datafari.com
>
>
> ________________________________
> From: Karl Wright <da...@gmail.com>
> Sent: Friday, 10 December 2021 15:32
> To: dev
> Subject: Re: Zookeeper locks issue
>
> You haven't told me anything about the connectors involved in this job.
> For every connector, there are connection pools of limited sizes,
> partitioned by configuration parameters.  Throttling is done also on these
> pools, and those require locks.
>
> So if a connector has code for establishing a connection that can hang, it
> can cause carnage.  Connection establishment in the connector contract
> involves several methods, which I am sure you know.  If any one of those
> can block, that is bad.  Instead they should just throw an exception.
>
> Karl
>
>
> On Fri, Dec 10, 2021 at 8:15 AM <ju...@francelabs.com> wrote:
>
> > Hi,
> >
> > I'm not sure I understand what you mean by establishing a connection.
> > What kind of connection? If you can explain to me when the agent needs
> > to set and release locks, I'll be able to investigate better.
> >
> > Julien
> >
> > -----Original Message-----
> > From: Karl Wright <da...@gmail.com>
> > Sent: Thursday, 9 December 2021 15:37
> > To: dev <de...@manifoldcf.apache.org>
> > Subject: Re: Zookeeper locks issue
> >
> > The fact that you only see this on one job is pretty clear evidence
> > that we are seeing a hang of some kind due to something a specific
> > connector or connection is doing.
> >
> > I'm going to have to guess wildly here to focus us on a productive path.
> > What I want to rule out is a case where the connector hangs while
> > establishing a connection.  If this can happen then I could well believe
> > there would be a train wreck.  Is this something you can confirm or
> > disprove?
> >
> > Karl
> >
> >
> > On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <
> > julien.massiera@francelabs.com> wrote:
> >
> > > Actually, I have several jobs, but only one job is running at a time,
> > > and currently the error always happens on the same one. The problem is
> > > that I can't access the environment in debug mode, and I can't
> > > activate debug logging because I am limited in log size. The only thing
> > > I can do is add specific log statements in specific places in the code
> > > to try to understand what is happening. Where would you suggest adding
> > > log entries to maximize our chances of spotting the issue?
> > >
> > > Julien
> > >
> > > On 09/12/2021 at 13:27, Karl Wright wrote:
> > > > The large number of connections can happen, but usually that means
> > > > something is stuck somewhere and there is a "train wreck" of other
> > > > locks getting backed up.
> > > >
> > > > If this is completely repeatable then I think we have an opportunity
> > > > to figure out why this is happening.  One thing that is clear is
> > > > that this doesn't happen in other situations or in our integration
> > > > tests, so that makes it necessary to ask what you may be doing
> > > > differently here.
> > > >
> > > > I was operating on the assumption that the session just expires from
> > > > lack of use, but in this case it may well be the other way around:
> > > > something hangs elsewhere and a lock is held open for a very long
> > > > time, long enough to exceed the timeout.  If you have dozens of jobs
> > > > running it might be a challenge to do this, but if you can winnow it
> > > > down to a small number the logs may give us a good picture of what is
> > > > happening.
> > > >
> > > > Karl
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <
> > > > julien.massiera@francelabs.com> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> After increasing the session lifetime by 3, the lock error
> > > >> still happens and the MCF agent hangs, so all my jobs also hang.
> > > >>
> > > >> Also, as I said in the other thread today, I notice a very large
> > > >> number of simultaneous connections from the agent to Zookeeper
> > > >> (more than 1000) and I cannot tell whether that is normal.
> > > >>
> > > >> Can we ignore that particular error and avoid blocking an entire
> > > >> MCF node?
> > > >>
> > > >> Julien
> > > >>
> > > >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> > > >>> Ok that makes sense. But still, I don't understand how the "Can't
> > > >>> release lock we don't hold" exception can happen, knowing for sure
> > > >>> that neither the Zookeeper process nor the MCF agent process has
> > > >>> been down and/or restarted. I am not sure that increasing the
> > > >>> session lifetime would solve that particular issue, and since I have
> > > >>> no use case to easily reproduce it, it is very complicated to debug.
> > > >>>
> > > >>> Julien
> > > >>>
> > > >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> > > >>>> What this code is doing is interpreting exceptions back from
> > > >>>> Zookeeper. There are some kinds of exceptions it interprets as
> > > >>>> "session has expired", so it rebuilds the session.
> > > >>>>
> > > >>>> The code is written in such a way that the locks are presumed to
> > > >>>> persist beyond the session.  In fact, if they do not persist
> > > >>>> beyond the session, there is a risk that proper locks won't be
> > > >>>> enforced.
> > > >>>>
> > > >>>> If I recall correctly, we have a number of integration tests that
> > > >>>> exercise Zookeeper integration that are meant to allow sessions
> > > >>>> to expire and be re-established.  If what you say is true and
> > > >>>> information is attached solely to a session, Zookeeper cannot
> > > >>>> possibly work as the cross-process lock mechanism we use it for.
> > > >>>> And yet it is used not just by us in this way, but by many other
> > > >>>> projects as well.
> > > >>>>
> > > >>>> So I think that the diagnosis that nodes in Zookeeper have
> > > >>>> session affinity is not absolutely correct. It may be the case
> > > >>>> that only one session *owns* a node, and if that session expires
> > > >>>> then the node goes away.  In that case I think the right approach
> > > >>>> is to modify the zookeeper parameters to increase the session
> > > >>>> lifetime; I don't see any other way to prevent bad things from
> > > >>>> happening.  Presumably, if a session is created within a process,
> > > >>>> and the process dies, the session does too.
> > > >>>>
> > > >>>> Karl
> > > >>>>
> > > >>>>
> > > >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
> > > >>>> julien.massiera@francelabs.com> wrote:
> > > >>>>
> > > >>>>> Karl,
> > > >>>>>
> > > >>>>> I tried to understand the Zookeeper lock logic in the code, and
> > > >>>>> the only thing I don't understand is the
> > > >>>>> 'handleEphemeralNodeKeeperException' method that is called in the
> > > >>>>> catch(KeeperException e) of every obtain/release lock method of
> > > >>>>> the ZooKeeperConnection class.
> > > >>>>>
> > > >>>>> This method sets the lockNode param to 'null', recreates a session
> > > >>>>> and recreates nodes, but does not reset the lockNode param at the
> > > >>>>> end. So, as I understand it, if this happens it may result in the
> > > >>>>> lock release error that I mentioned, because this error is
> > > >>>>> triggered when the lockNode param is 'null'.
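
The failure mode described above can be modelled in a few lines. This is only a toy sketch under stated assumptions, not the actual ZooKeeperConnection code: the class LockNodeSketch, its methods and the sessionId counter are hypothetical, and only the final exception message is taken from the stack trace in this thread.

```java
public class LockNodeSketch {
  // Path of the node representing the held lock, or null when none is tracked.
  private String lockNode;
  // Stand-in for the session identity (hypothetical, for illustration only).
  private int sessionId = 1;

  /** Record that a lock is held under the current session. */
  public void obtainLock(String path) {
    if (lockNode != null) {
      throw new IllegalStateException("Lock already tracked: " + lockNode);
    }
    lockNode = path + "#session" + sessionId;
  }

  /** Models the recovery path as described: forget lockNode, start a new session. */
  public void handleSessionLoss() {
    lockNode = null;   // the reference to the held lock is dropped here...
    sessionId++;       // ...and nothing restores it once the new session exists
  }

  /** The caller still believes it holds the lock and tries to release it. */
  public void releaseLock() {
    if (lockNode == null) {
      throw new IllegalStateException("Can't release lock we don't hold");
    }
    lockNode = null;
  }
}
```

Calling obtainLock(), then handleSessionLoss(), then releaseLock() throws the IllegalStateException, even though no process was restarted. Whether the real code follows this sequence is exactly the open question here.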
> > > >>>>>
> > > >>>>> The method is in the class
> > > >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If
> > > >>>>> you can take a look and tell me what you think about it, that
> > > >>>>> would be great!
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Julien
> > > >>>>>
> > > >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> > > >>>>>> Yes, I will then try the patch and see if it is working
> > > >>>>>>
> > > >>>>>> Regards,
> > > >>>>>>
> > > >>>>>> Julien
> > > >>>>>>
> > > >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> > > >>>>>>> Yes, this is plausible.  But I'm not sure what the solution is.
> > > >>>>>>> If a zookeeper session disappears, according to the
> > > >>>>>>> documentation everything associated with that session should
> > > >>>>>>> also disappear.
> > > >>>>>>>
> > > >>>>>>> So I guess we could catch this error and just ignore it,
> > > >>>>>>> assuming that the session must be gone anyway?
> > > >>>>>>>
> > > >>>>>>> Karl
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> > > >>>>>>> julien.massiera@francelabs.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> The Zookeeper lock error mentioned in the second-to-last
> > > >>>>>>>> comment of this issue
> > > >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1447:
> > > >>>>>>>>
> > > >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> > > >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> > > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> > > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> > > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> > > >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> > > >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> > > >>>>>>>>
> > > >>>>>>>> is still happening in 2021 with version 2.20 of MCF.
> > > >>>>>>>>
> > > >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper
> > > >>>>>>>> being restarted while the MCF agent is still running, but
> > > >>>>>>>> after some investigation, my theory is that it is related to
> > > >>>>>>>> re-established sessions. Locks are not associated with a
> > > >>>>>>>> process but with a session, and it could happen that when a
> > > >>>>>>>> session is closed accidentally (interrupted by exceptions,
> > > >>>>>>>> etc.), it does not correctly release the locks it set. When a
> > > >>>>>>>> new session is created by Zookeeper for the same client, the
> > > >>>>>>>> locks cannot be released because they belong to an old
> > > >>>>>>>> session, and the exception is thrown!
> > > >>>>>>>>
> > > >>>>>>>> Does that sound plausible to you? I have no knowledge of
> > > >>>>>>>> Zookeeper, but if it is plausible, then it is worth
> > > >>>>>>>> investigating the code to check that all locks are released
> > > >>>>>>>> when a session is closed or interrupted by a problem.
> > > >>>>>>>>
> > > >>>>>>>> Julien
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>
> > > >>>>>
> > >
> >
> >
>

RE: Zookeeper locks issue

Posted by Cedric Ulmer <ce...@francelabs.com>.
Hi,


Would you also like to have a look at the disconnect() method?


Cedric


________________________________
From: Cedric Ulmer <ce...@francelabs.com>
Sent: Thursday, 6 January 2022 16:35:55
To: dev
Subject: RE: Zookeeper locks issue

Hi Karl,


To follow up on this issue: since you asked how we initiate the connections, here is how we implement the connect() method. This is for a SharePoint data source:


  public void connect(final ConfigParams configParams) {
    super.connect(configParams);

    spVersion = configParams.getParameter(SharePointConfiguration.Server.VERSION);
    connectionType = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TYPE);
    sts = configParams.getParameter(SharePointConfiguration.Server.STS);
    protocol = configParams.getParameter(SharePointConfiguration.Server.PROTOCOL);
    host = configParams.getParameter(SharePointConfiguration.Server.HOST);
    port = configParams.getParameter(SharePointConfiguration.Server.PORT);
    path = configParams.getParameter(SharePointConfiguration.Server.PATH);
    username = configParams.getParameter(SharePointConfiguration.Server.USERNAME);
    password = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PASSWORD);
    socketTimeoutString = configParams.getParameter(SharePointConfiguration.Server.SOCKET_TIMEOUT);
    connectionTimeoutString = configParams.getParameter(SharePointConfiguration.Server.CONNECTION_TIMEOUT);

    proxyHost = configParams.getParameter(SharePointConfiguration.Server.PROXY_HOST);
    proxyPortStr = configParams.getParameter(SharePointConfiguration.Server.PROXY_PORT);
    proxyUser = configParams.getParameter(SharePointConfiguration.Server.PROXY_USER);
    proxyPassword = configParams.getObfuscatedParameter(SharePointConfiguration.Server.PROXY_PASSWORD);
    proxyDomain = configParams.getParameter(SharePointConfiguration.Server.PROXY_DOMAIN);

    sharePointUrl = protocol + "://" + host;
    if (port != null && !port.isEmpty() && !port.equals("80") && !port.equals("443")) {
      sharePointUrl += ":" + port;
    }
    sharePointUrl += path;
    if (sharePointUrl.endsWith("/")) {
      sharePointUrl = sharePointUrl.substring(0, sharePointUrl.length() - 1);
    }

    final String fullDelta = configParams.getParameter(SharePointConfiguration.Delta.FULL_DELTA);
    if (fullDelta != null && !fullDelta.isEmpty()) {
      final boolean isFullDelta = Boolean.parseBoolean(fullDelta);
      if (isFullDelta) {
        connectorModel = MODEL_ADD_CHANGE;
      } else {
        connectorModel = MODEL_ADD_CHANGE_DELETE;
      }
    } else {
      connectorModel = MODEL_ADD_CHANGE_DELETE;
    }

  }
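
As far as this snippet shows, connect() only reads configuration values and builds strings, so the method itself performs no network I/O (what super.connect() does is not shown here). The URL assembly can be checked in isolation with a small sketch like the one below. The class SharePointUrlSketch and the method buildUrl are hypothetical names introduced for illustration, and the null guard on path is an addition that the method above does not have.

```java
public class SharePointUrlSketch {
  /** Mirrors the URL assembly in connect(): default ports are omitted, a trailing slash is stripped. */
  public static String buildUrl(String protocol, String host, String port, String path) {
    StringBuilder url = new StringBuilder(protocol).append("://").append(host);
    // Same condition as in connect(): 80 and 443 are dropped whatever the protocol is.
    if (port != null && !port.isEmpty() && !port.equals("80") && !port.equals("443")) {
      url.append(':').append(port);
    }
    // Assumption: skip a missing path instead of appending the string "null".
    if (path != null) {
      url.append(path);
    }
    String result = url.toString();
    return result.endsWith("/") ? result.substring(0, result.length() - 1) : result;
  }
}
```

For example, buildUrl("https", "sp.example.com", "443", "/sites/docs/") yields "https://sp.example.com/sites/docs". Note that a non-default pairing such as https on port 80 would also lose its port with this condition.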

Regards,


Cédric


France Labs – Search Experts
Datafari – Discover our Version 5 released in 2021
www.datafari.com


________________________________
From: Karl Wright <da...@gmail.com>
Sent: Friday, 10 December 2021 15:32
To: dev
Subject: Re: Zookeeper locks issue

You haven't told me anything about the connectors involved in this job.
For every connector, there are connection pools of limited sizes,
partitioned by configuration parameters.  Throttling is done also on these
pools, and those require locks.

So if a connector has code for establishing a connection that can hang, it
can cause carnage.  Connection establishment in the connector contract
involves several methods, which I am sure you know.  If any one of those
can block, that is bad.  Instead they should just throw an exception.

Karl
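
One way to make a connection attempt fail with an exception instead of blocking is to run it under a timeout. The following is only a minimal sketch in plain Java, not ManifoldCF code: FailFastConnect, callWithTimeout and ConnectFailedException are hypothetical names, and a real connector would throw whatever exception type the framework expects.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastConnect {
  /** Hypothetical exception type used only in this sketch. */
  public static class ConnectFailedException extends Exception {
    public ConnectFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Runs the attempt on a helper thread and gives up after timeoutMillis. */
  public static <T> T callWithTimeout(Callable<T> attempt, long timeoutMillis)
      throws ConnectFailedException {
    // Daemon thread, so an attempt that never returns cannot keep the JVM alive.
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "connect-attempt");
      t.setDaemon(true);
      return t;
    });
    Future<T> future = executor.submit(attempt);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the helper thread
      throw new ConnectFailedException(
          "Connection attempt exceeded " + timeoutMillis + " ms", e);
    } catch (ExecutionException e) {
      throw new ConnectFailedException("Connection attempt failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      future.cancel(true);
      throw new ConnectFailedException("Interrupted while connecting", e);
    } finally {
      executor.shutdownNow();
    }
  }
}
```

The caller gets control back after the timeout, so whatever it holds can be released. If the underlying I/O ignores interrupts, the helper thread may linger, so socket and connection timeouts on the client itself are still needed.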


On Fri, Dec 10, 2021 at 8:15 AM <ju...@francelabs.com> wrote:

> Hi,
>
> I'm not sure I understand what you mean by establishing a connection.
> What kind of connection? If you can explain to me when the agent needs
> to set and release locks, I'll be able to investigate better.
>
> Julien
>
> -----Original Message-----
> From: Karl Wright <da...@gmail.com>
> Sent: Thursday, 9 December 2021 15:37
> To: dev <de...@manifoldcf.apache.org>
> Subject: Re: Zookeeper locks issue
>
> The fact that you only see this on one job is pretty clear evidence that
> we are seeing a hang of some kind due to something a specific connector or
> connection is doing.
>
> I'm going to have to guess wildly here to focus us on a productive path.
> What I want to rule out is a case where the connector hangs while
> establishing a connection.  If this can happen then I could well believe
> there would be a train wreck.  Is this something you can confirm or
> disprove?
>
> Karl
>
>
> On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <
> julien.massiera@francelabs.com> wrote:
>
> > Actually, I have several jobs, but only one job is running at a time,
> > and currently the error always happens on the same one. The problem is
> > that I can't access the environment in debug mode, and I can't
> > activate debug logging because I am limited in log size. The only thing
> > I can do is add specific log statements in specific places in the code
> > to try to understand what is happening. Where would you suggest adding
> > log entries to maximize our chances of spotting the issue?
> >
> > Julien
> >
> > On 09/12/2021 at 13:27, Karl Wright wrote:
> > > The large number of connections can happen, but usually that means
> > > something is stuck somewhere and there is a "train wreck" of other
> > > locks getting backed up.
> > >
> > > If this is completely repeatable then I think we have an opportunity
> > > to figure out why this is happening.  One thing that is clear is
> > > that this doesn't happen in other situations or in our integration
> > > tests, so that makes it necessary to ask what you may be doing
> > > differently here.
> > >
> > > I was operating on the assumption that the session just expires from
> > > lack of use, but in this case it may well be the other way around:
> > > something hangs elsewhere and a lock is held open for a very long
> > > time, long enough to exceed the timeout.  If you have dozens of jobs
> > > running it might be a challenge to do this, but if you can winnow it
> > > down to a small number the logs may give us a good picture of what is
> > > happening.
> > >
> > > Karl
> > >
> > >
> > >
> > >
> > > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <
> > > julien.massiera@francelabs.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> After increasing the session lifetime by 3, the lock error
> > >> still happens and the MCF agent hangs, so all my jobs also hang.
> > >>
> > >> Also, as I said in the other thread today, I notice a very large
> > >> number of simultaneous connections from the agent to Zookeeper
> > >> (more than 1000) and I cannot tell whether that is normal.
> > >>
> > >> Can we ignore that particular error and avoid blocking an entire
> > >> MCF node?
> > >>
> > >> Julien
> > >>
> > >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> > >>> Ok that makes sense. But still, I don't understand how the "Can't
> > >>> release lock we don't hold" exception can happen, knowing for sure
> > >>> that neither the Zookeeper process nor the MCF agent process has
> > >>> been down and/or restarted. I am not sure that increasing the
> > >>> session lifetime would solve that particular issue, and since I have
> > >>> no use case to easily reproduce it, it is very complicated to debug.
> > >>>
> > >>> Julien
> > >>>
> > >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> > >>>> What this code is doing is interpreting exceptions back from
> > >>>> Zookeeper. There are some kinds of exceptions it interprets as
> > >>>> "session has expired", so it rebuilds the session.
> > >>>>
> > >>>> The code is written in such a way that the locks are presumed to
> > >>>> persist beyond the session.  In fact, if they do not persist
> > >>>> beyond the session, there is a risk that proper locks won't be
> > >>>> enforced.
> > >>>>
> > >>>> If I recall correctly, we have a number of integration tests that
> > >>>> exercise Zookeeper integration that are meant to allow sessions
> > >>>> to expire and be re-established.  If what you say is true and
> > >>>> information is attached solely to a session, Zookeeper cannot
> > >>>> possibly work as the cross-process lock mechanism we use it for.
> > >>>> And yet it is used not just by us in this way, but by many other
> > >>>> projects as well.
> > >>>>
> > >>>> So I think that the diagnosis that nodes in Zookeeper have
> > >>>> session affinity is not absolutely correct. It may be the case
> > >>>> that only one session *owns* a node, and if that session expires
> > >>>> then the node goes away.  In that case I think the right approach
> > >>>> is to modify the zookeeper parameters to increase the session
> > >>>> lifetime; I don't see any other way to prevent bad things from
> > >>>> happening.  Presumably, if a session is created within a process,
> > >>>> and the process dies, the session does too.
> > >>>>
> > >>>> Karl
> > >>>>
> > >>>>
> > >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
> > >>>> julien.massiera@francelabs.com> wrote:
> > >>>>
> > >>>>> Karl,
> > >>>>>
> > >>>>> I tried to understand the Zookeeper lock logic in the code, and
> > >>>>> the only thing I don't understand is the
> > >>>>> 'handleEphemeralNodeKeeperException' method that is called in the
> > >>>>> catch(KeeperException e) of every obtain/release lock method of
> > >>>>> the ZooKeeperConnection class.
> > >>>>>
> > >>>>> This method sets the lockNode param to 'null', recreates a session
> > >>>>> and recreates nodes, but does not reset the lockNode param at the
> > >>>>> end. So, as I understand it, if this happens it may result in the
> > >>>>> lock release error that I mentioned, because this error is
> > >>>>> triggered when the lockNode param is 'null'.
> > >>>>>
> > >>>>> The method is in the class
> > >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If
> > >>>>> you can take a look and tell me what you think about it, that
> > >>>>> would be great!
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Julien
> > >>>>>
> > >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> > >>>>>> Yes, I will then try the patch and see if it is working
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>>
> > >>>>>> Julien
> > >>>>>>
> > >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> > >>>>>>> Yes, this is plausible.  But I'm not sure what the solution is.
> > >>>>>>> If a zookeeper session disappears, according to the
> > >>>>>>> documentation everything associated with that session should
> > >>>>>>> also disappear.
> > >>>>>>>
> > >>>>>>> So I guess we could catch this error and just ignore it,
> > >>>>>>> assuming that the session must be gone anyway?
> > >>>>>>>
> > >>>>>>> Karl
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> > >>>>>>> julien.massiera@francelabs.com> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> The Zookeeper lock error mentioned in the second-to-last
> > >>>>>>>> comment of this issue
> > >>>>>>>> https://issues.apache.org/jira/browse/CONNECTORS-1447:
> > >>>>>>>>
> > >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> > >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> > >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> > >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> > >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> > >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> > >>>>>>>>
> > >>>>>>>> is still happening in 2021 with version 2.20 of MCF.
> > >>>>>>>>
> > >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper
> > >>>>>>>> being restarted while the MCF agent is still running, but
> > >>>>>>>> after some investigation, my theory is that it is related to
> > >>>>>>>> re-established sessions. Locks are not associated with a
> > >>>>>>>> process but with a session, and it could happen that when a
> > >>>>>>>> session is closed accidentally (interrupted by exceptions,
> > >>>>>>>> etc.), it does not correctly release the locks it set. When a
> > >>>>>>>> new session is created by Zookeeper for the same client, the
> > >>>>>>>> locks cannot be released because they belong to an old
> > >>>>>>>> session, and the exception is thrown!
> > >>>>>>>>
> > >>>>>>>> Does that sound plausible to you? I have no knowledge of
> > >>>>>>>> Zookeeper, but if it is plausible, then it is worth
> > >>>>>>>> investigating the code to check that all locks are released
> > >>>>>>>> when a session is closed or interrupted by a problem.
> > >>>>>>>>
> > >>>>>>>> Julien
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>
> > >>>>>
> >
>
>