Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2011/08/08 10:56:44 UTC

Replication heads up

Hi guys,

so we found the reason why the replication tests are failing randomly. 
Let me explain:

- the consumer stays connected to the provider until it gets 
disconnected. The connection can last for days or weeks.
- the provider pushes modifications to the consumer directly if the 
consumer is connected
- if the consumer is disconnected, the modifications are stored in a 
queue until the consumer reconnects, at which point the provider sends 
it the content of this queue

That being said, we have one corner case where the provider 'thinks' that 
the consumer is connected when it no longer is: the message is sent to 
the disconnected consumer, and we don't push it to the queue, so it is lost.

A better idea is to push *all* the modifications to the queue, no 
matter what. Then a thread will process this queue and send its contents 
to the consumer whenever the consumer is connected. In any case, we 
*don't* delete messages from the queue when they are sent. Never.
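
To make this concrete, here is a minimal Java sketch of the idea. It is 
not the ApacheDS code: the class and method names are invented for 
illustration, modifications are plain strings, and the sender thread is 
replaced by a synchronous flush() to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the ApacheDS code: all names here are invented.
final class ReplicationLogSketch {
    // Append-only log: sending an entry never removes it.
    private final List<String> log = new ArrayList<>();
    // Stand-in for the network: what the consumer actually received.
    private final List<String> delivered = new ArrayList<>();
    // Index of the first log entry that has not been sent yet.
    private int nextToSend = 0;
    private boolean connected = false;

    // Every modification goes to the log first, connected or not.
    synchronized void onModification(String modification) {
        log.add(modification);
        flush();
    }

    // Called when the consumer connects or disconnects.
    synchronized void setConnected(boolean connected) {
        this.connected = connected;
        flush();
    }

    // Sends pending entries, in order, while the consumer is connected.
    private void flush() {
        while (connected && nextToSend < log.size()) {
            delivered.add(log.get(nextToSend));
            nextToSend++;
        }
    }

    synchronized int logSize() {
        return log.size();
    }

    synchronized List<String> delivered() {
        return new ArrayList<>(delivered);
    }
}
```

Because nothing is removed on send, a message sent to a consumer that 
had silently gone away is still in the log, and can be replayed from 
the consumer's lastEntryCSN when it reconnects.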

That raises a question: what do we do in the long term? The queue will 
grow and never shrink. In fact this is quite simple: we truncate the 
queue at a defined interval (say once a day, or once a week). 
Every modification older than the interval is simply deleted from the queue.
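
A time-based truncation could look roughly like the following Java 
sketch. The names are hypothetical, and it assumes that entries are 
appended in timestamp order, so the oldest entry is always at the head 
of the queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: names and the timestamp representation are assumptions.
final class TimedLogSketch {
    static final class Entry {
        final long timestampMillis;
        final String modification;

        Entry(long timestampMillis, String modification) {
            this.timestampMillis = timestampMillis;
            this.modification = modification;
        }
    }

    // Oldest entry at the head, newest at the tail.
    private final Deque<Entry> log = new ArrayDeque<>();

    void append(long timestampMillis, String modification) {
        log.addLast(new Entry(timestampMillis, modification));
    }

    // Drops every entry older than maxAgeMillis and returns how many were removed.
    int truncate(long nowMillis, long maxAgeMillis) {
        int removed = 0;
        while (!log.isEmpty()
                && nowMillis - log.peekFirst().timestampMillis > maxAgeMillis) {
            log.removeFirst();
            removed++;
        }
        return removed;
    }

    int size() {
        return log.size();
    }
}
```

A scheduled task would call truncate() once per interval, with 
maxAgeMillis set to that interval.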

What if a consumer is not able to reconnect within this period of time? 
Simple:
- the consumer sends the lastEntryCSN it received, and if it's older 
than what's in the queue, then we do a full replication.
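
The decision itself is a single comparison. In this Java sketch the 
CSNs are modelled as long values, which is an assumption made only to 
keep the example short; the class, enum and method names are 
hypothetical too.

```java
// Hypothetical sketch: CSNs are modelled as long values for simplicity.
final class SyncDecisionSketch {
    enum Mode { INCREMENTAL, FULL }

    // FULL when the consumer's last CSN is older than the oldest entry
    // still in the queue, because the entries in between were truncated.
    static Mode choose(long consumerLastCsn, long oldestCsnInQueue) {
        return consumerLastCsn < oldestCsnInQueue ? Mode.FULL : Mode.INCREMENTAL;
    }
}
```

For example, choose(5, 10) picks a full replication, while 
choose(12, 10) only replays the queue from CSN 12 onwards.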

It may seem costly, but it's unlikely that a consumer gets disconnected 
for a long period of time. All in all, it's as if we had just added a 
brand new consumer, with nothing in it.

One option would be to ask the consumer to send a periodic message to 
the provider informing it that it's up to date. It could be a daily 
unbind/bind for instance. The unbind will kill the pending persistent 
search established between the provider and the consumer, and the bind 
will establish a new one. As the consumer will send a new request, with 
the lastEntryCSN, we will be able to truncate the provider queue, so it 
won't grow forever.

Kiran and I will probably work on this idea this week. I'm confident 
that we can get it working by the end of the week, or even earlier.

Stay tuned!

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Replication heads up

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 8/8/11 9:10 PM, Emmanuel Lecharny wrote:
> On 8/8/11 11:35 AM, Kiran Ayyagari wrote:
>> On Mon, Aug 8, 2011 at 2:26 PM, Emmanuel 
>> Lecharny<el...@gmail.com>  wrote:
>>> Hi guys,
>>>
>>> One option would be to ask the consumer to send a periodic message 
>>> to the
>>> producer informing it that it's up to date. It could be a daily 
>>> unbind/bind
>>> for instance. The unbind will kill the pending persistent search we
>>> established between the producer and consumer, to establish a new 
>>> one. As we
>>> will send a new request, with the lastEntryCSN, we will be able to 
>>> truncate
>>> the provider queue, so it won't grow forever.
>>>
>> this case is already handled(in my recent commit), i.e., when a
>> consumer reconnects we remove all the entries from log that are older
>> than the CSN value present in the cookie.
>>
>> Coming to restarting the consumer at periodic intervals is an
>> interesting idea, this perfectly solves many cases of 'how to
>> prune/truncate the log' except in cases of a consumer that never
>> reconnects, in which case we need to go for a time based policy
>
> I ran the tests 100 times in a row, no error. Seems like your fix was 
> the correct one.
Sadly, I tried a bit more today, running the test 400 times. I got 2 
failures :/

We definitely have an issue somewhere; still chasing it...

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Replication heads up

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 8/8/11 11:35 AM, Kiran Ayyagari wrote:
> On Mon, Aug 8, 2011 at 2:26 PM, Emmanuel Lecharny<el...@gmail.com>  wrote:
>> Hi guys,
>>
>> One option would be to ask the consumer to send a periodic message to the
>> producer informing it that it's up to date. It could be a daily unbind/bind
>> for instance. The unbind will kill the pending persistent search we
>> established between the producer and consumer, to establish a new one. As we
>> will send a new request, with the lastEntryCSN, we will be able to truncate
>> the provider queue, so it won't grow forever.
>>
> this case is already handled(in my recent commit), i.e., when a
> consumer reconnects we remove all the entries from log that are older
> than the CSN value present in the cookie.
>
> Coming to restarting the consumer at periodic intervals is an
> interesting idea, this perfectly solves many cases of 'how to
> prune/truncate the log' except in cases of a consumer that never
> reconnects, in which case we need to go for a time based policy

I ran the tests 100 times in a row, no error. Seems like your fix was 
the correct one.



-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Replication heads up

Posted by Pierre-Arnaud Marcelot <pa...@marcelot.net>.

Thanks for the heads up, guys!

Regards,
Pierre-Arnaud

On 8 August 2011, at 11:35, Kiran Ayyagari wrote:

> On Mon, Aug 8, 2011 at 2:26 PM, Emmanuel Lecharny <el...@gmail.com> wrote:
>> Hi guys,
>> 
>> so we found the reason why the replication tests are failing randomly. Let
>> me explain :
>> 
>> - the consumer is connected to the provider until it gets disconnected. It
>> can last for days or weeks.
>> - the producer pushes modifications to the consumer directly if the consumer
>> is connected
>> - if the consumer is disconnected, the modifications are stored in a queue,
>> waiting for the client to reconnect to send it the content of this queue
>> 
>> That being said, we have one corner case when the provider 'thinks' that the
>> consumer is connected when it's not anymore : the message is sent to the
>> disconnected client, and we don't push it to the queue, losing it.
>> 
>> One better idea is to push *all* the modifications to the queue, no matter
>> what. Then a thread will process this queue and send its contents to the
>> client, unless the client isn't connected. In any case, we *don't* delete
>> messages from the queue. Never.
>> 
>> That raises a question : what do we do in the long term ? The queue will grow
>> and never shrink. In fact this is quite simple : we truncate the queue after
>> a defined period of time (say once a day, or once a week). Every modification
>> older than the interval is simply deleted from the queue.
>> 
>> What if a consumer is not able to reconnect within this period of time ?
>> Simple :
>> - the consumer sends the lastEntryCSN it received, and if it's older than
>> what's in the queue, then we do a full replication.
>> 
>> It may seem costly, but it's unlikely that a consumer gets disconnected for
>> a long period of time. All in all, it's as if we had just added a brand new
>> consumer, with nothing in it.
>> 
>> One option would be to ask the consumer to send a periodic message to the
>> producer informing it that it's up to date. It could be a daily unbind/bind
>> for instance. The unbind will kill the pending persistent search we
>> established between the producer and consumer, to establish a new one. As we
>> will send a new request, with the lastEntryCSN, we will be able to truncate
>> the provider queue, so it won't grow forever.
>> 
> this case is already handled(in my recent commit), i.e., when a
> consumer reconnects we remove all the entries from log that are older
> than the CSN value present in the cookie.
> 
> Coming to restarting the consumer at periodic intervals is an
> interesting idea, this perfectly solves many cases of 'how to
> prune/truncate the log' except in cases of a consumer that never
> reconnects, in which case we need to go for a time based policy
> 
>> We will probably work around this idea with Kiran this week. I'm positive
>> that it can work well by the end of this week, or even earlier.
>> 
>> Stay tuned !
>> 
> thanks for putting these in ink, Emmanuel
>> --
>> Regards,
>> Cordialement,
>> Emmanuel Lécharny
>> www.iktek.com
>> 
>> 
> 
> 
> 
> -- 
> Kiran Ayyagari

Re: Replication heads up

Posted by Emmanuel Lecharny <el...@gmail.com>.

On 8/8/11 11:35 AM, Kiran Ayyagari wrote:
> On Mon, Aug 8, 2011 at 2:26 PM, Emmanuel Lecharny<el...@gmail.com>  wrote:
>> Hi guys,
>>
>>
>> One option would be to ask the consumer to send a periodic message to the
>> producer informing it that it's up to date. It could be a daily unbind/bind
>> for instance. The unbind will kill the pending persistent search we
>> established between the producer and consumer, to establish a new one. As we
>> will send a new request, with the lastEntryCSN, we will be able to truncate
>> the provider queue, so it won't grow forever.
>>
> this case is already handled(in my recent commit), i.e., when a
> consumer reconnects we remove all the entries from log that are older
> than the CSN value present in the cookie.

Hmmm. I saw your commit, and still, I have errors.

What I can guarantee is that the modifications never make it to the 
consumer when I get an error. That means we have an issue in the way we 
manage the queue...

I will try to get some more verbose traces...


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Re: Replication heads up

Posted by Kiran Ayyagari <ka...@apache.org>.

On Mon, Aug 8, 2011 at 2:26 PM, Emmanuel Lecharny <el...@gmail.com> wrote:
> Hi guys,
>
> so we found the reason why the replication tests are failing randomly. Let
> me explain :
>
> - the consumer is connected to the provider until it gets disconnected. It
> can last for days or weeks.
> - the producer pushes modifications to the consumer directly if the consumer
> is connected
> - if the consumer is disconnected, the modifications are stored in a queue,
> waiting for the client to reconnect to send it the content of this queue
>
> That being said, we have one corner case when the provider 'thinks' that the
> consumer is connected when it's not anymore : the message is sent to the
> disconnected client, and we don't push it to the queue, losing it.
>
> One better idea is to push *all* the modifications to the queue, no matter
> what. Then a thread will process this queue and send its contents to the
> client, unless the client isn't connected. In any case, we *don't* delete
> messages from the queue. Never.
>
> That raises a question : what do we do in the long term ? The queue will grow
> and never shrink. In fact this is quite simple : we truncate the queue after
> a defined period of time (say once a day, or once a week). Every modification
> older than the interval is simply deleted from the queue.
>
> What if a consumer is not able to reconnect within this period of time ?
> Simple :
> - the consumer sends the lastEntryCSN it received, and if it's older than
> what's in the queue, then we do a full replication.
>
> It may seem costly, but it's unlikely that a consumer gets disconnected for
> a long period of time. All in all, it's as if we had just added a brand new
> consumer, with nothing in it.
>
> One option would be to ask the consumer to send a periodic message to the
> producer informing it that it's up to date. It could be a daily unbind/bind
> for instance. The unbind will kill the pending persistent search we
> established between the producer and consumer, to establish a new one. As we
> will send a new request, with the lastEntryCSN, we will be able to truncate
> the provider queue, so it won't grow forever.
>
this case is already handled (in my recent commit), i.e., when a
consumer reconnects we remove all the entries from the log that are
older than the CSN value present in the cookie.

Restarting the consumer at periodic intervals is an interesting idea.
It perfectly solves many cases of 'how to prune/truncate the log',
except for a consumer that never reconnects, in which case we need to
go for a time-based policy.
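
As an illustration only (this is not the code from the commit), pruning 
on reconnect could be sketched in Java as below. It assumes one log per 
consumer, models CSNs as long values, and uses invented class and 
method names.

```java
import java.util.TreeMap;

// Hypothetical sketch: one log per consumer, CSNs modelled as long values.
final class CookiePruneSketch {
    // Entries sorted by CSN, oldest first.
    private final TreeMap<Long, String> log = new TreeMap<>();

    void append(long csn, String modification) {
        log.put(csn, modification);
    }

    // Removes every entry whose CSN is strictly older than the CSN found
    // in the consumer's cookie, and returns how many were removed.
    int pruneOlderThan(long cookieCsn) {
        int before = log.size();
        log.headMap(cookieCsn, false).clear();
        return before - log.size();
    }

    int size() {
        return log.size();
    }

    // Oldest CSN still in the log, or null when the log is empty.
    Long oldestCsn() {
        return log.isEmpty() ? null : log.firstKey();
    }
}
```

A consumer that never reconnects never triggers pruneOlderThan(), which 
is why a time-based policy is still needed as a fallback.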

> We will probably work around this idea with Kiran this week. I'm positive
> that it can work well by the end of this week, or even earlier.
>
> Stay tuned !
>
thanks for putting these in ink, Emmanuel
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>



-- 
Kiran Ayyagari