Posted to dev@lucene.apache.org by Ramsey Haddad <ra...@gmail.com> on 2016/02/25 16:24:24 UTC

PeerSync.java: why "completeList" in handleVersions()?

Does "!completeList" do anything necessary in the line:

if (!completeList && Math.abs(otherVersion) < ourLowThreshold) break;

I think the line should simply be:

if (Math.abs(otherVersion) < ourLowThreshold) break;
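To make the difference concrete, here is a minimal sketch of the two variants of the check. It is a simplified model, not the actual PeerSync code: the class name, the method toRequest and the flag useCompleteListTerm are hypothetical, and it assumes that otherVersions arrives sorted newest-first by absolute value and that deletes carry negative versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ThresholdCheckSketch {
    // Simplified model of the loop around the quoted line (an assumption, not Solr's code).
    // Returns the versions the syncing replica would ask the other replica for.
    static List<Long> toRequest(List<Long> otherVersions, Set<Long> ourUpdateSet,
                                long ourLowThreshold, boolean completeList,
                                boolean useCompleteListTerm) {
        List<Long> requested = new ArrayList<>();
        for (long otherVersion : otherVersions) {
            boolean tooOld = Math.abs(otherVersion) < ourLowThreshold;
            // Current form: the threshold only applies when completeList is false.
            // Proposed form: the threshold always applies.
            boolean stop = useCompleteListTerm ? (!completeList && tooOld) : tooOld;
            if (stop) break;
            if (ourUpdateSet.contains(otherVersion)) continue; // we already have it
            requested.add(otherVersion);
        }
        return requested;
    }

    public static void main(String[] args) {
        List<Long> other = Arrays.asList(105L, 104L, 103L, 50L, 49L);
        Set<Long> ours = new HashSet<>(Arrays.asList(105L, 104L, 103L));
        // With completeList == true, the current check lets the old versions through.
        System.out.println(toRequest(other, ours, 100L, true, true));  // [50, 49]
        // The proposed check stops at the first version below the threshold.
        System.out.println(toRequest(other, ours, 100L, true, false)); // []
    }
}
```

In this model the two forms only differ when completeList is true, which is the case described in the rest of this message.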

-----
The inclusion of "!completeList" in this conditional would seem to
only cause some minor performance penalty: replaying a bunch of ADDs
that the syncing replica already has ADDed.

BUT: in our set-up this is causing a noticeable problem. In
particular, we use a large value of nUpdates and we have an hourly DBQ
for garbage collection. If we do rolling restarts of our replicas,
then the second restart can leave us leaderless for a long span of
time.

This happens as follows:
* Replica1 is leader. Replica1 goes down.
* Leadership goes to Replica2. It resyncs with all replicas except Replica1.
* Replica1 returns and resyncs.
* Replica2 is leader. Replica2 goes down.
* Leadership goes to Replica3. It resyncs with all replicas except Replica2.

At this point, Replica1 has a longer updatelog (less trimmed -- more
old updates) than the other replicas. We will refer to these as the
"ancient" updates.
Replica3 does a getVersion from Replica1 and Replica4 and receives
replies from them. The ancient updates will not be contained in
ourUpdateSet. Although the ancient updates are older than
ourLowThreshold, the check is skipped because of the "completeList"
term, which makes no sense to me. So Replica3 replays the ancient ADDs.
Say that 1000 of these ADDs are older than a DBQ in Replica3's update
log. Then the DBQ gets replayed 1000 times ... once after each ADD is
replayed. Fixing the replay mechanism to replay the DBQ only once
looks hard because of the code structure. However, these ADDs (and
hence the DBQ) shouldn't have been replayed at all!
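
The cost described above can be illustrated with a rough counting model. This is only a sketch under stated assumptions, not how Solr implements replay: the class and method names are hypothetical, deletes are assumed to carry negative versions, and every replayed ADD is assumed to trigger one re-application of each newer DBQ in the update log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DbqReplayCostSketch {
    // Counts DBQ re-applications under the assumed model: each replayed ADD
    // is followed by one application of every DBQ that is newer than it.
    static long dbqApplications(List<Long> replayedAdds, List<Long> dbqVersionsInLog) {
        long applications = 0;
        for (long add : replayedAdds) {
            for (long dbq : dbqVersionsInLog) {
                // Compare by absolute value because DBQ versions are assumed negative.
                if (Math.abs(dbq) > Math.abs(add)) applications++;
            }
        }
        return applications;
    }

    public static void main(String[] args) {
        List<Long> ancientAdds = new ArrayList<>();
        for (long v = 1; v <= 1000; v++) ancientAdds.add(v);
        List<Long> dbqs = Collections.singletonList(-5000L);
        // 1000 ancient ADDs older than one DBQ: the DBQ is applied 1000 times.
        System.out.println(dbqApplications(ancientAdds, dbqs));
        // If the ancient ADDs are never replayed, the DBQ is not re-applied at all.
        System.out.println(dbqApplications(new ArrayList<Long>(), dbqs));
    }
}
```

The point of the model is that the work grows with the number of replayed ADDs, so skipping the ancient ADDs removes the repeated DBQ work entirely.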

After the leader Replica3 is synced, it asks Replica1 and Replica4 to
sync to it. The ancient ADDs have now been merged back into Replica3's
update log, so when Replica4 syncs with Replica3,
Replica4 also ends up replaying the ancient ADDs and replaying the DBQ
1000 times.

Only when all of this completes can Replica3 finally perform
its role as leader and accept new updates.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: PeerSync.java: why "completeList" in handleVersions()?

Posted by Ramsey Haddad <ra...@gmail.com>.
My co-worker, Christine Poerschke, pointed out that the "completeList"
term was added in a change described as "restore old deletes via tlog
so peersync won't reorder".

If the goal was only the replay of deletes older than ourLowThreshold,
then keeping that goal doesn't need to interfere with the performance
fix we want. The code could be changed to:

if (!completeList && Math.abs(otherVersion) < ourLowThreshold) break;
if (completeList && 0 < otherVersion && otherVersion < ourLowThreshold) continue;
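
As a rough illustration of how these two lines would behave together, here is a simplified model of the surrounding loop. It is a sketch, not the actual PeerSync code: the class name and the method toRequest are hypothetical, and it assumes that deletes carry negative versions, that ADDs carry positive versions, and that otherVersions is sorted newest-first by absolute value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TwoStepCheckSketch {
    // Simplified model (an assumption, not Solr's code) of the loop with both checks.
    static List<Long> toRequest(List<Long> otherVersions, Set<Long> ourUpdateSet,
                                long ourLowThreshold, boolean completeList) {
        List<Long> requested = new ArrayList<>();
        for (long otherVersion : otherVersions) {
            // Unchanged: without a complete list, stop at the first old entry.
            if (!completeList && Math.abs(otherVersion) < ourLowThreshold) break;
            // New: with a complete list, skip old ADDs but keep scanning for old deletes.
            if (completeList && 0 < otherVersion && otherVersion < ourLowThreshold) continue;
            if (ourUpdateSet.contains(otherVersion)) continue; // we already have it
            requested.add(otherVersion);
        }
        return requested;
    }

    public static void main(String[] args) {
        List<Long> other = Arrays.asList(105L, 104L, -60L, 50L, 49L, -40L);
        Set<Long> ours = new HashSet<>(Arrays.asList(105L, 104L));
        // Old deletes are still requested; old ADDs are skipped.
        System.out.println(toRequest(other, ours, 100L, true));  // [-60, -40]
        // Without a complete list, the loop stops at the first old entry.
        System.out.println(toRequest(other, ours, 100L, false)); // []
    }
}
```

In this model the old deletes are still restored when completeList is true, while the ancient ADDs are no longer replayed.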

