Posted to oak-dev@jackrabbit.apache.org by Julian Sedding <js...@gmail.com> on 2015/08/05 15:35:20 UTC

Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Hi all

I am working on a scenario where I need to copy a SegmentNodeStore
(TarMK) to a DocumentNodeStore (MongoDB).

It is pretty straightforward to simply copy the NodeStore via the
API. No problems here.

In a recent experiment I copied the NodeStore successfully, but got an
exception in the logs (stack trace at the end of this email).

My interpretation is that the AsyncIndexUpdate is trying to retrieve
the previous checkpoint as stored in /:async/async. Of course this
checkpoint is not present in the copied NodeStore and thus cannot be
retrieved.
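For illustration, here is a rough, self-contained model of why the retrieve fails: DocumentMK revision strings have a different shape than SegmentMK's UUID-style checkpoint references, so the parse rejects the copied value. The `r<hex-timestamp>-<counter>-<clusterId>` shape below is an assumption about the format for the sake of the sketch, not the real Oak parser:

```java
// Hedged sketch: a simplified model of why DocumentNodeStore#retrieve fails
// on a checkpoint string written by the SegmentNodeStore. This is NOT the
// real Oak code; the "r<hex>-<counter>-<clusterId>" shape is an assumption.
class RevisionCheck {

    // Returns true if the string looks like a DocumentMK revision,
    // e.g. "r14f3a5def92-0-1"; a SegmentMK checkpoint is a UUID and fails.
    static boolean isDocumentRevision(String s) {
        if (s == null || !s.startsWith("r")) {
            return false;
        }
        String[] parts = s.substring(1).split("-");
        if (parts.length != 3) {
            return false;
        }
        try {
            Long.parseLong(parts[0], 16); // timestamp, hex
            Integer.parseInt(parts[1]);   // counter
            Integer.parseInt(parts[2]);   // cluster id
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A SegmentMK checkpoint (a UUID) does not parse as a DocumentMK
        // revision; in the real code this surfaces as IllegalArgumentException.
        System.out.println(isDocumentRevision("91f7e218-6cf5-4a44-a324-f094c29898e6")); // false
        System.out.println(isDocumentRevision("r14f3a5def92-0-1"));                     // true
    }
}
```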

IMHO it would be desirable to (optionally) copy the checkpoints as
well. In the case of AsyncIndexUpdate, having the checkpoint can save
a full re-index.

The question that remains is how the internal state of
AsyncIndexUpdate should be modified:
* implementing the logic in oak-upgrade would be pragmatic, but
distributes knowledge about AsyncIndexUpdate implementation details to
different modules
* having a CommitHook/Editor in oak-core that can be used in
oak-upgrade might be cleaner, but would only get used in oak-upgrade

Other ideas and opinions regarding this feature are more than welcome!

Regards
Julian


05.08.2015 00:03:19.133 *ERROR* [pool-6-thread-2]
org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception
during job execution of
org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@471e4b4b :
91f7e218-6cf5-4a44-a324-f094c29898e6
java.lang.IllegalArgumentException: 91f7e218-6cf5-4a44-a324-f094c29898e6
        at org.apache.jackrabbit.oak.plugins.document.Revision.fromString(Revision.java:236)
        at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.retrieve(DocumentNodeStore.java:1570)
        at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:279)
        at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Julian Sedding <js...@gmail.com>.
Hi Davide

Of course we need to measure the speed and improvements. I am not sure
how much time I will have to implement benchmarks for this, but will
try.

So far I have not tried the cached Tika full-text extraction. I'm
curious how much gain it can provide, though. It may make re-indexing
so cheap that we need not worry about it any more.

I hadn't really paid attention to OAK-2749, but it sounds interesting.
In a similar but different vein, I was pondering the idea of allowing
multithreaded tar2mongo copies. Since DocumentMK supports clustering,
it should be possible to copy different sub-trees in different
threads?!
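To make the sub-tree idea concrete, here is a minimal sketch, with Map-based "stores" and path-prefix keys as stand-ins for the Oak API (none of the names below are real Oak classes):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the multithreaded tar2mongo idea: since the target can
// accept independent writes, disjoint sub-trees could be copied on
// separate threads. Maps stand in for the source and target NodeStores.
class ParallelSubtreeCopy {

    // Copy every path under subtreeRoot from source to target.
    static void copySubtree(Map<String, String> source,
                            Map<String, String> target, String subtreeRoot) {
        for (Map.Entry<String, String> e : source.entrySet()) {
            if (e.getKey().startsWith(subtreeRoot)) {
                target.put(e.getKey(), e.getValue());
            }
        }
    }

    // One thread per disjoint sub-tree; the target map is concurrent, so
    // the writers never contend on the same keys.
    static Map<String, String> copyInParallel(Map<String, String> source,
                                              List<String> subtreeRoots) {
        Map<String, String> target = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(subtreeRoots.size());
        for (String root : subtreeRoots) {
            pool.execute(() -> copySubtree(source, target, root));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return target;
    }
}
```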

It would indeed be interesting to have a chat some time! But first
I'll be on holidays for two weeks :)

Regards
Julian



On Wed, Aug 5, 2015 at 8:57 PM, Davide Giannella <da...@apache.org> wrote:
> On 05/08/2015 17:45, Julian Sedding wrote:
>> ...
>>
>> My aim is to reduce the critical path for migrating one NodeStore
>> (incl JR2) to another. Indexing (especially async indexing) takes up a
>> big part of the time, so if I can move that out of the critical path,
>> it can save a lot of downtime.
>
> Interesting. I know async index can be lengthy but it would be very
> interesting if we could measure what we have now and the improvements
> we're making.
>
> The slowest part of the async index is normally the full-text extraction,
> as it runs in a single thread. With
> https://issues.apache.org/jira/browse/OAK-2749 we provided a mechanism
> (not used yet AFAIK) to run different indexers on different threads.
> Maybe it's something you would like to experiment with as well to speed
> up the indexing.
>
> If you want, ping me on chat tomorrow morning (CEST) so we can quickly
> see what we can do here. But I think we should start measuring it first :)
>
> Cheers
> Davide
>
>
>

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Davide Giannella <da...@apache.org>.
On 05/08/2015 17:45, Julian Sedding wrote:
> ...
>
> My aim is to reduce the critical path for migrating one NodeStore
> (incl JR2) to another. Indexing (especially async indexing) takes up a
> big part of the time, so if I can move that out of the critical path,
> it can save a lot of downtime.

Interesting. I know async index can be lengthy but it would be very
interesting if we could measure what we have now and the improvements
we're making.

The slowest part of the async index is normally the full-text extraction,
as it runs in a single thread. With
https://issues.apache.org/jira/browse/OAK-2749 we provided a mechanism
(not used yet AFAIK) to run different indexers on different threads.
Maybe it's something you would like to experiment with as well to speed
up the indexing.

If you want, ping me on chat tomorrow morning (CEST) so we can quickly
see what we can do here. But I think we should start measuring it first :)

Cheers
Davide




Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Alex Parvulescu <al...@gmail.com>.
Hi Julian,

On Thu, Aug 6, 2015 at 3:14 PM, Julian Sedding <js...@gmail.com> wrote:

> Hi Alex
>
> See inline.
>
> On Wed, Aug 5, 2015 at 7:57 PM, Alex Parvulescu
> <al...@gmail.com> wrote:
> > Hi,
> >
> > see inline
> >
> > On Wed, Aug 5, 2015 at 5:45 PM, Julian Sedding <js...@gmail.com>
> wrote:
> >
> >> Hi Alex
> >>
> >> Thanks for your comments.
> >>
> >> On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
> >> <al...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Just a few clarifications on the error you see
> >> >
> >> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> >> > the previous checkpoint as stored in /:async/async. Of course this
> >> > checkpoint is not present in the copied NodeStore and thus cannot be
> >> > retrieved.
> >> >
> >> > The error comes from DocumentMk trying to parse the reference
> checkpoint
> >> > value. Basically what fails here is 'Revision.fromString' receiving a
> >> > malformed checkpoint value because it comes from the SegmentMk. The
> quick
> >> > fix is to manually remove the properties on the "/:async" hidden node.
> >> This
> >> > will indeed trigger a full reindex, but will help you getting over
> this
> >> > issue.
> >>
> >> Agreed. In this case parsing the revision is the first thing that
> >> fails. When copying DNS to SNS a similar situation would arise,
> >> because no snapshot with the provided ID exists.
> >>
> >>
> > [alex] Not really, as the SegmentMk will not fail (no
> > IllegalArgumentException), but simply log a warning that the checkpoint
> > doesn't exist and perform a full reindex. So in this regard it is a bit
> > more lenient :)
>
> Ok, I didn't know that SegmentMK is more lenient here. Should we make
> DocumentMK degrade gracefully as well? Currently the AsyncIndex does
> not recover by itself. It would be more robust if it did.
>

I think this falls under 'nice to have' rather than a really needed
change. We are dealing with a specific case here, and ideally the sidegrade
process would take care of removing the checkpoint reference (or setting it
to a new value, depending on availability).


>
> >
> >
> >
> >> >
> >> >> IMHO it would be desirable to (optionally) copy the checkpoints as
> >> > well. In the case of AsyncIndexUpdate, having the checkpoint can save
> >> > a full re-index.
> >> >
> >> > This is very tricky, as the 2 representations of checkpoints between
> >> > SegmentMk and DocumentMk are quite different. I would strongly suggest
> >> > going for the reindex, after all you'd only migrate once, so you can
> >> > prepare for this lengthy process.
> >>
> >> I'm experimenting with the following approach:
> >> * retrieve the first checkpoint and copy the NodeState tree at that
> >> revision (available via CheckpointMBean impls)
> >> * after copying the tree, merge and create a checkpoint (expiration
> >> time can be calculated)
> >> * rinse and repeat until the head revision is reached
> >>
> >> My aim is to reduce the critical path for migrating one NodeStore
> >> (incl JR2) to another. Indexing (especially async indexing) takes up a
> >> big part of the time, so if I can move that out of the critical path,
> >> it can save a lot of downtime.
> >>
> >
> > [alex] interesting approach. I would only reduce this to the 'current'
> > indexed checkpoint (the async reference). So you'd migrate that over
> first
> > as the head state, create a checkpoint based on it (let's call it 'c0').
> > then diff&apply the SegmentMk head state on top of this. update the async
> > property to point to c0 and you might be good.
>
> Absolutely, only copying the checkpoints that are actually needed makes
> sense.
>
> Thinking out loud: it may be faster to run the async-index in the copy
> process, based on the diff from the source NodeStore between the
> checkpoint and the head. That should be feasible, right?
>

It should be doable, yes. But if I understand correctly, you'd like the
overall duration to decrease, not increase :) Running the asyncs as sync
would only add more time to the migration.



>
> >
> >
> >
> >>
> >> My current approach for a migration from JR2 to MongoMK is to:
> >> * copy JR2 to TarMK (TarMK is a lot faster for creating indexes etc.
> >> than MongoMK)
> >> * repeat JR2 to TarMK copy every week or every 24h using incremental
> >> copy. this saves on CommitHook execution time - in theory this can
> >> reduce the time for one run to a single full repository traversal.
> >> * finally on the day when the systems should be switched over, run a
> >> last JR2 to TarMK and then a TarMK to MongoMK copy. this is the
> >> critical path.
> >>
> >
> > [alex] Always going through the SegmentMk seems a bit convoluted. Why not
> > do the migration once, then apply the diffs on top of MongoMk directly
> > (AFAIK we have support for incremental updates now)? Are the 24h diffs so
> > big that it makes it unusable/unacceptable to go to MongoMk directly?
> (I'd
> > like to see this backed by some numbers).
>
> We definitely need numbers. I aim to do some experiments after my
> holidays and provide some numbers at the beginning of September.
>
> Incremental upgrades definitely yield a huge benefit on the
> critical path with SegmentMK; I don't know about DocumentMK yet.
>
> Regarding incremental upgrades, I do have some numbers. The scenario is
> the migration from JR2 (TarPM) to Oak TarMK, copying 2.6mio regular
> nodes and 5.7mio versions (versions are copied via commit editor, see
> OAK-2776 "copy all referenced versions"). The first source repository
> is 23 days older than the second source repository, i.e. the delta is
> ~3 weeks of content editing of a live website.
>
> initial upgrade
> - copy time: ~6min (2.6mio regular nodes) + ~30min (5.7mio referenced
> versions)
> - index creation (synchronous indexes): ~2h 20min
> - total time: ~2h 57min
>
> incremental upgrade (with OAK-3163 applied)
> - copy/compare time: ~9min (2.6mio regular nodes) + ~6.5min (0.7mio
> new/modified referenced versions)
> - index creation (synchronous indexes): ~7min
> - total time: ~23min
>
> finally copy from TarMK to MongoMK
> - total: ~3h 7min
>


Thanks for sharing the numbers!

Can you also add some info about the async indexes? How long does it take
for them to finish?

One quick item to consider is reevaluating the indexes that you are
building and maintaining during the main and incremental upgrades. There's
no greater waste of resources during migration than building a few indexes
only to throw them away later when the repo starts up. Depending on what
tools you use, I strongly suggest removing indexes as early as possible
(even if you mark an index as disabled, the TarMK -> MongoMK copy will
still move the unneeded content over).


best,
alex


>
> >
> >
> > hope this helps,
> > alex
>
> Regards
> Julian
>
> >
> >
> >
> >
> >> Due to the above, copying at least the checkpoint of the async index
> >> will likely speed up the critical path. Of course measuring execution
> >> times will provide the definitive answer to this question.
> >>
> >> Regards
> >> Julian
> >>

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Julian Sedding <js...@gmail.com>.
Hi Alex

See inline.

On Wed, Aug 5, 2015 at 7:57 PM, Alex Parvulescu
<al...@gmail.com> wrote:
> Hi,
>
> see inline
>
> On Wed, Aug 5, 2015 at 5:45 PM, Julian Sedding <js...@gmail.com> wrote:
>
>> Hi Alex
>>
>> Thanks for your comments.
>>
>> On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
>> <al...@gmail.com> wrote:
>> > Hi,
>> >
>> > Just a few clarifications on the error you see
>> >
>> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
>> > the previous checkpoint as stored in /:async/async. Of course this
>> > checkpoint is not present in the copied NodeStore and thus cannot be
>> > retrieved.
>> >
>> > The error comes from DocumentMk trying to parse the reference checkpoint
>> > value. Basically what fails here is 'Revision.fromString' receiving a
>> > malformed checkpoint value because it comes from the SegmentMk. The quick
>> > fix is to manually remove the properties on the "/:async" hidden node.
>> This
>> > will indeed trigger a full reindex, but will help you getting over this
>> > issue.
>>
>> Agreed. In this case parsing the revision is the first thing that
>> fails. When copying DNS to SNS a similar situation would arise,
>> because no snapshot with the provided ID exists.
>>
>>
> [alex] Not really, as the SegmentMk will not fail (no
> IllegalArgumentException), but simply log a warning that the checkpoint
> doesn't exist and perform a full reindex. So in this regard it is a bit
> more lenient :)

Ok, I didn't know that SegmentMK is more lenient here. Should we make
DocumentMK degrade gracefully as well? Currently the AsyncIndex does
not recover by itself. It would be more robust if it did.
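A minimal sketch of what "degrading gracefully" could look like, using plain stand-in types instead of the real NodeStore API (the `r...`-style revision format is an assumption for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the "degrade gracefully" suggestion: a retrieve() that
// treats an unparseable checkpoint reference as a missing checkpoint
// (returning null) instead of propagating IllegalArgumentException, so
// the async indexer can fall back to a full reindex and recover on its own.
class LenientRetrieve {

    static final Map<String, String> CHECKPOINTS = new HashMap<>();

    // Stand-in for Revision.fromString: accepts only "r..."-style strings
    // (an assumed format) and throws on anything else, e.g. a UUID.
    static String parseRevision(String checkpoint) {
        if (checkpoint == null || !checkpoint.startsWith("r")) {
            throw new IllegalArgumentException(checkpoint);
        }
        return checkpoint;
    }

    // Lenient variant: a foreign checkpoint reference behaves like a
    // missing one instead of failing the async index job forever.
    static String retrieve(String checkpoint) {
        String revision;
        try {
            revision = parseRevision(checkpoint);
        } catch (IllegalArgumentException e) {
            return null;
        }
        return CHECKPOINTS.get(revision);
    }
}
```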

>
>
>
>> >
>> >> IMHO it would be desirable to (optionally) copy the checkpoints as
>> > well. In the case of AsyncIndexUpdate, having the checkpoint can save
>> > a full re-index.
>> >
>> > This is very tricky, as the 2 representations of checkpoints between
>> > SegmentMk and DocumentMk are quite different. I would strongly suggest
>> > going for the reindex, after all you'd only migrate once, so you can
>> > prepare for this lengthy process.
>>
>> I'm experimenting with the following approach:
>> * retrieve the first checkpoint and copy the NodeState tree at that
>> revision (available via CheckpointMBean impls)
>> * after copying the tree, merge and create a checkpoint (expiration
>> time can be calculated)
>> * rinse and repeat until the head revision is reached
>>
>> My aim is to reduce the critical path for migrating one NodeStore
>> (incl JR2) to another. Indexing (especially async indexing) takes up a
>> big part of the time, so if I can move that out of the critical path,
>> it can save a lot of downtime.
>>
>
> [alex] interesting approach. I would only reduce this to the 'current'
> indexed checkpoint (the async reference). So you'd migrate that over first
> as the head state, create a checkpoint based on it (let's call it 'c0').
> then diff&apply the SegmentMk head state on top of this. update the async
> property to point to c0 and you might be good.

Absolutely, only copying the checkpoints that are actually needed makes sense.

Thinking out loud: it may be faster to run the async-index in the copy
process, based on the diff from the source NodeStore between the
checkpoint and the head. That should be feasible, right?

>
>
>
>>
>> My current approach for a migration from JR2 to MongoMK is to:
>> * copy JR2 to TarMK (TarMK is a lot faster for creating indexes etc.
>> than MongoMK)
>> * repeat JR2 to TarMK copy every week or every 24h using incremental
>> copy. this saves on CommitHook execution time - in theory this can
>> reduce the time for one run to a single full repository traversal.
>> * finally on the day when the systems should be switched over, run a
>> last JR2 to TarMK and then a TarMK to MongoMK copy. this is the
>> critical path.
>>
>
> [alex] Always going through the SegmentMk seems a bit convoluted. Why not
> do the migration once, then apply the diffs on top of MongoMk directly
> (AFAIK we have support for incremental updates now)? Are the 24h diffs so
> big that it makes it unusable/unacceptable to go to MongoMk directly? (I'd
> like to see this backed by some numbers).

We definitely need numbers. I aim to do some experiments after my
holidays and provide some numbers at the beginning of September.

Incremental upgrades definitely yield a huge benefit on the
critical path with SegmentMK; I don't know about DocumentMK yet.

Regarding incremental upgrades, I do have some numbers. The scenario is
the migration from JR2 (TarPM) to Oak TarMK, copying 2.6mio regular
nodes and 5.7mio versions (versions are copied via commit editor, see
OAK-2776 "copy all referenced versions"). The first source repository
is 23 days older than the second source repository, i.e. the delta is
~3 weeks of content editing of a live website.

initial upgrade
- copy time: ~6min (2.6mio regular nodes) + ~30min (5.7mio referenced versions)
- index creation (synchronous indexes): ~2h 20min
- total time: ~2h 57min

incremental upgrade (with OAK-3163 applied)
- copy/compare time: ~9min (2.6mio regular nodes) + ~6.5min (0.7mio
new/modified referenced versions)
- index creation (synchronous indexes): ~7min
- total time: ~23min

finally copy from TarMK to MongoMK
- total: ~3h 7min

>
>
> hope this helps,
> alex

Regards
Julian

>
>
>
>
>> Due to the above, copying at least the checkpoint of the async index
>> will likely speed up the critical path. Of course measuring execution
>> times will provide the definitive answer to this question.
>>
>> Regards
>> Julian
>>
>> >
>> > best,
>> > alex

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Alex Parvulescu <al...@gmail.com>.
Hi,

see inline

On Wed, Aug 5, 2015 at 5:45 PM, Julian Sedding <js...@gmail.com> wrote:

> Hi Alex
>
> Thanks for your comments.
>
> On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
> <al...@gmail.com> wrote:
> > Hi,
> >
> > Just a few clarifications on the error you see
> >
> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> > the previous checkpoint as stored in /:async/async. Of course this
> > checkpoint is not present in the copied NodeStore and thus cannot be
> > retrieved.
> >
> > The error comes from DocumentMk trying to parse the reference checkpoint
> > value. Basically what fails here is 'Revision.fromString' receiving a
> > malformed checkpoint value because it comes from the SegmentMk. The quick
> > fix is to manually remove the properties on the "/:async" hidden node.
> This
> > will indeed trigger a full reindex, but will help you getting over this
> > issue.
>
> Agreed. In this case parsing the revision is the first thing that
> fails. When copying DNS to SNS a similar situation would arise,
> because no snapshot with the provided ID exists.
>
>
[alex] Not really, as the SegmentMk will not fail (no
IllegalArgumentException), but simply log a warning that the checkpoint
doesn't exist and perform a full reindex. So in this regard it is a bit
more lenient :)



> >
> >> IMHO it would be desirable to (optionally) copy the checkpoints as
> > well. In the case of AsyncIndexUpdate, having the checkpoint can save
> > a full re-index.
> >
> > This is very tricky, as the 2 representations of checkpoints between
> > SegmentMk and DocumentMk are quite different. I would strongly suggest
> > going for the reindex, after all you'd only migrate once, so you can
> > prepare for this lengthy process.
>
> I'm experimenting with the following approach:
> * retrieve the first checkpoint and copy the NodeState tree at that
> revision (available via CheckpointMBean impls)
> * after copying the tree, merge and create a checkpoint (expiration
> time can be calculated)
> * rinse and repeat until the head revision is reached
>
> My aim is to reduce the critical path for migrating one NodeStore
> (incl JR2) to another. Indexing (especially async indexing) takes up a
> big part of the time, so if I can move that out of the critical path,
> it can save a lot of downtime.
>

[alex] Interesting approach. I would only reduce this to the 'current'
indexed checkpoint (the async reference). So you'd migrate that over first
as the head state, create a checkpoint based on it (let's call it 'c0'),
then diff & apply the SegmentMk head state on top of this. Update the async
property to point to c0 and you might be good.
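As a toy illustration of that sequence, with Maps standing in for NodeState trees and checkpoint() modeled as snapshotting the head (the real code would use NodeStore#checkpoint and a proper content diff):

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the "c0" approach: copy the indexed checkpoint state as
// the target head, checkpoint it, overlay the source head, and point the
// async reference at the new checkpoint. Stand-in types, not Oak API.
class CheckpointFirstMigration {

    final Map<String, String> head = new HashMap<>();                 // target head
    final Map<String, Map<String, String>> checkpoints = new HashMap<>();
    String asyncRef;                                                  // models /:async/async

    // Modeled checkpoint: snapshot the current head under a fresh id.
    String checkpoint() {
        String id = "c" + checkpoints.size();
        checkpoints.put(id, new HashMap<>(head));
        return id;
    }

    void migrate(Map<String, String> indexedState, Map<String, String> sourceHead) {
        head.putAll(indexedState); // 1. copy the already-indexed checkpoint state
        asyncRef = checkpoint();   // 2. checkpoint it => "c0"
        head.putAll(sourceHead);   // 3. "diff & apply" the source head (an overlay here)
        // 4. the async indexer now only diffs checkpoints.get(asyncRef)
        //    against head, instead of reindexing from scratch
    }
}
```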



>
> My current approach for a migration from JR2 to MongoMK is to:
> * copy JR2 to TarMK (TarMK is a lot faster for creating indexes etc.
> than MongoMK)
> * repeat JR2 to TarMK copy every week or every 24h using incremental
> copy. this saves on CommitHook execution time - in theory this can
> reduce the time for one run to a single full repository traversal.
> * finally on the day when the systems should be switched over, run a
> last JR2 to TarMK and then a TarMK to MongoMK copy. this is the
> critical path.
>

[alex] Always going through the SegmentMk seems a bit convoluted. Why not
do the migration once, then apply the diffs on top of MongoMk directly
(AFAIK we have support for incremental updates now)? Are the 24h diffs so
big that it makes it unusable/unacceptable to go to MongoMk directly? (I'd
like to see this backed by some numbers).


hope this helps,
alex




> Due to the above, copying at least the checkpoint of the async index
> will likely speed up the critical path. Of course measuring execution
> times will provide the definitive answer to this question.
>
> Regards
> Julian
>
> >
> > best,
> > alex
> >

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Julian Sedding <js...@gmail.com>.
Hi Alex

Thanks for your comments.

On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
<al...@gmail.com> wrote:
> Hi,
>
> Just a few clarifications on the error you see
>
>> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> the previous checkpoint as stored in /:async/async. Of course this
> checkpoint is not present in the copied NodeStore and thus cannot be
> retrieved.
>
> The error comes from DocumentMk trying to parse the reference checkpoint
> value. Basically what fails here is 'Revision.fromString' receiving a
> malformed checkpoint value because it comes from the SegmentMk. The quick
> fix is to manually remove the properties on the "/:async" hidden node. This
> will indeed trigger a full reindex, but will help you get over this
> issue.

Agreed. In this case parsing the revision is the first thing that
fails. When copying a DocumentNodeStore to a SegmentNodeStore, a similar
situation would arise, because no snapshot with the provided ID exists.

>
>> IMHO it would be desirable to (optionally) copy the checkpoints as
> well. In the case of AsyncIndexUpdate, having the checkpoint can save
> a full re-index.
>
> This is very tricky, as the 2 representations of checkpoints between
> SegmentMk and DocumentMk are quite different. I would strongly suggest
> going for the reindex, after all you'd only migrate once, so you can
> prepare for this lengthy process.

I'm experimenting with the following approach:
* retrieve the first checkpoint and copy the NodeState tree at that
revision (available via CheckpointMBean impls)
* after copying the tree, merge and create a checkpoint (expiration
time can be calculated)
* rinse and repeat until the head revision is reached
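
The loop above can be sketched with a tiny in-memory stand-in for a
NodeStore. Note this is only a sketch: the ToyStore class below merely
mirrors the shape of Oak's NodeStore#checkpoint and NodeStore#retrieve,
and the "copy the tree" step is reduced to assigning a string, where the
real implementation would apply the retrieved NodeState onto the target
via a NodeBuilder and merge.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal stand-in for a NodeStore: a head "revision" (here just a string
// payload) plus named checkpoints. Not the real Oak API.
class ToyStore {
    String head = "";
    final Map<String, String> checkpoints = new LinkedHashMap<>();

    String checkpoint(String state) {        // snapshot the given state
        String id = "cp-" + checkpoints.size();
        checkpoints.put(id, state);
        return id;
    }

    String retrieve(String id) {
        return checkpoints.get(id);
    }
}

public class SidegradeSketch {
    // Copy every checkpointed state of the source in order, re-creating a
    // corresponding checkpoint in the target, then copy the head state last.
    static List<String> sidegrade(ToyStore source, ToyStore target) {
        List<String> newIds = new ArrayList<>();
        for (String id : source.checkpoints.keySet()) {
            String state = source.retrieve(id);
            target.head = state;                  // "copy the tree" at that revision
            newIds.add(target.checkpoint(state)); // re-create the checkpoint
        }
        target.head = source.head;                // finally copy the head revision
        return newIds;
    }

    public static void main(String[] args) {
        ToyStore source = new ToyStore();
        source.checkpoint("index-state-1");
        source.checkpoint("index-state-2");
        source.head = "head-state";

        ToyStore target = new ToyStore();
        List<String> ids = sidegrade(source, target);
        System.out.println(ids.size());                  // 2
        System.out.println(target.head);                 // head-state
        System.out.println(target.retrieve(ids.get(1))); // index-state-2
    }
}
```

The important property is that each checkpoint in the target references a
state the target has actually seen, so a later diff against it is valid.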

My aim is to reduce the critical path for migrating one NodeStore
(incl. JR2) to another. Indexing (especially async indexing) takes
up a big part of the time, so if I can move that out of the critical
path, it can save a lot of downtime.

My current approach for a migration from JR2 to MongoMK is to:
* copy JR2 to TarMK (TarMK is a lot faster than MongoMK for creating
indexes etc.)
* repeat the JR2 to TarMK copy every week or every 24h using incremental
copy. This saves on CommitHook execution time - in theory it can
reduce the time for one run to a single full repository traversal.
* finally, on the day when the systems should be switched over, run a
last JR2 to TarMK copy and then a TarMK to MongoMK copy. This is the
critical path.

Due to the above, copying at least the checkpoint of the async index
will likely speed up the critical path. Of course measuring execution
times will provide the definitive answer to this question.

Regards
Julian

>
> best,
> alex
>
>
> On Wed, Aug 5, 2015 at 3:35 PM, Julian Sedding <js...@gmail.com> wrote:
>
>> Hi all
>>
>> I am working on a scenario, where I need to copy a SegmentNodeStore
>> (TarMK) to a DocumentNodeStore (MongoDB).
>>
>> It is pretty straight forward to simply copy the NodeStore via the
>> API. No problems here.
>>
>> In a recent experiment I successfully copied the NodeStore and got an
>> exception in the logs (stacktrace below the email).
>>
>> My interpretation is that the AsyncIndexUpdate is trying to retrieve
>> the previous checkpoint as stored in /:async/async. Of course this
>> checkpoint is not present in the copied NodeStore and thus cannot be
>> retrieved.
>>
>> IMHO it would be desirable to (optionally) copy the checkpoints as
>> well. In the case of AsyncIndexUpdate, having the checkpoint can save
>> a full re-index.
>>
>> The question that remains is how the internal state of
>> AsyncIndexUpdate should be modified:
>> * implementing the logic in oak-upgrade would be pragmatic, but
>> distributes knowledge about AsyncIndexUpdate implementation details to
>> different modules
>> * having a CommitHook/Editor in oak-core that can be used in
>> oak-upgrade might be cleaner, but would only get used in oak-upgrade
>>
>> Other ideas and opinions regarding this feature are more than welcome!
>>
>> Regards
>> Julian
>>
>>
>> 05.08.2015 00:03:19.133 *ERROR* [pool-6-thread-2]
>> org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception
>> during job execution of
>> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@471e4b4b :
>> 91f7e218-6cf5-4a44-a324-f094c29898e6
>> java.lang.IllegalArgumentException: 91f7e218-6cf5-4a44-a324-f094c29898e6
>>         at
>> org.apache.jackrabbit.oak.plugins.document.Revision.fromString(Revision.java:236)
>>         at
>> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.retrieve(DocumentNodeStore.java:1570)
>>         at
>> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:279)
>>         at
>> org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
>>         at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>>

Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Posted by Alex Parvulescu <al...@gmail.com>.
Hi,

Just a few clarifications on the error you see

> My interpretation is that the AsyncIndexUpdate is trying to retrieve
the previous checkpoint as stored in /:async/async. Of course this
checkpoint is not present in the copied NodeStore and thus cannot be
retrieved.

The error comes from DocumentMk trying to parse the reference checkpoint
value. Basically what fails here is 'Revision.fromString' receiving a
malformed checkpoint value because it comes from the SegmentMk. The quick
fix is to manually remove the properties on the "/:async" hidden node. This
will indeed trigger a full reindex, but will help you get over this
issue.
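
For illustration, the effect of that quick fix can be sketched in plain
Java, with a Map standing in for the hidden ":async" node's properties.
This is a stand-in only: in Oak the actual fix would remove the property
through a NodeBuilder on the hidden node and merge the change back.

```java
import java.util.HashMap;
import java.util.Map;

public class AsyncLaneReset {
    // Drop the stale checkpoint reference for a given async lane. The next
    // AsyncIndexUpdate run then has no base checkpoint to diff against and
    // falls back to a full reindex.
    static void reset(Map<String, String> asyncNode, String lane) {
        asyncNode.remove(lane);
    }

    public static void main(String[] args) {
        Map<String, String> asyncNode = new HashMap<>();
        // the malformed (SegmentMk-style) checkpoint reference from the log
        asyncNode.put("async", "91f7e218-6cf5-4a44-a324-f094c29898e6");
        reset(asyncNode, "async");
        System.out.println(asyncNode.containsKey("async")); // false
    }
}
```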

> IMHO it would be desirable to (optionally) copy the checkpoints as
well. In the case of AsyncIndexUpdate, having the checkpoint can save
a full re-index.

This is very tricky, as the two representations of checkpoints in
SegmentMk and DocumentMk are quite different. I would strongly suggest
going for the reindex; after all you'd only migrate once, so you can
prepare for this lengthy process.

best,
alex


On Wed, Aug 5, 2015 at 3:35 PM, Julian Sedding <js...@gmail.com> wrote:

> Hi all
>
> I am working on a scenario, where I need to copy a SegmentNodeStore
> (TarMK) to a DocumentNodeStore (MongoDB).
>
> It is pretty straight forward to simply copy the NodeStore via the
> API. No problems here.
>
> In a recent experiment I successfully copied the NodeStore and got an
> exception in the logs (stacktrace below the email).
>
> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> the previous checkpoint as stored in /:async/async. Of course this
> checkpoint is not present in the copied NodeStore and thus cannot be
> retrieved.
>
> IMHO it would be desirable to (optionally) copy the checkpoints as
> well. In the case of AsyncIndexUpdate, having the checkpoint can save
> a full re-index.
>
> The question that remains is how the internal state of
> AsyncIndexUpdate should be modified:
> * implementing the logic in oak-upgrade would be pragmatic, but
> distributes knowledge about AsyncIndexUpdate implementation details to
> different modules
> * having a CommitHook/Editor in oak-core that can be used in
> oak-upgrade might be cleaner, but would only get used in oak-upgrade
>
> Other ideas and opinions regarding this feature are more than welcome!
>
> Regards
> Julian
>
>
> 05.08.2015 00:03:19.133 *ERROR* [pool-6-thread-2]
> org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception
> during job execution of
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@471e4b4b :
> 91f7e218-6cf5-4a44-a324-f094c29898e6
> java.lang.IllegalArgumentException: 91f7e218-6cf5-4a44-a324-f094c29898e6
>         at
> org.apache.jackrabbit.oak.plugins.document.Revision.fromString(Revision.java:236)
>         at
> org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.retrieve(DocumentNodeStore.java:1570)
>         at
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:279)
>         at
> org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
>         at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>