
Posted to oak-dev@jackrabbit.apache.org by Davide Giannella <da...@apache.org> on 2014/08/26 12:04:18 UTC

reindex improvements

Hello team,

when we trigger a reindex by setting `reindex=true` on the index
definition, the algorithm scans the whole repository and issues "node
modified/added" events to the specified index.

While this works with small repositories it doesn't really scale with
big ones.

To take an extreme example: we have a repository with 2 million nodes,
only 1 of which has the required property. The reindex will keep going
until all 2 million nodes have been scanned. And with very active
repositories, where a lot of nodes change, manually or not, we could
have a virtually endless reindexing.

Based on my experience with content repositories, clients are normally
interested in querying only parts of them, for example /content.

I was thinking it could add value to have an additional property on the
index definition: reindexPaths (multi-valued String).

When this property is specified, the reindex will happen only on those
paths, in the order they are specified, and it could potentially make
the content indexed so far available to the query engine, returning
partial results as each path is completed.

A single entry could be a plain path or a glob/regex. I'm for using a Java
regex, as it gives the end user a lot of power for fine-tuning, but on the
other hand regex evaluation is pretty slow...
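
To make the plain-path versus regex trade-off concrete, here is a minimal Java sketch of how a reindexPaths entry could be matched against a node path. `ReindexPathFilter` and its methods are hypothetical names for illustration, not existing Oak API. The sketch assumes each entry is either a path prefix or a Java regex that must match the full node path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Oak: decides whether a node path is
// covered by one of the configured reindexPaths entries.
final class ReindexPathFilter {
    private final List<String> prefixes = new ArrayList<>();   // plain paths, e.g. "/content"
    private final List<Pattern> patterns = new ArrayList<>();  // regexes, compiled once

    ReindexPathFilter(List<String> plainPaths, List<String> regexes) {
        prefixes.addAll(plainPaths);
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // True if the path equals a plain entry, lies below one, or fully matches a regex.
    boolean includes(String path) {
        for (String prefix : prefixes) {
            String withSlash = prefix.endsWith("/") ? prefix : prefix + "/";
            if (path.equals(prefix) || path.startsWith(withSlash)) {
                return true;
            }
        }
        for (Pattern pattern : patterns) {
            if (pattern.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Compiling the patterns once avoids recompiling them per node, but every traversed node path would still be tested against every pattern, which is where the regex cost mentioned above would show up.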

thoughts?

Cheers
Davide

Re: reindex improvements

Posted by Davide Giannella <da...@apache.org>.

On 26/08/2014 11:27, Nicolas Peltier wrote:
> Hi Davide, 
>
> this would be nice indeed! wouldn’t that be “indexPath”, not “re-indexPath” ?
>
I'd rather keep a sort of "namespace" in the property naming. With
`reindexPath` it should be clear that it relates only to reindexing, and
that if the index is global (under /oak:index) it will still index the
whole repository.

Any other opinions? I'm not yet convinced by my own idea. Something
about it smells to me. :)

Cheers
Davide

Re: reindex improvements

Posted by Nicolas Peltier <np...@adobe.com>.

Hi Davide, 

this would be nice indeed! wouldn’t that be “indexPath”, not “re-indexPath” ?

Nicolas

Re: reindex improvements

Posted by Davide Giannella <da...@apache.org>.

On 26/08/2014 15:10, Justin Edelson wrote:
>
> In this case, I think Thomas's suggestion makes much more sense. Let's
> just add a property to the QID which allows an index to be restricted
> to particular paths.
>
>
With OAK-1980 it should be possible to deliver ordered indexes under
specific paths. I don't think, though, that the reindex process takes
that into consideration. I'd have to check the code :)

D.

Re: reindex improvements

Posted by Alexander Klimetschek <ak...@adobe.com>.

As a user of such an index, I do expect the index to properly update itself. Adding configuration to make that "faster" at the cost of index correctness doesn't help.

If the index works asynchronously and might take some time to be up to date, we need to clearly document this. And first check if that's ok.

AFAIR, with Jackrabbit 2 we guaranteed immediate index update upon session.save() for everything but fulltext which might take longer due to text extraction. This is ok since all "programmatic queries" that might have the requirement to work correctly immediately after content changes would be unlikely to include a fuzzy fulltext search (jcr:contains), as opposed to end user searches on a website for example.

Cheers,
Alex


Re: reindex improvements

Posted by Davide Giannella <da...@apache.org>.

On 26/08/2014 15:10, Justin Edelson wrote:
> ...
> In this case, I think Thomas's suggestion makes much more sense. Let's
> just add a property to the QID which allows an index to be restricted
> to particular paths.
>
As I said previously, something in my idea was not convincing me,
hence I started the discussion here :)

After this discussion I'm also for keeping the reindex as it is. We
could, if it doesn't do so already, enhance it to make sure that if an
index definition is under a specific path, only that path is traversed
rather than the whole repo.
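
As a rough illustration of that enhancement, the traversal root could be derived from where the index definition lives. The Java sketch below is only an assumption about how this might look: `IndexScope` and `traversalRoot` are hypothetical names, and it assumes definitions are stored as `<subtree>/oak:index/<name>`.

```java
// Hypothetical helper, not existing Oak code: derives the subtree a reindex
// would need to traverse from the location of the index definition.
final class IndexScope {
    static final String INDEX_CONTAINER = "oak:index";

    // "/oak:index/foo" -> "/", "/content/oak:index/foo" -> "/content"
    static String traversalRoot(String indexDefinitionPath) {
        int pos = indexDefinitionPath.indexOf("/" + INDEX_CONTAINER + "/");
        if (pos < 0) {
            throw new IllegalArgumentException(
                    "Not an index definition path: " + indexDefinitionPath);
        }
        // A definition directly under the root covers the whole repository.
        return pos == 0 ? "/" : indexDefinitionPath.substring(0, pos);
    }
}
```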

Nevertheless, we have some use cases where we could need a workaround
like this, and I thought of a Groovy script to be used in the oak-console.

This script would take as input a list of paths to traverse and would
reindex only those. We could have cases where an index is defined at
root level but the end user knows that at the current stage it's
actually used by only part of the content tree.

This way it won't be part of the core but a utility that someone can
use or not.

Thoughts?

Regards
Davide

Re: reindex improvements

Posted by Justin Edelson <ju...@justinedelson.com>.

Hi,

On Tue, Aug 26, 2014 at 10:01 AM, Davide Giannella <da...@apache.org> wrote:
> On 26/08/2014 14:13, Justin Edelson wrote:
>> Hi Davide,
>> So what would happen to the already-indexed content which wasn't in
>> one of the reindexPaths?
>>
>> For example, let's say I'm building an index of a property called
>> "keywords". In the repo, I have:
>>
>> /content/foo@keywords=something
>> /content/bar/one@keywords=something
>> /content/bar/two@keywords=something
>>
>> And then I trigger a reindex with reindexPaths = /content/bar.
>>
>> Would //element(*)[@keywords='something'] still return /content/foo ?
>>
> In my idea no.
>
> Currently when reindexing the :index node, where the actual index is
> stored, is deleted and recreated.
>
> I would keep the same approach. I'm thinking of this as an advanced
> feature that someone has to know how to use it. So in the above example
> I would specify either: /content or /content/bar, /content/foo.
>
> It's a dangerous thing though. I can see it. :)

In this case, I think Thomas's suggestion makes much more sense. Let's
just add a property to the QID which allows an index to be restricted
to particular paths.

Regards,
Justin


Re: reindex improvements

Posted by Davide Giannella <da...@apache.org>.

On 26/08/2014 14:13, Justin Edelson wrote:
> Hi Davide,
> So what would happen to the already-indexed content which wasn't in
> one of the reindexPaths?
>
> For example, let's say I'm building an index of a property called
> "keywords". In the repo, I have:
>
> /content/foo@keywords=something
> /content/bar/one@keywords=something
> /content/bar/two@keywords=something
>
> And then I trigger a reindex with reindexPaths = /content/bar.
>
> Would //element(*)[@keywords='something'] still return /content/foo ?
>
In my idea, no.

Currently, when reindexing, the :index node, where the actual index is
stored, is deleted and recreated.

I would keep the same approach. I'm thinking of this as an advanced
feature that one has to know how to use. So in the above example
I would specify either /content, or both /content/bar and /content/foo.

It's a dangerous thing though. I can see that. :)
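
The effect on Justin's example can be shown with a toy model in Java. This is not Oak code: the index is modelled as a plain map from property value to node paths, and `PartialReindexDemo` is a hypothetical name. It only assumes what is described above, namely that the stored index is dropped and rebuilt from the nodes under the given reindexPaths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not Oak code: a property index as a map from value to node paths.
final class PartialReindexDemo {

    // Rebuilds the index from scratch, visiting only nodes under reindexPaths.
    static Map<String, List<String>> reindex(Map<String, String> keywordsByPath,
                                             List<String> reindexPaths) {
        Map<String, List<String>> index = new LinkedHashMap<>(); // old entries are dropped
        for (Map.Entry<String, String> node : keywordsByPath.entrySet()) {
            if (isUnderAny(node.getKey(), reindexPaths)) {
                index.computeIfAbsent(node.getValue(), k -> new ArrayList<>())
                     .add(node.getKey());
            }
        }
        return index;
    }

    static boolean isUnderAny(String path, List<String> roots) {
        for (String root : roots) {
            if (root.equals("/") || path.equals(root) || path.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> repo = new LinkedHashMap<>();
        repo.put("/content/foo", "something");
        repo.put("/content/bar/one", "something");
        repo.put("/content/bar/two", "something");

        // /content/foo is no longer found after the partial reindex.
        System.out.println(reindex(repo, Arrays.asList("/content/bar")).get("something"));
        // prints [/content/bar/one, /content/bar/two]
    }
}
```

Passing `/content`, or both `/content/bar` and `/content/foo`, brings `/content/foo` back into the result.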

D.

Re: reindex improvements

Posted by Justin Edelson <ju...@justinedelson.com>.

Hi Davide,
So what would happen to the already-indexed content which wasn't in
one of the reindexPaths?

For example, let's say I'm building an index of a property called
"keywords". In the repo, I have:

/content/foo@keywords=something
/content/bar/one@keywords=something
/content/bar/two@keywords=something

And then I trigger a reindex with reindexPaths = /content/bar.

Would //element(*)[@keywords='something'] still return /content/foo ?

Regards,
Justin



Re: reindex improvements

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

Did we already run into this problem in reality? How much of a pain point
is it? I think creating indexes is a "maintenance job", which doesn't need
to be done very often, comparable to creating a backup. If creating the
index is asynchronous, then it's OK if it's slow. Re-indexing (re-building
an existing index) should only be needed if there is a bug in the indexing
code.

If we really want to support it (not sure if it's worth it), I see two
main options:

* Defining a path filter in the index is an option, but I would probably
call it just "paths" and not "reindexPaths". Such an index would only be
used if the query is restricted to one of the paths.

* We could define indexes in a subtree. We discussed that a while back,
and indeed we already have some code for it. Right now, all indexes are
stored under "/oak:index/...". If you want to index only "/content/", then
the index could be stored under "/content/oak:index" (for example).
However, there are some problems: finding such an index requires that the
given subtree is read when running the query. Also, defining access rights
for those indexes is not trivial. Even though it has some advantages, it
also has disadvantages.
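
The rule in the first option ("only used if the query is restricted to one of the paths") could look roughly like the Java sketch below. `PathRestrictedIndex` and `canServe` are hypothetical names, not Oak API, and a `null` query path stands for a query without any path restriction.

```java
import java.util.List;

// Hypothetical sketch, not Oak API: decides whether an index that covers only
// some paths may answer a query with a given path restriction.
final class PathRestrictedIndex {

    // queryPath == null means the query has no path restriction at all.
    static boolean canServe(List<String> indexedPaths, String queryPath) {
        if (queryPath == null) {
            return false; // the index is incomplete for repository-wide queries
        }
        for (String indexed : indexedPaths) {
            if (indexed.equals("/")
                    || queryPath.equals(indexed)
                    || queryPath.startsWith(indexed + "/")) {
                return true; // the query stays inside an indexed subtree
            }
        }
        return false;
    }
}
```

With `paths = /content`, a query restricted to `/content/bar` could use the index, while an unrestricted query or one restricted to `/etc` would have to fall back to another index or to traversal.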

Regards,
Thomas
