Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/09/18 17:14:01 UTC

On custom index configuration

Hi,

We now have a couple of initial index implementations for Oak and some
ideas on how index configuration could/should work. In order to start
unifying those approaches and to find some common consensus, I'd like
to throw out an idea of how I think index configuration should work in
Oak. Critiques, improvements or competing ideas welcome!

First of all I think there shouldn't be just one single place in the
repository where all index configuration should go. It would be nice
if users and applications could define custom indexes on areas they
have write access to, and having to grant them access to some shared
location for that might be troublesome.

Instead I'd allow custom indexes to be defined by adding something
like an oak:indexed mixin type and an associated oak:indexes child
node to any node in the repository. Each child node of that
oak:indexes node would configure an index for the subtree rooted at
that oak:indexed node. Index configuration would be stored as normal
content, and the index content in a hidden :index subtree or elsewhere
depending on the type of the index.

For example, here's what a content tree could look like that defines a
unique jcr:uuid index at the root of a workspace, a normal jcr:title
property index for content under /articles and a Lucene full text index
for nt:file nodes under /data:

    / [jcr:mixinTypes = oak:indexed]
        /oak:indexes
            /uuid [jcr:primaryType = oak:uniqueIndex, oak:propertyName = jcr:uuid]
                /:index { invisible index content }
        /articles [jcr:mixinTypes = oak:indexed]
            /oak:indexes
                /title [jcr:primaryType = oak:propertyIndex, oak:propertyName = jcr:title]
                    /:index { invisible index content }
        /data [jcr:mixinTypes = oak:indexed]
            /oak:indexes
                /fulltext [jcr:primaryType = oak:fulltextIndex, oak:nodeType = nt:file]
                    /:index { invisible index content }

Creating a new custom index would be a matter of adding the
appropriate index configuration settings. For example, the following
code would define an additional jcr:created property index under
/articles:

    Session session  = ...;
    Node indexes = session.getNode("/articles/oak:indexes");
    Node created = indexes.addNode("created", "oak:propertyIndex");
    created.setProperty("oak:propertyName", "jcr:created");
    session.save();

Creating such an index node would trigger automatic indexing of the
subtree, either directly in a commit hook or as a delayed background
processing job. All future commits to that subtree would automatically
get indexed since the commit hook would notice the index
configurations in the oak:indexes node as it traverses down the commit
diffs.

When executing a query, the search engine in Oak would then detect all
indexes along the main path axis of a given query. For example, when
querying for content inside /data/foo, the search engine would use the
indexes at / and /data, but not the ones at /articles.
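
The lookup described above can be sketched in a few lines. This is only an illustration, not Oak code: the class and method names are made up, and the set of paths stands in for the nodes that carry the proposed oak:indexed mixin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class IndexLookupSketch {

    /**
     * Returns the indexed ancestor-or-self paths of queryPath, root first.
     * indexedPaths is a hypothetical stand-in for the nodes marked oak:indexed.
     */
    static List<String> applicableIndexRoots(Set<String> indexedPaths, String queryPath) {
        List<String> result = new ArrayList<>();
        if (indexedPaths.contains("/")) {
            result.add("/");
        }
        StringBuilder current = new StringBuilder();
        for (String segment : queryPath.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty token before the leading slash
            }
            current.append('/').append(segment);
            if (indexedPaths.contains(current.toString())) {
                result.add(current.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("/", "/articles", "/data");
        // Querying inside /data/foo sees the indexes at / and /data only.
        System.out.println(applicableIndexRoots(indexed, "/data/foo")); // prints [/, /data]
    }
}
```

Walking the path segment by segment keeps the lookup proportional to the depth of the query path rather than to the number of index definitions in the repository.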

Removing a custom index would be a simple matter of removing the
respective index configuration node. For example, to remove the full
text index defined above, one would do:

    Session session  = ...;
    session.getNode("/data/oak:indexes/fulltext").remove();
    session.save();

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Bertrand Delacretaz <bd...@apache.org>.

On Wed, Sep 19, 2012 at 3:50 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> On Sep 19, 2012, at 3:48 PM, Bertrand Delacretaz <bd...@apache.org> wrote:
>> ...Currently, Jackrabbit mostly considers the repository as a monolithic
>> unit of content, yet typical applications have widely varying
>> requirements when it comes to indexing, performance, observation,
>> eventual consistency etc. for various parts of their content. Being
>> able to have different settings for those various subtrees can make a
>> big difference IMO.
>
> if we can mount workspaces into other workspaces, then maybe configuration of this could
> still be tied to a workspace and not nodes in the tree....

Agreed, using workspaces in this way might work if the mounts are
transparent at the JCR level.

-Bertrand

Re: On custom index configuration

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

On Sep 19, 2012, at 3:48 PM, Bertrand Delacretaz <bd...@apache.org> wrote:

> On Wed, Sep 19, 2012 at 2:26 PM, Thomas Mueller <mu...@adobe.com> wrote:
>> Jukka wrote:
>>>> When executing a query, the search engine in Oak would then detect all
>>>> indexes along the main path axis of a given query. For example, when
>>>> querying for content inside /data/foo, the search engine would use the
>>>> indexes at / and /data, but not the ones at /articles.
>> 
>> From a user perspective, that might be cool. From the perspective of the
>> query engine developer, not so cool. Patches are welcome :-)...
> 
> From the peanut gallery, I think this is an *extremely* useful feature.
> 
> Currently, Jackrabbit mostly considers the repository as a monolithic
> unit of content, yet typical applications have widely varying
> requirements when it comes to indexing, performance, observation,
> eventual consistency etc. for various parts of their content. Being
> able to have different settings for those various subtrees can make a
> big difference IMO.


if we can mount workspaces into other workspaces, then maybe configuration of this could still be tied to a workspace and not nodes in the tree.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: On custom index configuration

Posted by Bertrand Delacretaz <bd...@apache.org>.

On Wed, Sep 19, 2012 at 2:26 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Jukka wrote:
>>>When executing a query, the search engine in Oak would then detect all
>>> indexes along the main path axis of a given query. For example, when
>>> querying for content inside /data/foo, the search engine would use the
>>> indexes at / and /data, but not the ones at /articles.
>
> From a user perspective, that might be cool. From the perspective of the
> query engine developer, not so cool. Patches are welcome :-)...

From the peanut gallery, I think this is an *extremely* useful feature.

Currently, Jackrabbit mostly considers the repository as a monolithic
unit of content, yet typical applications have widely varying
requirements when it comes to indexing, performance, observation,
eventual consistency etc. for various parts of their content. Being
able to have different settings for those various subtrees can make a
big difference IMO.

-Bertrand

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>+1
>
>From a content modeling perspective, forcing all indexes in a central
>location is very restricting and not modular.

Where the index configuration nodes are stored is normally internal to the
implementation, with the exceptions of export and import.

The index configuration is similar to the node type configuration: a
normal developer doesn't typically write those nodes directly, but uses a
tool instead.

>Also, maybe you want less privileged users/groups configure an index,
>which might not have access to that central config node.

I think index configuration should be similar to managing node types and
access rights.

>Is indexing done using the admin session?

Do you mean updating the index? Yes.

>If it's hidden, would it be possible to access and copy that :index node
>(in a jcr dump/packaging mechanism) over to another instance in order to
>avoid reindexing?

Now you talk about the index content, not the index configuration. That's
a different topic, but:

Yes, that might be possible. Even so, I don't consider this an
advantage. For databases, what you typically do is export just the data,
and then let the database re-create the index. I have never heard about
the use case "export / import index content" except if you restore a
complete database.

Actually it is a problem if you import into an existing tree (possibly
overwriting some of the current data). This would break the existing index
and lead to problems that are hard to detect.

>>When executing a query, the search engine in Oak would then detect all
>> indexes along the main path axis of a given query. For example, when
>> querying for content inside /data/foo, the search engine would use the
>> indexes at / and /data, but not the ones at /articles.
>
>Kewl :-)

From a user perspective, that might be cool. From the perspective of the
query engine developer, not so cool. Patches are welcome :-)

Regards,
Thomas

Re: On custom index configuration

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 18.09.2012, at 17:14, Jukka Zitting <ju...@gmail.com> wrote:

> First of all I think there shouldn't be just one single place in the
> repository where all index configuration should go. It would be nice
> if users and applications could define custom indexes on areas they
> have write access to, and having to grant them access to some shared
> location for that might be troublesome.

+1

From a content modeling perspective, forcing all indexes into a central location is very restricting and not modular. Also, maybe you want less privileged users/groups to configure an index, who might not have access to that central config node.

BTW, how is security currently handled? Who can create an index config? Is indexing done using the admin session?

> Index configuration would be stored as normal
> content, and the index content in a hidden :index subtree or elsewhere
> depending on the type of the index.

If it's hidden, would it be possible to access and copy that :index node (in a jcr dump/packaging mechanism) over to another instance in order to avoid reindexing?

> When executing a query, the search engine in Oak would then detect all
> indexes along the main path axis of a given query. For example, when
> querying for content inside /data/foo, the search engine would use the
> indexes at / and /data, but not the ones at /articles.

Kewl :-)

Cheers,
Alex

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Sep 19, 2012 at 11:30 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> Writing it to local FS instead of memory would then also be an option,
> right?

Definitely. The current approach of loading the index binaries to
memory is just a temporary solution (the easiest thing that could
possibly work :-) that certainly won't scale to larger indexes. We'll
need to extend the binary value mechanism a bit to match features
(mostly random access) that Lucene needs.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Ard Schrijvers <a....@onehippo.com>.

On Wed, Sep 19, 2012 at 10:39 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers
> <a....@onehippo.com> wrote:
>> I've read the entire thread, and below reply inline to the initial
>> proposal of Jukka as I have some doubts in that area:
>
> Great comments, thanks for joining the discussion!

Thanks

>
>> The only way I could imagine we already gain a lot compared to jr 2.x
>> and still have performance is if we have the backing storage contain
>> (and maintain like indexing new nodes) the indexes  (just like Jukka
>> suggests), but repository (jvm) instances load the entire index nodes
>> from the repository to local FS. If the repository index is an append
>> only binary (for example append only the binary segments as new
>> binaries to an index just like Lucene does) then perhaps it could
>> perform
>
> That's the idea.

Ah, good to hear :)

> All frequently accessed binaries can and should be
> kept locally, which should make the index perform pretty well. This
> isn't implemented yet (currently the LuceneIndex simply reads all
> index binaries to memory...), so there still is no way to benchmark

Writing it to local FS instead of memory would then also be an option,
right? Lucene indexes for current 2.x jr tend to get quite large, so
keeping them in memory might get quite big. Lucene also has a bit
better performance for FS indexes compared to memory indexes, but this
won't be too big an issue (it is due to GC overhead, certainly when
the in memory index becomes large)

> the idea in practice. But at least from a design perspective I don't
> see any major reasons why this solution couldn't perform at least
> reasonably close to what Lucene achieves when directly accessing a
> local file system.

Yes, as long as you have the Lucene indexes near the computation,
performance should be at least comparable to normal FS Lucene indexes.

>
>> And here I think I have my other doubts. For example, Lucene needs the
>> same analyzers query time as were used indexing time. Now, if I would
>> have an English spellchecker for the index at / and a French for the
>> index at /data, then, I cannot see how you could ever query both
>> indexes in one go. Similarly if the index at / indexes title property
>> as String (single token) and the index at /data indexes the title as
>> Text (tokenized). How can you now query the title at /
>
> The index at / indexes content from the entire tree, also from within
> /data. The fact that there's an extra index at /data wouldn't affect
> the index at / in any way. Therefore you can still easily query for
> title at / in English and get correct results also from within /data.
>
>> So, I do think it is nice to be able to configure multiple index
>> configuration for different parts of the jcr tree, but I doubt about
>> supporting nested indexes that are backed by different index
>> configuration. Without the nesting, I think it would work.
>
> As mentioned above, the idea is not for the indexes to be nested. (I
> previously toyed with the idea of a hierarchical map-reduce -like
> mechanism for building an index incrementally across the whole tree,
> but that's a different discussion and probably won't be implemented
> unless there's some particular use case for something like that.)
>
>> Thus, query for / uses the index for /. Query for /data uses just
>> the index for /data, not the one from /
>
> The index selection process is a bit more complicated than that.
>
> Basically for each query we'd look up all the potentially applicable
> indexes, and then each index is asked to estimate how efficiently it
> could execute a given query, for example
> /jcr:root/data//*[@title='foo']. The index at / would notice that it
> does keep track of the title property so it can do a property
> constraint pretty efficiently, but probably won't be that fast in
> evaluating the path constraint. The index at /data on the other hand
> could do both constraints efficiently, so the query engine will pick
> that one.
>
> On the other hand, if the query was about some other property, like
> /jcr:root/data//*[@author='bar'], and that property is only indexed at
> /, then that index would likely get selected by the query engine over
> the one at /data.

Thanks for your detailed explanation Jukka. It is now more clear to me
how you want to manage it. It does seem quite complex to me to
implement, but with enough transpiration it might work out :))

Regards Ard

>
> BR,
>
> Jukka Zitting




Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Sep 20, 2012 at 7:50 PM, Thomas Mueller <mu...@adobe.com> wrote:
> You are right. To do this efficiently it would require an index on
> [oak:indexed], which would effectively be the same as having the list of
> indexes (and where they are configured) stored at a central place...

Right. The differences to an explicit list are that a node type
index can be used for a lot of other things too and that there's no
need for extra code to maintain such a list.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>>Yes, but the problem is if the "repository-administrator" has no way of
>> knowing what indexes are configured.
>
>That's easy to find out:
>
>    SELECT * FROM [oak:indexed]
>
>;-)

You are right. To do this efficiently it would require an index on
[oak:indexed], which would effectively be the same as having the list of
indexes (and where they are configured) stored at a central place...

Regards,
Thomas

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Sep 20, 2012 at 4:49 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Yes, but the problem is if the "repository-administrator" has no way of
> knowing what indexes are configured.

That's easy to find out:

    SELECT * FROM [oak:indexed]

;-)

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>It's a bit like in a relational database where you can configure an
>index to use one underlying data structure or another. The more
>indexes you have, the more you need to do when you want to reconfigure
>the indexes.

Yes, but the problem is if the "repository-administrator" has no way of
knowing what indexes are configured. To reconfigure the indexes, he would
have to read all the nodes in the repository, unless there is a central
place where all indexes are listed.
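
To make the cost concrete, here is a rough sketch of what finding all index definitions looks like without a central list or a node type index. It uses a tiny in-memory tree, not the JCR API; TreeNode and findIndexedPaths are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexedNodeScanSketch {

    /** Minimal in-memory stand-in for a repository node (not javax.jcr.Node). */
    static final class TreeNode {
        final boolean indexed; // true if the node would carry the oak:indexed mixin
        final Map<String, TreeNode> children = new LinkedHashMap<>();

        TreeNode(boolean indexed) {
            this.indexed = indexed;
        }

        TreeNode addChild(String name, boolean childIndexed) {
            TreeNode child = new TreeNode(childIndexed);
            children.put(name, child);
            return child;
        }
    }

    /** Visits every node of the tree and collects the paths marked as indexed. */
    static List<String> findIndexedPaths(TreeNode root) {
        List<String> found = new ArrayList<>();
        collect(root, "/", found);
        return found;
    }

    private static void collect(TreeNode node, String path, List<String> found) {
        if (node.indexed) {
            found.add(path);
        }
        for (Map.Entry<String, TreeNode> e : node.children.entrySet()) {
            String childPath = path.equals("/") ? "/" + e.getKey() : path + "/" + e.getKey();
            collect(e.getValue(), childPath, found);
        }
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(true);
        root.addChild("articles", true);
        root.addChild("data", true);
        root.addChild("long", false).addChild("path", false).addChild("deep", false)
            .addChild("in", false).addChild("repository", true);
        System.out.println(findIndexedPaths(root));
    }
}
```

The traversal touches every node regardless of how few index definitions exist, which is the administration cost being pointed out here.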

Regards,
Thomas

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Sep 20, 2012 at 3:16 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> that is my point .. if we have custom index configuration referencing specific indexing
> plugins all over the content .. then switching indexers might get needlessly hard ..

Ah, I see the point.

IMO this is basically up to a deployment and its administrator to
worry about. If you define just one or two big indexes that cover most
of the repository, then switching to an alternative indexing mechanism
will be fairly straightforward. And if you need lots of small,
purpose-built indexes then migrating them all will take more work.

It's a bit like in a relational database where you can configure an
index to use one underlying data structure or another. The more
indexes you have, the more you need to do when you want to reconfigure
the indexes.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

On Sep 20, 2012, at 3:14 PM, Jukka Zitting <ju...@gmail.com> wrote:

>> The example you are showing above seems to go into the opposite direction.
> 
> Can you elaborate? The example doesn't affect any of the actual
> content, just the index configuration.

that is my point .. if we have custom index configuration referencing specific indexing plugins all over the content .. then switching indexers might get needlessly hard ..

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Sep 20, 2012 at 2:48 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> i think it would be ideal if in most cases switching the internal lucene solution
> for Solr/ES should work without having to touch anything beyond a few configs
> (which can of course be stored inside the repo).

That should be the case.

> The example you are showing above seems to go into the opposite direction.

Can you elaborate? The example doesn't affect any of the actual
content, just the index configuration.

In the case of an external index, the :index subtree wouldn't even be
there as the index content would be stored outside the repository.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

It is problematic to fix the configuration if you don't know where to
look, if the configuration can be basically anywhere in the repository.
Let's say there are special Lucene indexes configured at:

   /long/path/deep/in/repository

And if you want to switch to another query engine (let's say MongoDB), how
can you do that, given you don't know there is an index configured?

I guess it's simpler to administrate indexes if the configuration is
stored, or at least linked, at a central place.

Regards,
Thomas


On 9/20/12 2:48 PM, "Lukas Kahwe Smith" <ml...@pooteeweet.org> wrote:

>
>On Sep 20, 2012, at 2:45 PM, Tommaso Teofili <te...@adobe.com> wrote:
>
>> Hi all,
>> 
>> On 19/set/2012, at 22:47, Lukas Kahwe Smith wrote:
>> 
>>> Hi,
>>> 
>>> Just wanted to bring up how this all relates to custom index solutions
>>>(like Solr/ES). Isn't there a risk that by making it possible to attach
>>>such configuration to nodes, that it would encourage applications that
>>>make it close to impossible to switch to Solr/ES to benefit from their
>>>features (especially improved scalability in clustered setups)?
>> 
>> Why do you think so? Actually I'm working on such an integration (w/
>>Solr) and it doesn't sound that bad, on the contrary, as far as I
>>understand Jukka's proposal, it should be easier as you could add
>>something like:
>> 
>>     /path/to/somewhere [jcr:mixinTypes = oak:indexed]
>>           /oak:indexes
>>               /solr [jcr:primaryType = oak:solrIndex, oak:nodeType =
>>nt:file, url = ...]
>>                   /:index { invisible index content }
>> 
>> along with a CommitHook and a QueryIndex specific implementations.
>
>it just seemed to me like it would encourage a proliferation of very
>specialized handlers which bind the content to the specific indexer.
>i think it would be ideal if in most cases switching the internal lucene
>solution for Solr/ES should work without having to touch anything beyond
>a few configs (which can of course be stored inside the repo). The
>example you are showing above seems to go into the opposite direction.
>
>regards,
>Lukas Kahwe Smith
>mls@pooteeweet.org
>
>
>

Re: On custom index configuration

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

On Sep 20, 2012, at 2:45 PM, Tommaso Teofili <te...@adobe.com> wrote:

> Hi all,
> 
> On 19/set/2012, at 22:47, Lukas Kahwe Smith wrote:
> 
>> Hi,
>> 
>> Just wanted to bring up how this all relates to custom index solutions (like Solr/ES). Isn't there a risk that by making it possible to attach such configuration to nodes, that it would encourage applications that make it close to impossible to switch to Solr/ES to benefit from their features (especially improved scalability in clustered setups)?
> 
> Why do you think so? Actually I'm working on such an integration (w/ Solr) and it doesn't sound that bad, on the contrary, as far as I understand Jukka's proposal, it should be easier as you could add something like:
> 
>     /path/to/somewhere [jcr:mixinTypes = oak:indexed]
>           /oak:indexes
>               /solr [jcr:primaryType = oak:solrIndex, oak:nodeType = nt:file, url = ...]
>                   /:index { invisible index content }
> 
> along with a CommitHook and a QueryIndex specific implementations.

it just seemed to me like it would encourage a proliferation of very specialized handlers which bind the content to the specific indexer.
i think it would be ideal if in most cases switching the internal lucene solution for Solr/ES should work without having to touch anything beyond a few configs (which can of course be stored inside the repo). The example you are showing above seems to go into the opposite direction.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: On custom index configuration

Posted by Tommaso Teofili <te...@adobe.com>.

Hi all,

On 19/set/2012, at 22:47, Lukas Kahwe Smith wrote:

> Hi,
> 
> Just wanted to bring up how this all relates to custom index solutions (like Solr/ES). Isn't there a risk that by making it possible to attach such configuration to nodes, that it would encourage applications that make it close to impossible to switch to Solr/ES to benefit from their features (especially improved scalability in clustered setups)?

Why do you think so? Actually I'm working on such an integration (w/ Solr) and it doesn't sound that bad, on the contrary, as far as I understand Jukka's proposal, it should be easier as you could add something like:

     /path/to/somewhere [jcr:mixinTypes = oak:indexed]
           /oak:indexes
               /solr [jcr:primaryType = oak:solrIndex, oak:nodeType = nt:file, url = ...]
                   /:index { invisible index content }

along with a CommitHook and a QueryIndex specific implementations.
Just my 0.2 cents.
Regards,
Tommaso

> 
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org
> 
> 
>

Re: On custom index configuration

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

Hi,

Just wanted to bring up how this all relates to custom index solutions (like Solr/ES). Isn't there a risk that making it possible to attach such configuration to nodes would encourage applications that make it close to impossible to switch to Solr/ES to benefit from their features (especially improved scalability in clustered setups)?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Sep 19, 2012 at 9:30 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> I've read the entire thread, and below reply inline to the initial
> proposal of Jukka as I have some doubts in that area:

Great comments, thanks for joining the discussion!

> The only way I could imagine we already gain a lot compared to jr 2.x
> and still have performance is if we have the backing storage contain
> (and maintain like indexing new nodes) the indexes  (just like Jukka
> suggests), but repository (jvm) instances load the entire index nodes
> from the repository to local FS. If the repository index is an append
> only binary (for example append only the binary segments as new
> binaries to an index just like Lucene does) then perhaps it could
> perform

That's the idea. All frequently accessed binaries can and should be
kept locally, which should make the index perform pretty well. This
isn't implemented yet (currently the LuceneIndex simply reads all
index binaries to memory...), so there still is no way to benchmark
the idea in practice. But at least from a design perspective I don't
see any major reasons why this solution couldn't perform at least
reasonably close to what Lucene achieves when directly accessing a
local file system.
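
As a rough illustration of keeping append-only index binaries on the local file system: the sketch below copies only segments that are not yet present locally. It is not how LuceneIndex works today; the map is a stand-in for binaries stored in the repository, and all names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalIndexCopySketch {

    /** Creates a scratch directory for the local copy of the index. */
    static Path newTempDir() {
        try {
            return Files.createTempDirectory("oak-index-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Copies segments that are missing locally and returns their names.
     * Assumes segments are append-only, so an existing local file is never stale.
     */
    static List<String> syncSegments(Map<String, byte[]> repositorySegments, Path localDir) {
        List<String> copied = new ArrayList<>();
        try {
            Files.createDirectories(localDir);
            for (Map.Entry<String, byte[]> segment : repositorySegments.entrySet()) {
                Path target = localDir.resolve(segment.getKey());
                if (Files.notExists(target)) {
                    Files.write(target, segment.getValue());
                    copied.add(segment.getKey());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copied;
    }

    public static void main(String[] args) {
        Path dir = newTempDir();
        Map<String, byte[]> repo = new TreeMap<>();
        repo.put("segment_1", new byte[] {1});
        System.out.println(syncSegments(repo, dir)); // first sync copies segment_1
        repo.put("segment_2", new byte[] {2});
        System.out.println(syncSegments(repo, dir)); // second sync copies only segment_2
    }
}
```

Because nothing is rewritten in place under the append-only assumption, a cluster node joining later only pays for the segments it has not seen yet.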

> And here I think I have my other doubts. For example, Lucene needs the
> same analyzers query time as were used indexing time. Now, if I would
> have an English spellchecker for the index at / and a French for the
> index at /data, then, I cannot see how you could ever query both
> indexes in one go. Similarly if the index at / indexes title property
> as String (single token) and the index at /data indexes the title as
> Text (tokenized). How can you now query the title at /

The index at / indexes content from the entire tree, also from within
/data. The fact that there's an extra index at /data wouldn't affect
the index at / in any way. Therefore you can still easily query for
title at / in English and get correct results also from within /data.

> So, I do think it is nice to be able to configure multiple index
> configuration for different parts of the jcr tree, but I doubt about
> supporting nested indexes that are backed by different index
> configuration. Without the nesting, I think it would work.

As mentioned above, the idea is not for the indexes to be nested. (I
previously toyed with the idea of a hierarchical map-reduce -like
mechanism for building an index incrementally across the whole tree,
but that's a different discussion and probably won't be implemented
unless there's some particular use case for something like that.)

> Thus, query for / uses the index for /. Query for /data uses just
> the index for /data, not the one from /

The index selection process is a bit more complicated than that.

Basically for each query we'd look up all the potentially applicable
indexes, and then each index is asked to estimate how efficiently it
could execute a given query, for example
/jcr:root/data//*[@title='foo']. The index at / would notice that it
does keep track of the title property so it can do a property
constraint pretty efficiently, but probably won't be that fast in
evaluating the path constraint. The index at /data on the other hand
could do both constraints efficiently, so the query engine will pick
that one.

On the other hand, if the query was about some other property, like
/jcr:root/data//*[@author='bar'], and that property is only indexed at
/, then that index would likely get selected by the query engine over
the one at /data.
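
A minimal sketch of that selection step, with made-up cost numbers: each candidate index reports an estimate for the query and the cheapest one wins. The Query and Index classes here are hypothetical simplifications (one subtree path plus one property constraint), not the actual Oak query engine types.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class IndexSelectionSketch {

    /** Simplified query: a subtree path plus a single property constraint. */
    static final class Query {
        final String path;
        final String property;

        Query(String path, String property) {
            this.path = path;
            this.property = property;
        }
    }

    /** Simplified index: where it is rooted and which properties it tracks. */
    static final class Index {
        final String root;
        final Set<String> properties;

        Index(String root, Set<String> properties) {
            this.root = root;
            this.properties = properties;
        }

        /** Invented estimate; lower is better, infinity means "cannot help". */
        double estimateCost(Query q) {
            boolean coversPath = root.equals("/") || q.path.equals(root)
                    || q.path.startsWith(root + "/");
            if (!coversPath || !properties.contains(q.property)) {
                return Double.POSITIVE_INFINITY;
            }
            // An index rooted at the queried subtree needs no extra path filtering.
            return root.equals(q.path) ? 1.0 : 10.0;
        }
    }

    static Optional<Index> select(List<Index> candidates, Query q) {
        return candidates.stream()
                .filter(i -> i.estimateCost(q) != Double.POSITIVE_INFINITY)
                .min(Comparator.comparingDouble(i -> i.estimateCost(q)));
    }

    public static void main(String[] args) {
        List<Index> indexes = List.of(
                new Index("/", Set.of("title", "author")),
                new Index("/data", Set.of("title")));
        // /jcr:root/data//*[@title='foo'] -> the index at /data is cheaper
        System.out.println(select(indexes, new Query("/data", "title")).get().root);
        // /jcr:root/data//*[@author='bar'] -> only the index at / covers author
        System.out.println(select(indexes, new Query("/data", "author")).get().root);
    }
}
```

With these assumed costs the first query picks the index at /data and the second falls back to the one at /, matching the two cases described above.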

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Ard Schrijvers <a....@onehippo.com>.

Hello Jukka et al,

I've read the entire thread, and below reply inline to the initial
proposal of Jukka as I have some doubts in that area:

On Tue, Sep 18, 2012 at 5:14 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
<snip/>
>
> First of all I think there shouldn't be just one single place in the
> repository where all index configuration should go. It would be nice
> if users and applications could define custom indexes on areas they
> have write access to, and having to grant them access to some shared
> location for that might be troublesome.
>
> Instead I'd allow a custom indexes to be defined by adding something
> like an oak:indexed mixin type and an associated oak:indexes child
> node to any node in the repository. Each child node of that
> oak:indexes node would configure an index for the subtree rooted at
> that oak:indexed node. Index configuration would be stored as normal
> content, and the index content in a hidden :index subtree or elsewhere
> depending on the type of the index.

Having the Lucene indexes inside the repository is of course really
really nice, as currently (jr 2.x), bringing up a new cluster
repository node means you first have to index the entire repository to
create a *local* FS Lucene index (or actually indexes). That said, of
course it is really nice, but, I didn't yet hear of *any* successful
Lucene implementation that did not have the Lucene indexes near the
computation. Thus having the Lucene indexes in, say some noSQL store
or database, pretty much means it will never perform afaiu.

Also, I've talked to Simon Willnauer (Lucene chair) a couple of times
about these kind of attempts. He says Lucene will *never* perform if
the data (indexes) are not near the computation.

So, if we want to store the lucene indexes in the oak repository in
binary fields, how will they ever be 'near' the computation?

OTOH, I must be missing something because I expressed these concerns
before to Jukka so he must know something that I don't if he is still
confident this will work :)

The only way I could imagine we already gain a lot compared to jr 2.x
and still have performance is if we have the backing storage contain
(and maintain like indexing new nodes) the indexes  (just like Jukka
suggests), but repository (jvm) instances load the entire index nodes
from the repository to local FS. If the repository index is an append
only binary (for example append only the binary segments as new
binaries to an index just like Lucene does) then perhaps it could
perform.

<snip/>

>
> When executing a query, the search engine in Oak would then detect all
> indexes along the main path axis of a given query. For example, when
> querying for content inside /data/foo, the search engine would use the
> indexes at / and /data, but not the ones at /articles.

And here I have my other doubts. For example, Lucene needs the
same analyzers at query time as were used at indexing time. Now, if I
have an English spellchecker for the index at / and a French one for
the index at /data, then I cannot see how you could ever query both
indexes in one go. Similarly if the index at / indexes the title
property as String (single token) and the index at /data indexes the
title as Text (tokenized). How can you now query the title at /?
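A toy illustration of the String versus Text mismatch, in plain Java.
This is not the Lucene analyzer API; the two methods are simplified
stand-ins I made up to show why one query term cannot serve both
indexes:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins for two differently configured indexes.
final class AnalyzerMismatch {

    /** "String" style: the whole value is indexed as one term. */
    static Set<String> asSingleToken(String value) {
        return Collections.singleton(value);
    }

    /** "Text" style: the value is lower-cased and split on whitespace. */
    static Set<String> asTokenizedText(String value) {
        String[] tokens = value.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<>(Arrays.asList(tokens));
    }
}
```

For a title such as "Oak Index Design", the term "index" is found in
the tokenized terms but not in the single-token terms, and the full
title is found only in the single-token terms. A query would have to
be analyzed separately for each index it touches.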

So, I do think it is nice to be able to configure multiple index
configurations for different parts of the JCR tree, but I have doubts
about supporting nested indexes that are backed by different index
configurations. Without the nesting, I think it would work. Thus, a
query for / uses the index for /. A query for /data uses just the
index for /data, not the one from /.
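The non-nested variant could be sketched roughly as below: walk up
from the query path and stop at the first indexed node. This is plain
Java, not Oak code; NearestIndex and its method are hypothetical names,
and paths are assumed to be absolute and without a trailing slash:

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of Oak.
final class NearestIndex {

    /**
     * Returns the closest indexed node at or above queryPath, or an
     * empty Optional if no node on the way up to the root is indexed.
     */
    static Optional<String> nearestIndex(Set<String> indexedNodes, String queryPath) {
        String path = queryPath;
        while (true) {
            if (indexedNodes.contains(path)) {
                return Optional.of(path);
            }
            if (path.equals("/")) {
                return Optional.empty();
            }
            // Step up to the parent; the parent of "/x" is the root.
            int slash = path.lastIndexOf('/');
            path = slash <= 0 ? "/" : path.substring(0, slash);
        }
    }
}
```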

These are my concerns. Unfortunately I cannot join the upcoming Oak
hackathon due to holiday, but otherwise I would have been very
interested in the details I don't understand.

Regards, Ard

>
> Removing a custom index would be a simple matter of removing the
> respective index configuration node. For example, to remove the full
> text index defined above, one would do:
>
>     Session session  = ...;
>     session.getNode("/data/oak:indexes/fulltext").remove();
>     session.save();
>
> BR,
>
> Jukka Zitting

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Sep 18, 2012 at 5:16 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> Would this also allow offering something similar to the virtual nodes based on facets
> that HippoCMS currently offers?

Not directly, but more on that concept soon in a separate thread.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

On Sep 18, 2012, at 5:14 PM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
> 
> We now have a couple of initial index implementations for Oak and some
> ideas on how index configuration could/should work. In order to start
> unifying those approaches and to find some common consensus, I'd like
> to throw out an idea of how I think index configuration should work in
> Oak. Critiques, improvements or competing ideas welcome!
> 
> First of all I think there shouldn't be just one single place in the
> repository where all index configuration should go. It would be nice
> if users and applications could define custom indexes on areas they
> have write access to, and having to grant them access to some shared
> location for that might be troublesome.
> 
> Instead I'd allow custom indexes to be defined by adding something
> like an oak:indexed mixin type and an associated oak:indexes child
> node to any node in the repository. Each child node of that
> oak:indexes node would configure an index for the subtree rooted at
> that oak:indexed node. Index configuration would be stored as normal
> content, and the index content in a hidden :index subtree or elsewhere
> depending on the type of the index.


Would this also allow offering something similar to the virtual nodes based on facets that HippoCMS currently offers?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

> I'd really like to avoid the need to rely on observation for keeping
> internal data structures in sync. It adds quite a bit of complexity

Well, having the index configuration distributed is what really adds
complexity :-) I think keeping a central list of all index configurations
simplifies things, and not having it is troublesome.

For me, storing the index configurations near the content is not a high
priority, but if you feel it's important I'm sure we'll find a solution.
Possibly we end up having a central list of paths where index
configurations are stored; that would be OK in my view.


Right now, for me the priority is adding support for global indexes. But I
will keep the 'distributed index config' in mind. Of course it would be a
nice thing to have.

Regards,
Thomas

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Sep 19, 2012 at 8:56 AM, Thomas Mueller <mu...@adobe.com> wrote:
>>At query time, when it knows the main path constraint used in the
>>query, it can walk down that path to detect which indexes are
>>available and useful for resolving the query.
>
> I guess we could make it work. It would make the query engine a bit more
> complex, and some of the queries would get a little bit slower (because a
> few more nodes would need to be read as they might contain index configs),
> but it's possible as far as I see.

The performance difference should be minimal, as all the relevant
index configuration nodes will be frequently accessed and thus cached
in memory. If there is a significant performance difference to
accessing another in-memory data structure, then we have a bug in our
cache.

As for complexity, we also gain from not having to maintain a separate
up to date in-memory representation of the index configuration and
worry about keeping it in sync with changes in in-content
configuration.

> The configuration of 'global' indexes (that affect the whole repository,
> such as the jcr:uuid index, the fulltext index) would still need to be
> stored at a fixed location (for example at the root node).

Yes, the root node can (and should) be an oak:indexed node.

> One problem is if the index config is stored at the wrong place, or if the
> query doesn't include the path restriction. For example if a config of a
> global index is stored under "/content" instead of "/", and then if the
> query doesn't explicitly use "/content", the index wouldn't be picked up.

That would be as designed. If you want to speed up a query that
doesn't contain a path restriction, you'd need to put the index under
the root node.

> Storing the index configs at a fixed location is still what I would
> prefer, because it is a very simple solution, and I still don't see very
> big advantages in storing the config near the content.

In addition to the access control issue I mentioned earlier this would
also allow us to migrate custom search indexes along with content. For
example, consider a web site or another content application stored in
a subtree of one repository. If you want to migrate it to another
repository (for example from development to production), it'll be
trivially easy to include also any custom indexes if they're
configured and stored in the same subtree.

On Wed, Sep 19, 2012 at 10:06 AM, Thomas Mueller <mu...@adobe.com> wrote:
> There is one more problem with storing the index config near the content.
> The index config doesn't just need to be read when running a query, but
> also when modifying data, in order to update the index data itself. If the
> index config isn't stored at a central place, then either the index isn't
> updated, or each time you store anything, all the parent nodes need to be
> read to pick up index configs.

The commit hook mechanism already provides a natural mechanism for
picking up information as you traverse down the tree to those areas
that are modified in a commit.

> A variation would be to store the index config at two places (at a central
> location and near the content). An internal observation handler could
> synchronize the two.

I'd really like to avoid the need to rely on observation for keeping
internal data structures in sync. It adds quite a bit of complexity
and risks hard-to-track inconsistency in internal state if there's a
bug in the relevant code.

> So I suggest we start with storing the index config at a central location,
> and then if we see a strong need we can still support a different solution.

We can start by only supporting the oak:indexed mechanism at the root
node, and extending it to support subtrees once there's a strong
enough need for that.

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

There is one more problem with storing the index config near the content.
The index config doesn't just need to be read when running a query, but
also when modifying data, in order to update the index data itself. If the
index config isn't stored at a central place, then either the index isn't
updated, or each time you store anything, all the parent nodes need to be
read to pick up index configs.

A variation would be to store the index config at two places (at a central
location and near the content). An internal observation handler could
synchronize the two.

But all that just seems like a lot of trouble to me, for a quite small
advantage (just my view).

So I suggest we start with storing the index config at a central location,
and then if we see a strong need we can still support a different solution.

Regards,
Thomas

On 9/18/12 6:04 PM, "Jukka Zitting" <ju...@gmail.com> wrote:

>Hi,
>
>On Tue, Sep 18, 2012 at 5:30 PM, Thomas Mueller <mu...@adobe.com> wrote:
>>>First of all I think there shouldn't be just one single place in the
>>>repository where all index configuration should go.
>>
>> Hm, how would the query engine detect what indexes are available?
>
>At query time, when it knows the main path constraint used in the
>query, it can walk down that path to detect which indexes are
>available and useful for resolving the query.
>
>At commit time, it can walk down the affected subtrees to detect which
>indexes need to be updated.
>
>BR,
>
>Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>At query time, when it knows the main path constraint used in the
>query, it can walk down that path to detect which indexes are
>available and useful for resolving the query.

I guess we could make it work. It would make the query engine a bit more
complex, and some of the queries would get a little bit slower (because a
few more nodes would need to be read as they might contain index configs),
but it's possible as far as I see.

The configuration of 'global' indexes (that affect the whole repository,
such as the jcr:uuid index, the fulltext index) would still need to be
stored at a fixed location (for example at the root node).


One problem is if the index config is stored at the wrong place, or if the
query doesn't include the path restriction. For example if a config of a
global index is stored under "/content" instead of "/", and then if the
query doesn't explicitly use "/content", the index wouldn't be picked up.
So there are a few things that could go wrong.

Storing the index configs at a fixed location is still what I would
prefer, because it is a very simple solution, and I still don't see very
big advantages in storing the config near the content.

Regards,
Thomas

Re: On custom index configuration

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Sep 18, 2012 at 5:30 PM, Thomas Mueller <mu...@adobe.com> wrote:
>>First of all I think there shouldn't be just one single place in the
>>repository where all index configuration should go.
>
> Hm, how would the query engine detect what indexes are available?

At query time, when it knows the main path constraint used in the
query, it can walk down that path to detect which indexes are
available and useful for resolving the query.

At commit time, it can walk down the affected subtrees to detect which
indexes need to be updated.
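As a rough sketch of both lookups, here is a plain Java version that
works on a set of paths instead of real repository nodes. IndexLookup
and its methods are hypothetical names for illustration, not Oak API,
and the set of oak:indexed node paths is assumed to be known already:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Oak.
final class IndexLookup {

    /** Query time: indexed nodes on the way from the root down to queryPath. */
    static List<String> indexedAncestors(Set<String> indexedNodes, String queryPath) {
        List<String> found = new ArrayList<>();
        if (indexedNodes.contains("/")) {
            found.add("/");
        }
        StringBuilder current = new StringBuilder();
        for (String name : queryPath.split("/")) {
            if (name.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            current.append('/').append(name);
            if (indexedNodes.contains(current.toString())) {
                found.add(current.toString());
            }
        }
        return found;
    }

    /** Commit time: every indexed node that covers at least one changed path. */
    static Set<String> indexesToUpdate(Set<String> indexedNodes,
                                       Collection<String> changedPaths) {
        Set<String> affected = new LinkedHashSet<>();
        for (String path : changedPaths) {
            affected.addAll(indexedAncestors(indexedNodes, path));
        }
        return affected;
    }
}
```

With indexed nodes at /, /data and /articles, a query on /data/foo
picks up / and /data, and a commit touching /articles/a1 and /tmp/x
needs to update the indexes at / and /articles.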

BR,

Jukka Zitting

Re: On custom index configuration

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>First of all I think there shouldn't be just one single place in the
>repository where all index configuration should go.

Hm, how would the query engine detect what indexes are available? I think
keeping the index configuration in one place is the simplest solution,
and I don't currently understand what problems that could cause... If it
does cause problems, maybe index config could be stored in two places
(near the data, and additionally at a central place) but I'm not sure.

>Instead I'd allow custom indexes to be defined by adding something
>like an oak:indexed mixin type and an associated oak:indexes child
>node to any node in the repository.

In order to find out what indexes exist, the query engine would have to
run a query? I would say, that's a bit problematic.

>For example, the following
>code would define an additional jcr:created property index under
>/articles:

I think we should define a utility to manage indexes, on top of the JCR
API.

>Creating such an index node would trigger automatic indexing of the
>subtree

I agree that index configuration should be normal nodes, and that adding
such nodes should create the index.

>either directly in a commit hook or as a delayed background
>processing job.

Yes. Sometimes it's also a good idea to wait before creating the index,
so that if a second index is added, the content only has to be traversed
once (and not once for every index).
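A minimal sketch of what I mean, in plain Java with a toy node class
instead of the real node store. BatchedIndexBuild, Node and the
value-to-paths map are made up for illustration and are not how an Oak
property index is actually stored:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of Oak.
final class BatchedIndexBuild {

    /** A toy content node: properties plus named child nodes. */
    static final class Node {
        final Map<String, String> properties = new LinkedHashMap<>();
        final Map<String, Node> children = new LinkedHashMap<>();
    }

    /** Counts visited nodes, to show the tree is walked only once. */
    int nodesVisited = 0;

    /** Builds one value-to-paths map per property name in a single traversal. */
    Map<String, Map<String, List<String>>> build(Node root, String rootPath,
                                                 List<String> propertyNames) {
        Map<String, Map<String, List<String>>> indexes = new LinkedHashMap<>();
        for (String name : propertyNames) {
            indexes.put(name, new LinkedHashMap<>());
        }
        visit(root, rootPath, indexes);
        return indexes;
    }

    private void visit(Node node, String path,
                       Map<String, Map<String, List<String>>> indexes) {
        nodesVisited++;
        // Feed this node to every pending index before moving on.
        for (Map.Entry<String, Map<String, List<String>>> index : indexes.entrySet()) {
            String value = node.properties.get(index.getKey());
            if (value != null) {
                index.getValue().computeIfAbsent(value, v -> new ArrayList<>()).add(path);
            }
        }
        for (Map.Entry<String, Node> child : node.children.entrySet()) {
            String childPath = path.equals("/")
                    ? "/" + child.getKey()
                    : path + "/" + child.getKey();
            visit(child.getValue(), childPath, indexes);
        }
    }
}
```

Calling build once with both jcr:title and jcr:created visits each node
a single time, whereas two separate calls would visit each node twice.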

Regards,
Thomas