Posted to oak-dev@jackrabbit.apache.org by Alex Parvulescu <al...@gmail.com> on 2012/03/22 00:13:39 UTC

Re (OAK-36) Implement a query parser - what about indexing?

Hi,

I've started to scratch the surface a little bit on the subject of queries.

OAK-36 covers the Query implementation effort, but I'm wondering if now
would be a good time to mention indexing as well.

We want to have dedicated indexes, and I think keeping them up to date
would be accomplished via observation.
Any ideas about the availability of this feature?

The current index implementation just traverses the existing nodes (albeit
applying some path constraints first), so it still doesn't have (for
lack of a better word) a local dedicated index.
This helps with testing the query parser & friends, but a Lucene-based
query engine needs events to update its data.

thoughts?

best,
alex

[0] http://markmail.org/message/vv7mohr22uqdugah

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Michael Dürig <md...@apache.org>.


On 22.3.12 11:39, Alex Parvulescu wrote:
> Hi,
>
> On Thu, Mar 22, 2012 at 10:43 AM, Michael Dürig<md...@apache.org>  wrote:
>
>>
>>
>> On 21.3.12 23:13, Alex Parvulescu wrote:
>>
>>> Hi,
>>>
>>> I've started to scratch the surface a little bit on the subject of
>>> queries.
>>>
>>> OAK-36 covers the Query implementation effort, but I'm wondering if now
>>> would be a good time to mention indexing as well.
>>>
>>
>> We should create a separate JIRA issue for the query execution engine. See
>> my last comment on OAK-28. I thought it would be too early for that
>> yesterday but since there seems to be some effort already in this area I
>> think we should go ahead.
>>
>>
> Sorry for not being clear enough. I'm talking about indexing.
> If you see the query result as a filtered output of the index, I'm looking
> for a way to build the input side.
> As far as I know you need some sort of observation events to know whenever
> a node has been added/changed so you can keep the index up to date.
>
> Am I making more sense now? :)

Right. I was mixing things up. Sorry for that.

Michael

>
> alex
>
>
>> Michael
>>
>>
>>> We want to have dedicated indexes, I think that would be accomplished via
>>> observation.
>>> Any ideas about the availability of this feature?
>>>
>>> The current index implementation just traverses the existing nodes (albeit
>>> applying some path constraints first), but still it doesn't have (for a
>>> lack of a better word) a local dedicated index.
>>> This helps with testing the query parser & friends, but a lucene based
>>> query engine needs events to update its data.
>>>
>>> thoughts?
>>>
>>> best,
>>> alex
>>>
>>> [0] http://markmail.org/message/vv7mohr22uqdugah
>>>
>>>
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>I'm looking
>for a way to build the input side.

Right, there are multiple ways to do that. I can think of three main cases:

(A) initial creation of the index: traverse the node tree
(B) offline update using the journal (MicroKernel.getJournal)
(C) real-time update using a MicroKernel wrapper

The index mechanism currently available in oak-core at
org.apache.jackrabbit.mk.index actually already supports all three cases:
(A) using Indexer.createPropertyIndex and Indexer.createPrefixIndex, (B)
if the index mechanism is used as a separate module, and (C) if the
IndexWrapper is used. Please note the index mechanism is not fully tested
yet. Also, there are still open questions with regard to clustering (but
to solve those, I believe we first need to get a better idea about how to
implement clustering - possibly we will not need this mechanism for
indexing in a cluster and use the 'native' index mechanism of the
underlying storage in this case). An index based on Lucene could work in a
similar way.
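
To make the three cases concrete, here is a minimal sketch in plain Java. It does not use the real org.apache.jackrabbit.mk.index classes or the MicroKernel API: SketchChange, SketchPropertyIndex and SketchIndexingStore are hypothetical names, and the node store is just an in-memory map. It only illustrates that the same index can be filled by traversal (A), by replaying a journal (B), or by a wrapper that sees each write (C).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical change record: one property of one node went from oldValue to newValue (null = absent). */
class SketchChange {
    final String path, property, oldValue, newValue;

    SketchChange(String path, String property, String oldValue, String newValue) {
        this.path = path;
        this.property = property;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }
}

/** Hypothetical in-memory property index: property value -> paths of the nodes carrying it. */
class SketchPropertyIndex {
    private final String property;
    private final Map<String, Set<String>> pathsByValue = new TreeMap<>();

    SketchPropertyIndex(String property) {
        this.property = property;
    }

    /** (A) Initial creation: traverse every existing node (path -> properties). */
    void buildFromTraversal(Map<String, Map<String, String>> nodes) {
        pathsByValue.clear();
        nodes.forEach((path, props) -> add(path, props.get(property)));
    }

    /** (B) Offline update: replay changes read from a journal. */
    void replayJournal(List<SketchChange> journal) {
        journal.forEach(this::apply);
    }

    /** (C) Real-time update: a store wrapper calls this for each change it forwards. */
    void apply(SketchChange change) {
        if (!property.equals(change.property)) {
            return;
        }
        remove(change.path, change.oldValue);
        add(change.path, change.newValue);
    }

    Set<String> lookup(String value) {
        return pathsByValue.getOrDefault(value, Collections.emptySet());
    }

    private void add(String path, String value) {
        if (value != null) {
            pathsByValue.computeIfAbsent(value, v -> new TreeSet<>()).add(path);
        }
    }

    private void remove(String path, String value) {
        Set<String> paths = (value == null) ? null : pathsByValue.get(value);
        if (paths != null && paths.remove(path) && paths.isEmpty()) {
            pathsByValue.remove(value);
        }
    }
}

/** Hypothetical wrapper around a toy node store: every write is journaled and indexed. */
class SketchIndexingStore {
    final Map<String, Map<String, String>> nodes = new TreeMap<>();
    final List<SketchChange> journal = new ArrayList<>();
    private final SketchPropertyIndex index;

    SketchIndexingStore(SketchPropertyIndex index) {
        this.index = index;
    }

    void setProperty(String path, String property, String value) {
        Map<String, String> props = nodes.computeIfAbsent(path, p -> new TreeMap<>());
        String old = (value == null) ? props.remove(property) : props.put(property, value);
        SketchChange change = new SketchChange(path, property, old, value);
        journal.add(change);
        index.apply(change);
    }
}
```

In this sketch all three paths end up with the same index content, which is the property one would want from any real implementation as well; clustering is deliberately left out.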

Regards,
Thomas

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Alex Parvulescu <al...@gmail.com>.

Hi,

On Thu, Mar 22, 2012 at 10:43 AM, Michael Dürig <md...@apache.org> wrote:

>
>
> On 21.3.12 23:13, Alex Parvulescu wrote:
>
>> Hi,
>>
>> I've started to scratch the surface a little bit on the subject of
>> queries.
>>
>> OAK-36 covers the Query implementation effort, but I'm wondering if now
>> would be a good time to mention indexing as well.
>>
>
> We should create a separate JIRA issue for the query execution engine. See
> my last comment on OAK-28. I thought it would be too early for that
> yesterday but since there seems to be some effort already in this area I
> think we should go ahead.
>
>
Sorry for not being clear enough. I'm talking about indexing.
If you see the query result as a filtered output of the index, I'm looking
for a way to build the input side.
As far as I know you need some sort of observation events to know whenever
a node has been added/changed so you can keep the index up to date.

Am I making more sense now? :)

alex


> Michael
>
>
>> We want to have dedicated indexes, I think that would be accomplished via
>> observation.
>> Any ideas about the availability of this feature?
>>
>> The current index implementation just traverses the existing nodes (albeit
>> applying some path constraints first), but still it doesn't have (for a
>> lack of a better word) a local dedicated index.
>> This helps with testing the query parser & friends, but a lucene based
>> query engine needs events to update its data.
>>
>> thoughts?
>>
>> best,
>> alex
>>
>> [0] http://markmail.org/message/vv7mohr22uqdugah
>>
>>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Michael Dürig <md...@apache.org>.


On 21.3.12 23:13, Alex Parvulescu wrote:
> Hi,
>
> I've started to scratch the surface a little bit on the subject of queries.
>
> OAK-36 covers the Query implementation effort, but I'm wondering if now
> would be a good time to mention indexing as well.

We should create a separate JIRA issue for the query execution engine. 
See my last comment on OAK-28. I thought it would be too early for that 
yesterday but since there seems to be some effort already in this area I 
think we should go ahead.

Michael

>
> We want to have dedicated indexes, I think that would be accomplished via
> observation.
> Any ideas about the availability of this feature?
>
> The current index implementation just traverses the existing nodes (albeit
> applying some path constraints first), but still it doesn't have (for a
> lack of a better word) a local dedicated index.
> This helps with testing the query parser & friends, but a lucene based
> query engine needs events to update its data.
>
> thoughts?
>
> best,
> alex
>
> [0] http://markmail.org/message/vv7mohr22uqdugah
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Mon, Mar 26, 2012 at 3:06 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>>Would it make sense to implement this in Oak? Or do you prefer an
>>>external
>>> project?
>>
>>You mean the Solr integration? If so, I think in the first place it is
>>important that we try to make it simple to integrate with an external
>>index.
>
> I just think it would be simpler for everybody if we could discuss on the
> concrete source code, instead of communicating only by email about
> abstract ideas. It's just more efficient in my view.

The concrete source code is currently available only as a one-day PoC
in patch form in jira issue: It is also tightly coupled with
'HippoBeans', some sort of read-only OCM mapping (resembling
jr-ocm). The rest of the 'non-seamless' efforts (useless for Oak)
are closed-source customer code.


>
>>Being able to listen to a commit journal to index nodes like
>>Jukka describes would help enormously already
>
> The current mechanism used for the property index is a MicroKernel wrapper
> implementation. Other solutions are possible of course.
>
>>I am not sure if it can be made generic enough to be
>>part of oak (core). Perhaps an optional module?
>
> I think such discussions are more efficient if we talk about the actual
> "source code". As soon as we have the main pieces in place, I suggest we
> write the documentation and an example about how to attach a custom
> indexer module.

Agreed

>
>>For example as Jukka mentions there is faceted navigation which is not
>>easy to expose over jcr nodes/rows. Another non faceted example would
>>be for auto-suggestion / completion : For example give me all the
>>terms for property 'bar' that start with 'fo'  : In this case, you'd
>>just like to be able to return a list of terms.
>
> How would security work for such cases? Because I currently assumed we can
> reuse the security features that are available for reading normal nodes.

Very good point. Also, faceted navigation, described in the other
search thread, should take security into account when counting: a
node should not be part of the faceted result (for example the count
of some property value occurrences) if the user is not allowed to read
the node.
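
As a rough illustration of security-aware facet counting, the sketch below counts the values of one property but skips every node the current user may not read. The class name SketchFacetCounter and the canRead predicate are hypothetical stand-ins for whatever access check Oak ends up providing; this is not an existing API.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Hypothetical facet counter that applies a read check per node before counting. */
class SketchFacetCounter {

    /**
     * Counts how often each property value occurs.
     *
     * @param valueByPath node path -> value of the faceted property
     * @param canRead     assumed access check for the current user
     */
    static Map<String, Integer> count(Map<String, String> valueByPath,
                                      Predicate<String> canRead) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> entry : valueByPath.entrySet()) {
            if (canRead.test(entry.getKey())) {
                // Only readable nodes contribute to the facet count.
                counts.merge(entry.getValue(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The cost is one access check per matching node, which is exactly the scalability concern: counts can no longer be read straight from the index.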

> If security is not required in this case, it might make sense if the
> fulltext search would return "pseudo-nodes" (nodes that are not stored in
> the microkernel, and are not part of access rights checking). Each "term"
> could be such a pseudo-node with a property "term". The problem I see with
> custom Row implementations is that joins between your fulltext
> implementation and regular nodes would not be possible.

Yes, joins are a problem I think. ACLs are a problem as well. I saw that
Nuxeo tries to solve this with a separate Solr core [1], but I am not sure
about the scalability of this effort. It also seems to need the
unreleased 4.0 version of Solr.

Regards Ard

[1] https://github.com/nuxeo/nuxeo-solr/tree/master/architecture

>
> Regards,
> Thomas
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Mon, Mar 26, 2012 at 2:24 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> On Mon, Mar 26, 2012 at 1:59 PM, Thomas Mueller <mu...@adobe.com> wrote:
>> Hi,
>>
>>>Currently, for Hippo, I am doing something
>>>similar for the query api, that can seamlessly delegate to Solr or
>
> Small correction : I am trying to get it on the agenda to really
> seamlessly integrate with it, as currently we have some projects with
> home-grown integration. I however did a simple 1 day poc some time ago
> which seemed pretty promising.
>
>>>jackrabbit, both returning a jcr node iterator (although the solr
>>>index through solrj can also return plain pojo's).
>>> I really like the
>>>first option (pre-commit example) and third (observation based), and
>>>still see many bears on the road for the second (full-text on
>>>post-commit)
>>
>> Would it make sense to implement this in Oak? Or do you prefer an external
>> project?
>
> You mean the Solr integration? If so, I think in the first place it is
> important that we try to make it simple to integrate with an external
> index. Being able to listen to a commit journal to index nodes like
> Jukka describes would help enormously already (and be able to do this
> in a clustered environment, such that only one single node picks up
> the journal). I am not sure if it can be made generic enough to be

Excuse me, the above might be confusing:

By ' I am not sure if it can be made generic enough' I mean the Solr /
Elastic Search integration, not the part about being able to listen to
the journal.

> part of oak (core). Perhaps an optional module? Also, although I am
> unfamiliar with the project, it might be that elastic search is a
> better match for oak than Solr : I just happen to mention Solr because
> we are planning to seamlessly integrate it into our site building
> stack (although I think they don't have a seamless integration, some
> DMS providers like Alfresco and Nuxeo, which used to be a fork off /
> based on JR in the past, also integrated with Solr lately )
>
>>
>>>I've one more question regarding the oak search/indexes : Will we be
>>>able to query that returns something else than jcr nodes/rows? I
>>>frequently want to be able to get a query result from the repository
>>>that cannot be returned as node iterators. For example query on stats,
>>>or a query for 'auto-completion' on some property (thus return some
>>>part of the TermEnum for example)
>>
>> Could you give a few concrete examples? What would such a query look like,
>> and what kind of data would it return?
>
> For example as Jukka mentions there is faceted navigation which is not
> easy to expose over jcr nodes/rows. Another non faceted example would
> be for auto-suggestion / completion : For example give me all the
> terms for property 'bar' that start with 'fo'  : In this case, you'd
> just like to be able to return a list of terms.
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>>Would it make sense to implement this in Oak? Or do you prefer an
>>external
>> project?
>
>You mean the Solr integration? If so, I think in the first place it is
>important that we try to make it simple to integrate with an external
>index.

I just think it would be simpler for everybody if we could discuss on the
concrete source code, instead of communicating only by email about
abstract ideas. It's just more efficient in my view.

>Being able to listen to a commit journal to index nodes like
>Jukka describes would help enormously already

The current mechanism used for the property index is a MicroKernel wrapper
implementation. Other solutions are possible of course.

>I am not sure if it can be made generic enough to be
>part of oak (core). Perhaps an optional module?

I think such discussions are more efficient if we talk about the actual
"source code". As soon as we have the main pieces in place, I suggest we
write the documentation and an example about how to attach a custom
indexer module.

>For example as Jukka mentions there is faceted navigation which is not
>easy to expose over jcr nodes/rows. Another non faceted example would
>be for auto-suggestion / completion : For example give me all the
>terms for property 'bar' that start with 'fo'  : In this case, you'd
>just like to be able to return a list of terms.

How would security work for such cases? Because I currently assumed we can
reuse the security features that are available for reading normal nodes.
If security is not required in this case, it might make sense if the
fulltext search would return "pseudo-nodes" (nodes that are not stored in
the microkernel, and are not part of access rights checking). Each "term"
could be such a pseudo-node with a property "term". The problem I see with
custom Row implementations is that joins between your fulltext
implementation and regular nodes would not be possible.
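
A small sketch of the "pseudo-node" idea for the auto-completion case, in plain Java. SketchTermIndex is a hypothetical class, not part of Oak or Lucene; it keeps the terms per property in sorted order and returns each matching term as a map with a single "term" property, without any access rights checking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical term dictionary: property name -> sorted set of indexed terms. */
class SketchTermIndex {
    private final Map<String, NavigableSet<String>> termsByProperty = new HashMap<>();

    void addTerm(String property, String term) {
        termsByProperty.computeIfAbsent(property, p -> new TreeSet<>()).add(term);
    }

    /** Returns one pseudo-node (a map with a single "term" property) per matching term. */
    List<Map<String, String>> termsStartingWith(String property, String prefix) {
        NavigableSet<String> terms = termsByProperty.getOrDefault(property, new TreeSet<>());
        List<Map<String, String>> rows = new ArrayList<>();
        // In sorted order, all terms sharing the prefix are contiguous and start at the prefix itself.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            rows.add(Collections.singletonMap("term", term));
        }
        return rows;
    }
}
```

For the example from the thread, termsStartingWith("bar", "fo") would list the terms of property 'bar' that start with 'fo'. Because the rows are not real nodes, they cannot be joined with regular nodes.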

Regards,
Thomas

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Mon, Mar 26, 2012 at 1:59 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>Currently, for Hippo, I am doing something
>>similar for the query api, that can seamlessly delegate to Solr or

Small correction : I am trying to get it on the agenda to really
seamlessly integrate with it, as currently we have some projects with
home-grown integration. I however did a simple 1 day poc some time ago
which seemed pretty promising.

>>jackrabbit, both returning a jcr node iterator (although the solr
>>index through solrj can also return plain pojo's).
>> I really like the
>>first option (pre-commit example) and third (observation based), and
>>still see many bears on the road for the second (full-text on
>>post-commit)
>
> Would it make sense to implement this in Oak? Or do you prefer an external
> project?

You mean the Solr integration? If so, I think in the first place it is
important that we try to make it simple to integrate with an external
index. Being able to listen to a commit journal to index nodes like
Jukka describes would help enormously already (and be able to do this
in a clustered environment, such that only one single node picks up
the journal). I am not sure if it can be made generic enough to be
part of oak (core). Perhaps an optional module? Also, although I am
unfamiliar with the project, it might be that Elasticsearch is a
better match for Oak than Solr: I just happened to mention Solr because
we are planning to seamlessly integrate it into our site building
stack (some DMS providers like Alfresco and Nuxeo, which used to be
a fork of / based on JR in the past, have also integrated with Solr
lately, although I think they don't have a seamless integration).

>
>>I've one more question regarding the oak search/indexes : Will we be
>>able to query that returns something else than jcr nodes/rows? I
>>frequently want to be able to get a query result from the repository
>>that cannot be returned as node iterators. For example query on stats,
>>or a query for 'auto-completion' on some property (thus return some
>>part of the TermEnum for example)
>
> Could you give a few concrete examples? What would such a query look like,
> and what kind of data would it return?

For example, as Jukka mentions, there is faceted navigation, which is not
easy to expose over JCR nodes/rows. Another, non-faceted example would
be auto-suggestion / completion: for example, give me all the
terms for property 'bar' that start with 'fo'. In this case, you'd
just like to be able to return a list of terms.

Regards Ard

>
> Regards,
> Thomas
>


Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Mar 26, 2012 at 1:14 PM, Ard Schrijvers
<a....@onehippo.com> wrote:
> I've one more question regarding the oak search/indexes : Will we be
> able to query that returns something else than jcr nodes/rows?

I think we need to, especially since we can't really do faceting if
all return values need to be tied to individual nodes.

I'm not yet sure what the exact query result abstraction in the Oak
API should look like. Ideas welcome.

The simple approach would be to just abstract the existing
.oak.query.Row class to a documented interface, but I'm not sure if
the current design covers all potential use cases.

BR,

Jukka Zitting

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>Currently, for Hippo, I am doing something
>similar for the query api, that can seamlessly delegate to Solr or
>jackrabbit, both returning a jcr node iterator (although the solr
>index through solrj can also return plain pojo's). I really like the
>first option (pre-commit example) and third (observation based), and
>still see many bears on the road for the second (full-text on
>post-commit)

Would it make sense to implement this in Oak? Or do you prefer an external
project?

>I've one more question regarding the oak search/indexes : Will we be
>able to query that returns something else than jcr nodes/rows? I
>frequently want to be able to get a query result from the repository
>that cannot be returned as node iterators. For example query on stats,
>or a query for 'auto-completion' on some property (thus return some
>part of the TermEnum for example)

Could you give a few concrete examples? What would such a query look like,
and what kind of data would it return?

Regards,
Thomas

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Mon, Mar 26, 2012 at 12:56 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> There's a number of points in this thread that I wanted to address, so
> instead of replying to them individually, let me try to summarize my
> thinking.
>
> One of the bigger pain points in the Jackrabbit 2.x architecture has
> been the query engine and the workspace-global query index that has
> been pretty difficult to customize for special needs and to handle in
> terms of backup/recovery and scaling to multiple cluster nodes. My
> wish for Oak is that we come up with a much more flexible search and
> indexing architecture that solves these issues and is easy to extend
> for any future use cases we may encounter.
>
> I think the biggest issue, as brought up by Alex and then elaborated
> by Ard, is the way we handle indexing. Instead of having a single,
> more or less fixed index for a repository like in Jackrabbit 2.x, Oak
> should provide generic extension points that various different kinds
> of indexing components could hook into. We should have at least three
> such extension points: pre- and post-commit hooks, and observation
> based on the commit journal.
>
> For example a low-level UUID-to-path index should preferably use the
> pre-commit hook for atomic index updates as a part of each commit. A
> post-commit hook could be used to trigger full-text extraction of
> nt:file binaries, a bit like we currently do in Jackrabbit 2.x. And an
> observation client could use the commit journal to feed an external
> Solr index for application-level index features. A given deployment
> can choose which ones of these and any other indexing components are
> needed based on relevant application needs and related
> performance/scalability overhead. A single solution does not fit all
> needs, so we need to make such customization as easy as possible.
>
> On the other hand there's a lot of value in having a single, unified
> query abstraction instead of having client applications reach out
> directly to Solr, Lucene, or custom indexes. Thus, in addition to the
> extensions points for indexing, we need a way for the indexing
> components to extend the Oak query engine with ways to evaluate given
> queries against the various configured indexes. This way all
> applications can use the same generic Oak query API (exposed through
> QueryManager in JCR, DASL in WebDAV, and/or something else in JSOP)
> while leveraging the custom indexes available in each deployment.

Thanks for this summary. I now really understand what the goals are
and how to achieve them. Especially the unified generic Oak query API is
something I really like. Currently, for Hippo, I am doing something
similar for the query API, which can seamlessly delegate to Solr or
Jackrabbit, both returning a JCR node iterator (although the Solr
index through SolrJ can also return plain POJOs). I really like the
first option (pre-commit example) and third (observation based), and
still see many bears on the road (obstacles) for the second (full-text
on post-commit).

I've one more question regarding the Oak search/indexes: will we be
able to run queries that return something other than JCR nodes/rows? I
frequently want to be able to get a query result from the repository
that cannot be returned as node iterators. For example a query on stats,
or a query for 'auto-completion' on some property (thus returning some
part of the TermEnum, for example).

Regards Ard

>
> BR,
>
> Jukka Zitting




Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

There's a number of points in this thread that I wanted to address, so
instead of replying to them individually, let me try to summarize my
thinking.

One of the bigger pain points in the Jackrabbit 2.x architecture has
been the query engine and the workspace-global query index that has
been pretty difficult to customize for special needs and to handle in
terms of backup/recovery and scaling to multiple cluster nodes. My
wish for Oak is that we come up with a much more flexible search and
indexing architecture that solves these issues and is easy to extend
for any future use cases we may encounter.

I think the biggest issue, as brought up by Alex and then elaborated
by Ard, is the way we handle indexing. Instead of having a single,
more or less fixed index for a repository like in Jackrabbit 2.x, Oak
should provide generic extension points that various different kinds
of indexing components could hook into. We should have at least three
such extension points: pre- and post-commit hooks, and observation
based on the commit journal.

For example a low-level UUID-to-path index should preferably use the
pre-commit hook for atomic index updates as a part of each commit. A
post-commit hook could be used to trigger full-text extraction of
nt:file binaries, a bit like we currently do in Jackrabbit 2.x. And an
observation client could use the commit journal to feed an external
Solr index for application-level index features. A given deployment
can choose which ones of these and any other indexing components are
needed based on relevant application needs and related
performance/scalability overhead. A single solution does not fit all
needs, so we need to make such customization as easy as possible.
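
The three extension points could be sketched roughly as follows. All names (SketchPreCommitHook, SketchPostCommitHook, SketchRepository, SketchJournalObserver) are hypothetical and the "repository" is a toy map from path to jcr:uuid; this is only meant to show where each kind of indexing component would hook in, not how Oak does or will implement it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical hook run inside the commit; its index updates are applied together with the changes. */
interface SketchPreCommitHook {
    void beforeCommit(Map<String, String> changes, Map<String, String> indexUpdates);
}

/** Hypothetical hook run after the commit has been applied. */
interface SketchPostCommitHook {
    void afterCommit(long revision, Map<String, String> changes);
}

/** Toy repository: content maps path -> jcr:uuid, index maps jcr:uuid -> path. */
class SketchRepository {
    final Map<String, String> content = new HashMap<>();
    final Map<String, String> index = new HashMap<>();
    private final List<Map<String, String>> journal = new ArrayList<>();
    private final List<SketchPreCommitHook> preHooks = new ArrayList<>();
    private final List<SketchPostCommitHook> postHooks = new ArrayList<>();

    void addPreCommitHook(SketchPreCommitHook hook) { preHooks.add(hook); }
    void addPostCommitHook(SketchPostCommitHook hook) { postHooks.add(hook); }

    synchronized long commit(Map<String, String> changes) {
        Map<String, String> indexUpdates = new HashMap<>();
        // An exception thrown here aborts the commit before anything is applied.
        for (SketchPreCommitHook hook : preHooks) {
            hook.beforeCommit(changes, indexUpdates);
        }
        content.putAll(changes);
        index.putAll(indexUpdates);
        journal.add(new HashMap<>(changes));
        long revision = journal.size();
        for (SketchPostCommitHook hook : postHooks) {
            hook.afterCommit(revision, changes);
        }
        return revision;
    }

    /** Journal entries committed after the given revision (0 = from the beginning). */
    synchronized List<Map<String, String>> journalSince(long revision) {
        return new ArrayList<>(journal.subList((int) revision, journal.size()));
    }
}

/** Hypothetical observation client that feeds an external index from the journal. */
class SketchJournalObserver {
    private long lastSeen = 0;
    final List<String> externalIndex = new ArrayList<>();

    void poll(SketchRepository repo) {
        for (Map<String, String> changes : repo.journalSince(lastSeen)) {
            externalIndex.addAll(new TreeSet<>(changes.keySet())); // sorted for stable order
            lastSeen++;
        }
    }
}
```

A UUID-to-path index would register a pre-commit hook such as (changes, indexUpdates) -> changes.forEach((path, uuid) -> indexUpdates.put(uuid, path)), a text extraction queue would register a post-commit hook, and an external Solr feeder would call poll() on its own schedule.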

On the other hand there's a lot of value in having a single, unified
query abstraction instead of having client applications reach out
directly to Solr, Lucene, or custom indexes. Thus, in addition to the
extensions points for indexing, we need a way for the indexing
components to extend the Oak query engine with ways to evaluate given
queries against the various configured indexes. This way all
applications can use the same generic Oak query API (exposed through
QueryManager in JCR, DASL in WebDAV, and/or something else in JSOP)
while leveraging the custom indexes available in each deployment.

BR,

Jukka Zitting

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Alex Parvulescu <al...@gmail.com>.

Hi,

>  It is not yet "wired" to

That touches on the original question. How can I wire in an indexing
strategy (lucene or not) using the current api?

alex


On Thu, Mar 22, 2012 at 9:36 AM, Thomas Mueller <mu...@adobe.com> wrote:

> Hi,
>
> >OAK-36 covers the Query implementation effort, but I'm wondering if now
> >would be a good time to mention indexing as well.
> >
> >We want to have dedicated indexes, I think that would be accomplished via
> >observation.
> >Any ideas about the availability of this feature?
>
> Sure. One such mechanism is implemented, and currently lives under
> org.apache.jackrabbit.mk.index. It is not yet "wired" to
> org.apache.jackrabbit.oak.query.index. This mechanism stores the index
> data in nodes and properties, as a tree (using just the MicroKernel API).
> This mechanism is supposed to be as scalable as the MicroKernel
> implementation (support concurrent writes if the MicroKernel
> implementation supports it).
>
> >The current index implementation just traverses the existing nodes (albeit
> >applying some path constraints first),
>
> Yes, that's org.apache.jackrabbit.oak.query.index.TraversingReader
>
> >This helps with testing the query parser & friends, but a lucene based
> >query engine needs events to update its data.
>
> Given the scalability requirements defined at [1] (especially concurrent,
> scalable writes in multiple cluster nodes) we plan to support other
> (non-Lucene) index mechanisms as well. Personally, I believe we should use
> Lucene for fulltext indexing, because that's what Lucene is meant for. But
> I'm not sure what a fully scalable fulltext index using Lucene would look
> like. That's still an open question we need to resolve, or else we need
> to define the limitations in this area.
>
> [1]:
> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
>
> Regards,
> Thomas
>
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>Is it the
>idea to also store the Lucene fulltext index in the repository itself?

This is not decided yet. It might make sense, to simplify storage. But for
clustering it might be problematic.

Regards,
Thomas

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Mon, Mar 26, 2012 at 12:27 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
> I'm not sure if you already saw that, but one of the goals for Oak is
> "Simple/Fast queries (i.e. through specialized indexes)" - see also
> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
> - what I understand by "specialized indexes" is user defined
> indexes. At runtime, the query engine will pick the best (user defined)
> index(es) for the given query (cost based).
>
> Currently, there is only one index type: traversing the nodes. A second
> index type is the indexer mechanism that stores the index data in the
> repository itself (this is already implemented and needs to be wired to
> the query engine). A third index type is the Lucene fulltext index. The
> query engine supports additional index types.
>
> And there might be additional index types. Implementing a new index type
> should be relatively easy. If you want to implement such an index type,
> then this could be either within Oak itself, or outside of Oak. In any
> case, the current plan (my plan) is to add an index service provider
> interface.
>
>>the 'more out-of-the-box functionality' does not imply a full-text
>>index is needed per se
>
> I believe we will need an out-of-the-box fulltext index as part of Oak. But
> whether or not it is actually used is up to you, because indexes are user
> defined.

Thanks for your feedback Thomas. One more quick question : Is it the
idea to also store the Lucene fulltext index in the repository itself?

Regards Ard

>
> Vocabulary:
>
> "index type": an index implementation, for example a property index
> implementation that allows quick lookup for a property. Another example is
> "traversing all nodes under a given path". Another implementation is a
> "Lucene fulltext index".
>
> "index": one particular index instance, for example a property index for
> the property "jcr:uuid" or the property "lastModified".
>
> Regards,
> Thomas
>
>
>




Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

I'm not sure if you already saw that, but one of the goals for Oak is
"Simple/Fast queries (i.e. through specialized indexes)" - see also
http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203
- what I understand by "specialized indexes" is user defined
indexes. At runtime, the query engine will pick the best (user defined)
index(es) for the given query (cost based).

Currently, there is only one index type: traversing the nodes. A second
index type is the indexer mechanism that stores the index data in the
repository itself (this is already implemented and needs to be wired to
the query engine). A third index type is the Lucene fulltext index. The
query engine supports additional index types.

And there might be additional index types. Implementing a new index type
should be relatively easy. If you want to implement such an index type,
then this could be either within Oak itself, or outside of Oak. In any
case, the current plan (my plan) is to add an index service provider
interface.
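
To illustrate what a cost-based choice between index types might look like, here is a minimal sketch. The interface SketchQueryIndex and the classes around it are hypothetical and are not the planned service provider interface; the cost numbers are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical SPI that each index type would implement. */
interface SketchQueryIndex {
    String name();

    /** Estimated cost of answering "property = value"; POSITIVE_INFINITY if unsupported. */
    double cost(String property, String value);
}

/** Index type "traverse all nodes": always applicable, cost grows with the node count. */
class SketchTraversingIndex implements SketchQueryIndex {
    private final long nodeCount;

    SketchTraversingIndex(long nodeCount) {
        this.nodeCount = nodeCount;
    }

    public String name() {
        return "traverse";
    }

    public double cost(String property, String value) {
        return nodeCount;
    }
}

/** One index instance of the "property index" type, defined for a single property. */
class SketchSinglePropertyIndex implements SketchQueryIndex {
    private final String property;

    SketchSinglePropertyIndex(String property) {
        this.property = property;
    }

    public String name() {
        return "property:" + property;
    }

    public double cost(String queriedProperty, String value) {
        // Assumed cost: a direct lookup is cheap, other properties cannot be served at all.
        return property.equals(queriedProperty) ? 1.0 : Double.POSITIVE_INFINITY;
    }
}

/** Picks the cheapest registered index for a given condition. */
class SketchIndexSelector {
    private final List<SketchQueryIndex> indexes = new ArrayList<>();

    void register(SketchQueryIndex index) {
        indexes.add(index);
    }

    SketchQueryIndex select(String property, String value) {
        SketchQueryIndex best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (SketchQueryIndex index : indexes) {
            double cost = index.cost(property, value);
            if (cost < bestCost) {
                best = index;
                bestCost = cost;
            }
        }
        if (best == null) {
            throw new IllegalStateException("no index can answer the query");
        }
        return best;
    }
}
```

With a traversing index and a property index on "jcr:uuid" registered, a condition on "jcr:uuid" would pick the property index, and a condition on any other property would fall back to traversal.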

>the 'more out-of-the-box functionality' does not imply a full-text
>index is needed per se

I believe we will need an out-of-the-box fulltext index as part of Oak. But
whether or not it is actually used is up to you, because indexes are user
defined.

Vocabulary:

"index type": an index implementation, for example a property index
implementation that allows quick lookup for a property. Another example is
"traversing all nodes under a given path". Another implementation is a
"Lucene fulltext index".

"index": one particular index instance, for example a property index for
the property "jcr:uuid" or the property "lastModified".
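
To make the vocabulary concrete, here is a small sketch assuming an
in-memory map as the backing store. The class PropertyIndex is
hypothetical and only illustrates the distinction: the class plays the
role of the "index type", each instance is one "index".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// An "index type": one implementation, here a property index.
// Hypothetical class for illustration, not the Oak implementation.
final class PropertyIndex {
    private final String propertyName;
    private final Map<String, List<String>> pathsByValue = new HashMap<>();

    // Each instance is one "index", bound to a single property name.
    PropertyIndex(String propertyName) {
        this.propertyName = propertyName;
    }

    String propertyName() {
        return propertyName;
    }

    // Records that the node at 'path' has the indexed property set to 'value'.
    void put(String value, String path) {
        pathsByValue.computeIfAbsent(value, v -> new ArrayList<>()).add(path);
    }

    // Quick lookup: all paths whose indexed property equals 'value'.
    List<String> lookup(String value) {
        return pathsByValue.getOrDefault(value, Collections.emptyList());
    }
}
```

Creating new PropertyIndex("jcr:uuid") and new
PropertyIndex("lastModified") gives two indexes of the same index type.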

Regards,
Thomas

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Fri, Mar 23, 2012 at 6:57 PM, Justin Edelson
<ju...@justinedelson.com> wrote:
> On Fri, Mar 23, 2012 at 5:40 AM, Ard Schrijvers
> <a....@onehippo.com>wrote:
>
>>
>> Although I am on thin ice here, I think there are hardly any noSQL
>> stores out there that actually include full text indexes.
>
> Yes, but the goal of Oak explicitly says "The implementation should provide
> more out-of-the-box functionality than typical NoSQL databases while
> achieving comparable levels of scalability and performance."

The 'more out-of-the-box functionality' does not imply that a full-text
index is needed per se. For example, hierarchy isn't part of most NoSQL
databases, so that is already more functionality.

I am just a bit skeptical about 'more x but the same y'. It seems to
me to be impossible to have a full text index and comparable levels of
scalability with NoSQL databases that are not concerned with hierarchy
or full text indexes. Giving up some performance for a hierarchy makes
much sense, because it is true added value.

>
>
>> I think we
>> shouldn't try to address it in the repository, but rather provide some
>> tooling to easily setup a (external) full text index (like plain
>> Lucene, or use Solr/Elastic search) according someones exact needs
>> (like, which analyzer to use for which part of the content, which
>> properties should be stored, which properties should be analyzed in
>> which ways, which properties are meant for TrieRanges,  etc etc)
>>
> I agree that for many use cases a separate index is appropriate. That
> doesn't obviate the need/appropriateness for an internal full-text search
> index.

I would/do agree only if the full-text search index doesn't impose
significant performance constraints, higher memory consumption, or
scalability issues. Imho, the price for full text indexes is way too
high, while I still doubt their usability in the end.

Regards Ard

>
> Justin
>
>
>>
>> Regards Ard
>>
>> >
>> > [1]:
>> >
>> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrab
>> > bit%203
>> >
>> > Regards,
>> > Thomas
>> >
>>


Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Justin Edelson <ju...@justinedelson.com>.

On Fri, Mar 23, 2012 at 5:40 AM, Ard Schrijvers
<a....@onehippo.com>wrote:

>
> Although I am on thin ice here, I think there are hardly any noSQL
> stores out there that actually include full text indexes.

Yes, but the goal of Oak explicitly says "The implementation should provide
more out-of-the-box functionality than typical NoSQL databases while
achieving comparable levels of scalability and performance."



> I think we
> shouldn't try to address it in the repository, but rather provide some
> tooling to easily setup a (external) full text index (like plain
> Lucene, or use Solr/Elastic search) according someones exact needs
> (like, which analyzer to use for which part of the content, which
> properties should be stored, which properties should be analyzed in
> which ways, which properties are meant for TrieRanges,  etc etc)
>
I agree that for many use cases a separate index is appropriate. That
doesn't obviate the need/appropriateness for an internal full-text search
index.

Justin


>
> Regards Ard
>
> >
> > [1]:
> >
> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrab
> > bit%203
> >
> > Regards,
> > Thomas
> >
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Ard Schrijvers <a....@onehippo.com>.

On Thu, Mar 22, 2012 at 9:36 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>OAK-36 covers the Query implementation effort, but I'm wondering if now
>>would be a good time to mention indexing as well.
>>
>>We want to have dedicated indexes, I think that would be accomplished via
>>observation.
>>Any ideas about the availability of this feature?
>
> Sure. One such a mechanism is implemented, and currently lives under
> org.apache.jackrabbit.mk.index. It is not yet "wired" to
> org.apache.jackrabbit.oak.query.index. This mechanism stores the index
> data in nodes and properties, as a tree (using just the MicroKernel API).
> This mechanism is supposed to be as scalable as the MicroKernel
> implementation (support concurrent writes if the MicroKernel
> implementation supports it).
>
>>The current index implementation just traverses the existing nodes (albeit
>>applying some path constraints first),
>
> Yes, that's org.apache.jackrabbit.oak.query.index.TraversingReader
>
>>This helps with testing the query parser & friends, but a lucene based
>>query engine needs events to update its data.
>
> Given the scalability requirements defined at [1] (specially concurrent,
> scalable writes in multiple cluster nodes) we plan to support other
> (non-Lucene) index mechanisms as well. Personally, I believe we should use
> Lucene for fulltext indexing, because that's what Lucene is meant for. But
> I'm not sure how a fully scalable fulltext index using Lucene would look
> like. That's still an open question we need to resolve, or define the
> limitations in this area.

I'd opt for not implementing a fulltext search index at all in the
repository, but rather have some good places to hook in an 'external'
index. I should have written up my/our (Hippo) use cases in a mail
before, but never got to it. I've come to believe that free text
search / full text indexing is too domain specific to be captured in a
generic one-size-fits-all solution. Imo, full text indexing is very much
related to how your 'domain model' is mapped to jcr nodes. A generic
repository full text index will index jcr nodes, while, for example at
Hippo, we are interested in indexing 'documents': a document can be
some small bonsai tree of nodes. I know there have been attempts
at indexing_configuration kind of tuning, but, imho, it just does not
work that well.
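
To illustrate the difference between indexing nodes and indexing
'documents', here is a rough sketch that flattens a small tree of nodes
into a single piece of text for an external full text index. DocNode and
DocumentAggregator are hypothetical names, not JCR or Oak classes, and a
real setup would also decide per property which analyzer to use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a JCR node; not the JCR or Oak API.
final class DocNode {
    final String name;
    final Map<String, String> stringProperties = new LinkedHashMap<>();
    final List<DocNode> children = new ArrayList<>();

    DocNode(String name) {
        this.name = name;
    }

    DocNode with(String property, String value) {
        stringProperties.put(property, value);
        return this;
    }

    DocNode child(DocNode node) {
        children.add(node);
        return this;
    }
}

final class DocumentAggregator {
    // Collects the text of the whole subtree so that an external full text
    // index can store one entry per document instead of one per node.
    static String aggregateText(DocNode documentRoot) {
        List<String> parts = new ArrayList<>();
        collect(documentRoot, parts);
        return String.join(" ", parts);
    }

    // Depth-first walk: this node's string properties first, then its children.
    private static void collect(DocNode node, List<String> parts) {
        parts.addAll(node.stringProperties.values());
        for (DocNode child : node.children) {
            collect(child, parts);
        }
    }
}
```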

Also, the jr indexes are quite inefficient in general: in our case,
for just a couple of hundred thousand documents, the number of
jcr nodes easily exceeds many millions, so the (Lucene / full text)
indexes are much bigger than needed. For the current jr 2 indexes, it
is also the case that pretty much every string property gets stored in
the index as well, to do an 'equals'. If, for oak, the equality checks
are done against a different (node) index instead of Lucene, it will
be very hard to combine the results.

Although I am on thin ice here, I think there are hardly any NoSQL
stores out there that actually include full text indexes. I think we
shouldn't try to address it in the repository, but rather provide some
tooling to easily set up an (external) full text index (like plain
Lucene, or Solr/Elasticsearch) according to someone's exact needs
(like which analyzer to use for which part of the content, which
properties should be stored, which properties should be analyzed in
which ways, which properties are meant for TrieRanges, etc.)

Regards Ard

>
> [1]:
> http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrab
> bit%203
>
> Regards,
> Thomas
>

Re: Re (OAK-36) Implement a query parser - what about indexing?

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

>OAK-36 covers the Query implementation effort, but I'm wondering if now
>would be a good time to mention indexing as well.
>
>We want to have dedicated indexes, I think that would be accomplished via
>observation.
>Any ideas about the availability of this feature?

Sure. One such mechanism is implemented, and currently lives under
org.apache.jackrabbit.mk.index. It is not yet "wired" to
org.apache.jackrabbit.oak.query.index. This mechanism stores the index
data in nodes and properties, as a tree (using just the MicroKernel API).
This mechanism is supposed to be as scalable as the MicroKernel
implementation (it supports concurrent writes if the MicroKernel
implementation supports them).
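
A simplified sketch of the idea of keeping index data in nodes and
properties. This is not the code in org.apache.jackrabbit.mk.index: the
TreeNode class is an in-memory stand-in for repository nodes, and the
layout /index/<property>/<value> with one property per indexed path is
an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory stand-in for a repository node (hypothetical, not the MicroKernel API).
final class TreeNode {
    final Map<String, TreeNode> children = new TreeMap<>();
    final Map<String, String> properties = new TreeMap<>();

    // Returns the named child, creating it if it does not exist yet.
    TreeNode child(String name) {
        return children.computeIfAbsent(name, n -> new TreeNode());
    }
}

final class NodeStoredIndex {
    // The node at /index/<propertyName> (assumed layout).
    private final TreeNode indexRoot;

    NodeStoredIndex(TreeNode root, String propertyName) {
        this.indexRoot = root.child("index").child(propertyName);
    }

    // Records that the node at 'path' has the indexed property set to 'value':
    // one child node per value, one property per matching path.
    void add(String value, String path) {
        indexRoot.child(value).properties.put(path, "");
    }

    // Removes an entry, dropping the value node once it is empty.
    void remove(String value, String path) {
        TreeNode valueNode = indexRoot.children.get(value);
        if (valueNode != null) {
            valueNode.properties.remove(path);
            if (valueNode.properties.isEmpty()) {
                indexRoot.children.remove(value);
            }
        }
    }

    // Returns the paths recorded for 'value', in sorted order.
    List<String> lookup(String value) {
        TreeNode valueNode = indexRoot.children.get(value);
        return valueNode == null
                ? new ArrayList<>()
                : new ArrayList<>(valueNode.properties.keySet());
    }
}
```

Because the index is ordinary content, it is written through the same
storage as everything else, which is why its scalability would follow
that of the underlying implementation.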

>The current index implementation just traverses the existing nodes (albeit
>applying some path constraints first),

Yes, that's org.apache.jackrabbit.oak.query.index.TraversingReader
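
Roughly, traversal with a path constraint works like the sketch below.
It is not the TraversingReader code; ContentNode and TraversingLookup
are hypothetical names, and the node at the constraint path itself is
treated as a match candidate, which is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory node with an absolute path; not the Oak API.
final class ContentNode {
    final String path;
    final Map<String, String> properties = new HashMap<>();
    final List<ContentNode> children = new ArrayList<>();

    ContentNode(String path) {
        this.path = path;
    }

    ContentNode set(String property, String value) {
        properties.put(property, value);
        return this;
    }

    ContentNode add(ContentNode child) {
        children.add(child);
        return this;
    }
}

final class TraversingLookup {
    // Returns the paths of all nodes at or below 'underPath' whose property equals 'value'.
    static List<String> find(ContentNode root, String underPath, String property, String value) {
        List<String> result = new ArrayList<>();
        collect(root, underPath, property, value, result);
        return result;
    }

    private static void collect(ContentNode node, String underPath, String property,
                                String value, List<String> result) {
        boolean inside = isSameOrDescendant(node.path, underPath);
        boolean onTheWay = isSameOrDescendant(underPath, node.path);
        if (!inside && !onTheWay) {
            return; // path constraint applied first: skip unrelated subtrees
        }
        if (inside && value.equals(node.properties.get(property))) {
            result.add(node.path);
        }
        for (ContentNode child : node.children) {
            collect(child, underPath, property, value, result);
        }
    }

    // True if 'path' equals 'ancestor' or lies below it.
    private static boolean isSameOrDescendant(String path, String ancestor) {
        if (ancestor.equals("/")) {
            return true;
        }
        return path.equals(ancestor) || path.startsWith(ancestor + "/");
    }
}
```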

>This helps with testing the query parser & friends, but a lucene based
>query engine needs events to update its data.

Given the scalability requirements defined at [1] (especially concurrent,
scalable writes in multiple cluster nodes) we plan to support other
(non-Lucene) index mechanisms as well. Personally, I believe we should use
Lucene for fulltext indexing, because that's what Lucene is meant for. But
I'm not sure what a fully scalable fulltext index using Lucene would look
like. That's still an open question we need to resolve, or else we need to
define the limitations in this area.

[1]:
http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203

Regards,
Thomas