Posted to oak-dev@jackrabbit.apache.org by Lukas Kahwe Smith <ml...@pooteeweet.org> on 2012/03/24 15:12:48 UTC

full text search improvements

Hi,

I am not a Jackrabbit developer but a very interested user and co-lead of the PHPCR [1] initiative.
I wanted to expand on what Ard said about potentially hooking in Solr/ElasticSearch [2], and also raise some other issues I see with full text search in Jackrabbit 2.x.

1) scaling

First up, I am overall quite happy with the scalability of Jackrabbit 2.x.
There are, however, two places where at some point we need to support sharding: the persistence manager (which seems to be covered in the current Oak plans) and the Lucene index (which doesn't seem to be covered). Imho there are already two perfectly fine projects working on this: Solr (the more natural choice since it's also an Apache project) and ElasticSearch (imho it provides a much better API).

Moreover, (optionally) leveraging these has several other advantages:
- mature products (especially ElasticSearch is very mature when it comes to sharding), supporting them might also attract new users to Jackrabbit
- handle much larger data sets via sharding
- provide many more full text search specific features
- less pressure on Jackrabbit to support these features [3] [4]
- as these are both Lucene-based, the amount of code needed (for example to convert QOM to Solr/ElasticSearch queries) will be minimal

---

2) faceting

I mentioned faceting [4] above. Right now Jackrabbit does not even support COUNT() [5], which I find very painful and a major oversight. But what people have really come to expect from full text search is faceting. Imho it's so important that it should even be part of JCR 2.1 [6], and as you can see in this link it seems like HippoCMS developers agree that it's a very useful feature to have inside Jackrabbit.
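As a rough illustration of what faceting asks of a search layer, the sketch below counts how many hits carry each value of a field. It is not Jackrabbit, Solr or ElasticSearch code; the `facet_counts` helper and the sample hits are hypothetical and only show the shape of the result.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits carry each value of `field`.

    A multi-valued field contributes one count per distinct value,
    so a hit is never counted twice for the same facet value.
    """
    counts = Counter()
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue
        values = value if isinstance(value, (list, tuple, set)) else [value]
        counts.update(set(values))
    return counts


# Hypothetical search hits, standing in for a query result.
hits = [
    {"title": "Oak roadmap", "author": "jukka", "tags": ["oak", "jcr"]},
    {"title": "PHPCR intro", "author": "lukas", "tags": ["phpcr", "jcr"]},
    {"title": "Faceted navigation", "author": "ard", "tags": ["search"]},
    {"title": "Query engine", "author": "thomas", "tags": ["oak", "search"]},
]

tag_facet = facet_counts(hits, "tags")
author_facet = facet_counts(hits, "author")
```

A plain COUNT() is the degenerate case of the same idea: one bucket holding the total number of hits.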

---

3) "cleaner" data in results

This is actually a fairly trivial issue but with severe implications for scalability. As Ard explained, in many cases "a document" will span many nodes. When dealing with such a "document" (especially when building overview pages for a collection of documents) it's not always necessary to get the entire tree of nodes; all that is needed are some fields. For this the full text search API could provide a much faster retrieval mechanism. However, we have found that the data stored inside the Lucene index is not the original data. It probably makes sense to store only the tokenized version to limit the impact of the issue noted in 1), but the fact that the same separator is used for spaces and multi value fields [7] makes it needlessly hard in many cases to simply leverage the full text search API to fetch subsets of data from a tree of nodes.
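The separator problem can be shown with a small sketch. It assumes, purely for illustration, that values are joined with a single space; the actual separator and storage format in Jackrabbit's index are not stated here, and the helper names are hypothetical.

```python
SEPARATOR = " "  # assumption: stands in for whatever separator the index uses


def flatten(values):
    """Join a multi-valued field into one stored string."""
    return SEPARATOR.join(values)


# Three different multi-valued fields collapse to the same stored string,
# so the original values cannot be recovered from the index alone.
a = flatten(["New York", "Boston"])
b = flatten(["New", "York Boston"])
c = flatten(["New York Boston"])

UNIT_SEPARATOR = "\x1f"  # a character that is unlikely to occur inside a value


def flatten_unambiguous(values):
    """Join values with a separator that differs from the in-value space."""
    return UNIT_SEPARATOR.join(values)


def unflatten(stored):
    """Split a stored string back into its values."""
    return stored.split(UNIT_SEPARATOR) if stored else []


restored = unflatten(flatten_unambiguous(["New York", "Boston"]))
```

With a distinct separator the stored field can be split back into its values, which is what would make fetching a few fields straight from the index practical.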

---

4) cover more SQL2 functions

This is a comparatively minor topic and might just be beyond the scope of this mailing list, which seems to be more about designing the future architecture than "minor" feature requests. But it would be great to also support PATH(), DEPTH() etc. [8].

---

One last comment: I hope that all of you see the potential for growing Jackrabbit's user base through the existence of PHPCR. Suddenly it becomes a highly scalable database for the entire PHP CMS community. As a matter of fact, at DrupalCon Denver this week Drupal tentatively agreed to migrate their storage API to PHPCR. This doesn't even need to be limited to PHP: PHPCR just proved that JCR isn't as language specific as many proponents of CMIS make it out to be. Heck, there is even someone who started to port JCR to Node.js [9] (well, it's not very active, but hey).

My point here is: when thinking about Oak, please also think about the performance for users talking to Jackrabbit via HTTP. The PHPCR team has done its best to solve quite a few performance issues with the current HTTP API, but it would be great if this was really on everyone's mind.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

[1] http://phpcr.github.com
[2] http://www.mail-archive.com/oak-dev@jackrabbit.apache.org/msg00337.html
[3] https://issues.apache.org/jira/browse/JCR-3204
[4] https://issues.apache.org/jira/browse/JCR-3134
[5] https://issues.apache.org/jira/browse/JCR-2605
[6] http://java.net/projects/jsr-333/lists/dev/archive/2011-12/message/3
[7] https://issues.apache.org/jira/browse/JCR-3028
[8] https://issues.apache.org/jira/browse/JCR-3145
[9] https://github.com/NoCR/NoCR

Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 4:55 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>I haven't looked at / tested JCR joins : I just can't imagine that it
>>scales enough, but perhaps this is more related to my 'Lucene 1.4
>>experience'  :)
>
> Lucene 1.4?

That's when I first used Lucene, don't worry :)  However note, *many*
of the current jackrabbit 2 search implementation designs still stem
from the shortcomings of the early Lucene 1.4 version! For example
that all properties are indexed in a single Lucene field, or that
there is a hierarchy of Lucene indexes (there was no 'reopen' of an
index reader back then)

>
> For Oak, joins should perform well (I guess with 'scale' you mean

I meant the joins in jackrabbit 2 : They are implemented in Lucene
afaik, and I cannot imagine those to perform very well for millions of
nodes. However, I did not test them so I might be wrong

For the current oak implementation, I cannot judge the performance of
joins at all. With scale I indeed mean performance, but then
specifically whether the performance scales.

> 'perform'). Currently only nested loop joins are implemented (this is what
> relational databases use most of the time). If this turns out to be a
> problem, we might want to implement other join algorithms (block-nested
> loop join, hash join, merge join). But first let's see if it really is a
> problem.
>
>>I am not sure if it would be an issue for oak, but for jr 1 and 2, we
>>build up jcr session keeping virtual node states in memory : This can
>>grow too large, and it is not easy to limit.
>
> OK I see. With "virtual nodes" I was thinking about temporary nodes that
> only exist while iterating over the query result. But this is something I
> will keep in mind. I'm sure we will find a good solution.
>
>>but I think it is all much easier if we
>>expose faceting not over a node structure. Perhaps a row structure,
>>where some 'row' do not have a backing jcr node?
>
> It's hard to say right now, I think we should postpone talking about the
> implementation details until we have all the pieces and a good test case.

Yes, agreed

Regards Ard

>
> Regards,
> Thomas
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 4:14 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>
> On Mar 26, 2012, at 10:10 , Ard Schrijvers wrote:
>
>> I am not sure if it would be an issue for oak, but for jr 1 and 2, we
>> build up jcr session keeping virtual node states in memory : This can
>> grow too large, and it is not easy to limit. Also, since we have many
>> millions of jcr nodes while only a couple of hundred thousand
>> documents in general, the built-in faceted navigation is too cpu
>> demanding.
>
> so how do users currently "define" such virtual nodes?
> are they simply a specialized node type in which one stores the faceting query as a property?
>
> or are they ad hoc like SQL2 queries?

A bit of both :)

We, developers, in general define some node type, containing some
'predefined' faceted navigation node below which the faceted
navigation virtual structure is populated (like which facets, which is
the node scope to compute faceted navigation from, which ranges, etc
etc) : See [1] for an elaborate overview

In an ad hoc version, you can inject xpath or sql queries in the
predefined faceted navigation, which works as an extra filter, see [1]
 @ 'XPath queries as in jsr-170' . Certainly for this, we had to hook
into some jcr parts which are not desirable.

See [2] for an example in action. the 'faceted' in the url is actually
some jcr node : Traversing into the facets is just traversing the
backing virtual hierarchical node structure

Again, note that with hindsight I have doubts about the solution we
chose (not for small sites, but when the number of jcr nodes grows to
many millions there are performance drawbacks)

Regards Ard

[1] https://wiki.onehippo.com/display/CMS7/Faceted+Navigation+Configuration
[2] http://www.demo.onehippo.com/news/faceted

>
> i can see a use case for both, though the latter is more important to me than the former
>
>> Another disadvantage imo of our current 'seamless' integration of
>> exposing faceted navigation over virtual layers, is that you cannot
>> write to these nodes : Some virtual nodes don't even have a canonical
>> equivalent. This makes the virtual structure also less obvious to use
> for third parties
>
> well that's to be expected from results of aggregation.
>
> regards,
> Lukas Kahwe Smith
> mls@pooteeweet.org
>
>
>

Re: full text search improvements

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On Mar 26, 2012, at 10:10 , Ard Schrijvers wrote:

> I am not sure if it would be an issue for oak, but for jr 1 and 2, we
> build up jcr session keeping virtual node states in memory : This can
> grow too large, and it is not easy to limit. Also, since we have many
> millions of jcr nodes while only a couple of hundred thousand
> documents in general, the built-in faceted navigation is too cpu
> demanding.

so how do users currently "define" such virtual nodes?
are they simply a specialized node type in which one stores the faceting query as a property?

or are they ad hoc like SQL2 queries?

i can see a use case for both, though the latter is more important to me than the former

> Another disadvantage imo of our current 'seamless' integration of
> exposing faceted navigation over virtual layers, is that you cannot
> write to these nodes : Some virtual nodes don't even have a canonical
> equivalent. This makes the virtual structure also less obvious to use
> for third parties

well that's to be expected from results of aggregation.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: full text search improvements

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>I haven't looked at / tested JCR joins : I just can't imagine that it
>scales enough, but perhaps this is more related to my 'Lucene 1.4
>experience'  :)

Lucene 1.4? 

For Oak, joins should perform well (I guess with 'scale' you mean
'perform'). Currently only nested loop joins are implemented (this is what
relational databases use most of the time). If this turns out to be a
problem, we might want to implement other join algorithms (block-nested
loop join, hash join, merge join). But first let's see if it really is a
problem.
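To make the join algorithms mentioned above concrete, here is a minimal sketch of a nested loop join next to a hash join over in-memory rows. It is not Oak's implementation; the function names and sample rows are assumptions chosen to mirror the document/author example from this thread.

```python
def nested_loop_join(left, right, left_key, right_key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]


def hash_join(left, right, left_key, right_key):
    """Build a lookup table on the right side once, then probe it per left row."""
    buckets = {}
    for r in right:
        buckets.setdefault(r[right_key], []).append(r)
    return [(l, r) for l in left for r in buckets.get(l[left_key], [])]


# Hypothetical rows: documents reference authors by id.
documents = [
    {"title": "Faceting", "authorId": 1},
    {"title": "Sharding", "authorId": 2},
    {"title": "Orphan", "authorId": 9},  # no matching author, dropped by an inner join
]
authors = [
    {"id": 1, "name": "Ard"},
    {"id": 2, "name": "Lukas"},
]

nested = nested_loop_join(documents, authors, "authorId", "id")
hashed = hash_join(documents, authors, "authorId", "id")
pairs = [(d["title"], a["name"]) for d, a in nested]
```

Both functions return the same pairs; the difference is only in how much work is done per left row, which is where the question of whether performance "scales" comes in.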

>I am not sure if it would be an issue for oak, but for jr 1 and 2, we
>build up jcr session keeping virtual node states in memory : This can
>grow too large, and it is not easy to limit.

OK I see. With "virtual nodes" I was thinking about temporary nodes that
only exist while iterating over the query result. But this is something I
will keep in mind. I'm sure we will find a good solution.

>but I think it is all much easier if we
>expose faceting not over a node structure. Perhaps a row structure,
>where some 'row' do not have a backing jcr node?

It's hard to say right now, I think we should postpone talking about the
implementation details until we have all the pieces and a good test case.

Regards,
Thomas


Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 3:54 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>What our customers also want, is to be able to query on what a
>>document for the end-user (customer) is : Some customers have the
>>author of a document being some 'author node' referenced by the
>>'document node' : Now, by the author's name, you do not find the
>>document, because the authors name is stored somewhere else.
>
> This sounds like a join to me, like:
>
>    select * from document d inner join author a on a.id = d.authorId
>
> I would expect the JCR SQL-2 query to look similar.

I haven't looked at / tested JCR joins : I just can't imagine that it
scales enough, but perhaps this is more related to my 'Lucene 1.4
experience'  :)

>
>>Are there plans to also have some ocm mapping for jr3?
>
> Not directly, that is, not within oak-jcr, oak-core, and oak-mk.
>
>> It might make
>>sense, to be able to create external indexes by annotating ocm beans
>
> I don't think oak-core should depend on OCM.

No, agreed

> But your index implementation
> (should we call it "query index"?) could use OCM, and the query engine
> could be configured to use your index implementation.

Yes, that's pretty much how I'd hope it could work

>
>>Indexes can be a bit out of sync, when some reference node changes
>>(think about a changing author name), but imo acceptable for full text
>>indexes
>
> Yes, I think fulltext search doesn't need to be real-time.
>
>>We exposed it over virtual layers, but, during
>>the past years, performance and memory wise, I've switched my opinion
>>that I'd rather opt for not having faceted navigation exposed as
>>virtual nodes.
>
> Are virtual nodes a performance / memory problem? I don't see why this
> should be the case for Oak. But if it turns out that regular nodes are
> simpler, then maybe you should create regular nodes... Those could be
> maintained by your index implementation. For example, one node for each
> "fulltext search term".

I am not sure if it would be an issue for oak, but for jr 1 and 2, we
build up jcr session keeping virtual node states in memory : This can
grow too large, and it is not easy to limit. Also, since we have many
millions of jcr nodes while only a couple of hundred thousand
documents in general, the built-in faceted navigation is too cpu
demanding.

Another disadvantage imo of our current 'seamless' integration of
exposing faceted navigation over virtual layers, is that you cannot
write to these nodes : Some virtual nodes don't even have a canonical
equivalent. This makes the virtual structure also less obvious to use
for third parties

Either way, it might be more a problem of our current technical
implementation than of oak. But I think it is all much easier if we
expose faceting not over a node structure. Perhaps a row structure,
where some 'row' do not have a backing jcr node?

Regards Ard

>
> Regards,
> Thomas
>




Re: full text search improvements

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On Mar 26, 2012, at 09:40 , Ard Schrijvers wrote:

> On Sat, Mar 24, 2012 at 3:12 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>> Hi,
>> 
>> I am not a Jackrabbit developer but a very interested user and co-lead of the PHPCR [1] initiative.
>> I wanted to expand on what Ard said about potentially hooking in Solr/ElasticSearch [2], and also raise some other issues I see with full text search in Jackrabbit 2.x.
>> 
>> 1) scaling
>> 
>> Now first up I am overall quite happy with the scalability of Jackrabbit 2.x.
>> Obviously there are two places though where at some point we need to support sharding and that is the persistence manager (which seems to be covered in the current Oak plans) and the Lucene index (which doesn't seem to be covered). Now imho there are already two perfectly fine projects working on this with Solr (the more natural choice since it's also an Apache project) and ElasticSearch (imho it provides a much better API).
>> 
>> Moreover, (optionally) leveraging these has several other advantages:
>> - mature products (especially ElasticSearch is very mature when it comes to sharding), supporting them might also attract new users to Jackrabbit
>> - handle much larger data sets via sharding
>> - provide many more full text search specific features
> 
> What our customers also want, is to be able to query on what a
> document for the end-user (customer) is : Some customers have the
> author of a document being some 'author node' referenced by the
> 'document node' : Now, by the author's name, you do not find the
> document, because the authors name is stored somewhere else.

well you can already do this via a JOIN .. but I guess you are asking to be able to do some more denormalization during the indexing process for better performance.

(somewhat off topic, but we have this use case in our current application and we are concerned that some "meta authors" might lead to too many such references .. not sure if addressing this is part of Oak .. so right now we "partition" the referrers by date, which is ok but a bit annoying)

>> 2) faceting
>> 
>> Now I mentioned faceting [4] above. Right now Jackrabbit does not even support COUNT() [5], which I find very painful and a major oversight. But really what people have come to expect from full text search is faceting. Imho it's so important that it should even be part of JCR 2.1 [6] and as you can see in this link it seems like HippoCMS developers agree that it's a very useful feature to have inside Jackrabbit.
> 
> Yes, useful, but with hindsight, I wouldn't go for a seamless
> integration any more : We exposed it over virtual layers, but, during
> the past years, performance and memory wise, I've switched my opinion
> that I'd rather opt for not having faceted navigation exposed as
> virtual nodes. Still, being able to query the content over faceted
> navigation is desired by almost all customers


ok interesting.
does your current solution include support for ACLs?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: full text search improvements

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>What our customers also want, is to be able to query on what a
>document for the end-user (customer) is : Some customers have the
>author of a document being some 'author node' referenced by the
>'document node' : Now, by the author's name, you do not find the
>document, because the authors name is stored somewhere else.

This sounds like a join to me, like:

    select * from document d inner join author a on a.id = d.authorId

I would expect the JCR SQL-2 query to look similar.

>Are there plans to also have some ocm mapping for jr3?

Not directly, that is, not within oak-jcr, oak-core, and oak-mk.

> It might make
>sense, to be able to create external indexes by annotating ocm beans

I don't think oak-core should depend on OCM. But your index implementation
(should we call it "query index"?) could use OCM, and the query engine
could be configured to use your index implementation.

>Indexes can be a bit out of sync, when some reference node changes
>(think about a changing author name), but imo acceptable for full text
>indexes

Yes, I think fulltext search doesn't need to be real-time.

>We exposed it over virtual layers, but, during
>the past years, performance and memory wise, I've switched my opinion
>that I'd rather opt for not having faceted navigation exposed as
>virtual nodes. 

Are virtual nodes a performance / memory problem? I don't see why this
should be the case for Oak. But if it turns out that regular nodes are
simpler, then maybe you should create regular nodes... Those could be
maintained by your index implementation. For example, one node for each
"fulltext search term".

Regards,
Thomas


Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Sat, Mar 24, 2012 at 3:12 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> Hi,
>
> I am not a Jackrabbit developer but a very interested user and co-lead of the PHPCR [1] initiative.
> I wanted to expand on what Ard said about potentially hooking in Solr/ElasticSearch [2], and also raise some other issues I see with full text search in Jackrabbit 2.x.
>
> 1) scaling
>
> Now first up I am overall quite happy with the scalability of Jackrabbit 2.x.
> Obviously there are two places though where at some point we need to support sharding and that is the persistence manager (which seems to be covered in the current Oak plans) and the Lucene index (which doesn't seem to be covered). Now imho there are already two perfectly fine projects working on this with Solr (the more natural choice since it's also an Apache project) and ElasticSearch (imho it provides a much better API).
>
> Moreover, (optionally) leveraging these has several other advantages:
> - mature products (especially ElasticSearch is very mature when it comes to sharding), supporting them might also attract new users to Jackrabbit
> - handle much larger data sets via sharding
> - provide many more full text search specific features

What our customers also want, is to be able to query on what a
document for the end-user (customer) is : Some customers have the
author of a document being some 'author node' referenced by the
'document node' : Now, by the author's name, you do not find the
document, because the authors name is stored somewhere else.

Are there plans to also have some ocm mapping for jr3? It might make
sense, to be able to create external indexes by annotating ocm beans :
This way, you also have the api for the search result, as it will just
return the ocm POJOs : This is actually the approach I want to take
for the content beans we have (where a developer can specify through
annotations how to index).

Indexes can be a bit out of sync, when some reference node changes
(think about a changing author name), but imo acceptable for full text
indexes
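One way to read the author example above is as denormalization at indexing time: the referenced author's name is copied into the document's index entry. The sketch below shows that idea and the resulting staleness when the author node changes. It is an illustration only; the node layout, `author_ref` property and helper names are hypothetical and not part of Jackrabbit, Oak or OCM.

```python
def build_index_entry(document, nodes_by_id):
    """Index a document together with the name of the author node it references."""
    tokens = document["title"].lower().split()
    author = nodes_by_id.get(document.get("author_ref"))
    if author is not None:
        # Denormalize: copy the author's name into the document's entry.
        tokens += author["name"].lower().split()
    return {"path": document["path"], "tokens": set(tokens)}


def search(entries, term):
    """Return the paths of all entries that contain the term."""
    return [e["path"] for e in entries if term.lower() in e["tokens"]]


# Hypothetical content: one document referencing one author node.
nodes_by_id = {"a1": {"name": "Jane Doe"}}
document = {"path": "/content/docs/report", "title": "Annual report", "author_ref": "a1"}

index = [build_index_entry(document, nodes_by_id)]
found_by_author = search(index, "doe")

# The author is renamed, but the document itself did not change,
# so its index entry is stale until it is rebuilt.
nodes_by_id["a1"]["name"] = "Jane Smith"
stale_hit = search(index, "doe")
missed = search(index, "smith")

index = [build_index_entry(document, nodes_by_id)]
fresh_hit = search(index, "smith")
```

The window between the rename and the rebuild is the "bit out of sync" state that is argued to be acceptable for full text indexes.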

> - less pressure on Jackrabbit to support these features [3] [4]
> - as these are both Lucene-based, the amount of code needed (for example to convert QOM to Solr/ElasticSearch queries) will be minimal
>
> ---
>
> 2) faceting
>
> Now I mentioned faceting [4] above. Right now Jackrabbit does not even support COUNT() [5], which I find very painful and a major oversight. But really what people have come to expect from full text search is faceting. Imho it's so important that it should even be part of JCR 2.1 [6] and as you can see in this link it seems like HippoCMS developers agree that it's a very useful feature to have inside Jackrabbit.

Yes, useful, but with hindsight, I wouldn't go for a seamless
integration any more : We exposed it over virtual layers, but, during
the past years, performance and memory wise, I've switched my opinion
that I'd rather opt for not having faceted navigation exposed as
virtual nodes. Still, being able to query the content over faceted
navigation is desired by almost all customers

>

Regards Ard

Re: full text search improvements

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.
On Mar 26, 2012, at 09:31 , Ard Schrijvers wrote:

> On Mon, Mar 26, 2012 at 2:12 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> 
>>> 3) "cleaner" data in results
>> 
>> This goes into the discussion of what the query result abstraction in
>> the Oak API should look like. Breaking the requirement that all query
>> results be directly linked to nodes in the repository should go a long
>> way here, but it also opens up the issue of how such results relate
>> with access controls. We need to put some thought into this...
> 
> Access control is indeed important, and difficult.


indeed.

in this way it's also a nice option to allow users to hook in Solr/ElasticSearch, because it means that for features where we have not yet figured out a solution (like ACLs on facets) users can already get going by bypassing Jackrabbit for use cases where it makes sense, without setting an unfortunate precedent with partially working features in the normal API that need to be broken later on.

regards,
Lukas Kahwe Smith
mls@pooteeweet.org




Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 2:12 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Sat, Mar 24, 2012 at 3:12 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>> Moreover, (optionally) leveraging these has several other advantages:
>
> Agreed, I think it would be great to have first-class integration from
> Oak to *both* Solr and ElasticSearch. As soon as we have a first draft
> to the indexing extension points (as described in the other thread)
> I'd love to see some prototypes on how they'd work in terms of
> external search indexes. Volunteers for that?
>
>> Now I mentioned faceting [4] above. Right now Jackrabbit does not even
>> support COUNT() [5], which I find very painful and a major oversight. But
>> really what people have come to expect from full text search is faceting.
>
> Totally agreed. We need to make sure that the Oak query API supports
> faceting and other related query features. The actual implementation
> can (should?) be left to individual index components.
>
>> 3) "cleaner" data in results
>
> This goes into the discussion of what the query result abstraction in
> the Oak API should look like. Breaking the requirement that all query
> results be directly linked to nodes in the repository should go a long
> way here, but it also opens up the issue of how such results relate
> with access controls. We need to put some thought into this...

Access control is indeed important, and difficult.

Ard

Re: full text search improvements

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Mon, Mar 26, 2012 at 2:12 PM, Jukka Zitting <ju...@gmail.com> wrote:
> ...As soon as we have a first draft
> to the indexing extension points (as described in the other thread)
> I'd love to see some prototypes on how they'd work in terms of
> external search indexes. Volunteers for that?...

I'd like to create a prototype for semantic search using Stanbol -
might be a bit outside of the core use case, but if we can do that it
would demonstrate that any type of indexing can be integrated.

-Bertrand

Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 5:16 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
>>I'd like to prototype Solr integration, however, I cannot commit to
>>this as I am dependent on the time I will be given by my manager:
>>April is completely booked already. I hope I can get a time slot in
>>May to work on a prototype
>
> That would be really nice! May is better than April anyway, because
> currently part of the "infrastructure" (and the documentation) just isn't
> ready yet. But we should also not wait too long, because the longer we
> wait the less room for bigger changes.

I'll do my best and will try to get it planned.

Regards Ard

>
> Regards,
> Thomas
>




Re: full text search improvements

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>I'd like to prototype Solr integration, however, I cannot commit to
>this as I am dependent on the time I will be given by my manager:
>April is completely booked already. I hope I can get a time slot in
>May to work on a prototype

That would be really nice! May is better than April anyway, because
currently part of the "infrastructure" (and the documentation) just isn't
ready yet. But we should also not wait too long, because the longer we
wait the less room for bigger changes.

Regards,
Thomas


Re: full text search improvements

Posted by Ard Schrijvers <a....@onehippo.com>.
On Mon, Mar 26, 2012 at 2:12 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Sat, Mar 24, 2012 at 3:12 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
>> Moreover, (optionally) leveraging these has several other advantages:
>
> Agreed, I think it would be great to have first-class integration from
> Oak to *both* Solr and ElasticSearch. As soon as we have a first draft
> to the indexing extension points (as described in the other thread)
> I'd love to see some prototypes on how they'd work in terms of
> external search indexes. Volunteers for that?

I'd like to prototype Solr integration, however, I cannot commit to
this as I am dependent on the time I will be given by my manager:
April is completely booked already. I hope I can get a time slot in
May to work on a prototype

Regards Ard

>

Re: full text search improvements

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, Mar 24, 2012 at 3:12 PM, Lukas Kahwe Smith <ml...@pooteeweet.org> wrote:
> Moreover, (optionally) leveraging these has several other advantages:

Agreed, I think it would be great to have first-class integration from
Oak to *both* Solr and ElasticSearch. As soon as we have a first draft
to the indexing extension points (as described in the other thread)
I'd love to see some prototypes on how they'd work in terms of
external search indexes. Volunteers for that?

> Now I mentioned faceting [4] above. Right now Jackrabbit does not even
> support COUNT() [5], which I find very painful and a major oversight. But
> really what people have come to expect from full text search is faceting.

Totally agreed. We need to make sure that the Oak query API supports
faceting and other related query features. The actual implementation
can (should?) be left to individual index components.

> 3) "cleaner" data in results

This goes into the discussion of what the query result abstraction in
the Oak API should look like. Breaking the requirement that all query
results be directly linked to nodes in the repository should go a long
way here, but it also opens up the issue of how such results relate
with access controls. We need to put some thought into this...

> 4) cover more SQL2 functions
>
> This is a comparatively minor topic and might just be beyond the scope
> of this mailing list which seems to be more about designing the future
> architecture than "minor" feature requests. But it would be great to also
> support PATH(), DEPTH() etc. [8].

Agreed. That's one of the main reasons why I think we shouldn't just
reuse the JQOM from JCR 2.0 as the internal query model. Having an
easy way for custom functions to be added, ideally as pluggable
extensions, is IMHO a big part in future-proofing the architecture.
Examples of where this would come in handy are features like querying
by geographical location, image similarity, or graph distance (think
social networks).
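The idea of pluggable query functions can be sketched as a small registry that maps function names to callables. This is only an illustration of the extension mechanism, not the Oak query model; the registry, the decorator and the dict-based node representation are all hypothetical.

```python
FUNCTIONS = {}  # hypothetical registry: upper-case function name -> callable


def register(name):
    """Decorator that plugs a function into the registry under `name`."""
    def decorator(fn):
        FUNCTIONS[name.upper()] = fn
        return fn
    return decorator


@register("PATH")
def path(node):
    """Return the node's absolute path."""
    return node["path"]


@register("DEPTH")
def depth(node):
    """Return the number of path segments below the root (the root has depth 0)."""
    return len([segment for segment in node["path"].split("/") if segment])


def evaluate(name, node):
    """Look up a function by name, case-insensitively, and apply it to a node."""
    try:
        fn = FUNCTIONS[name.upper()]
    except KeyError:
        raise ValueError(f"unknown function: {name}") from None
    return fn(node)


node = {"path": "/content/docs/report"}
node_depth = evaluate("depth", node)
node_path = evaluate("PATH", node)
```

A geographic or similarity function would be registered the same way, without the core query code having to know about it in advance.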

> My point being here, when thinking about Oak, please also think about
> the performance of users talking to Jackrabbit via HTTP.

+1 I think we should start something like oak-jsop or oak-webdav (or
oak-atom) that provides a native mapping of the Oak API to a
HTTP-based access protocol. The current WebDAV(ex) mapping in
Jackrabbit 2.x is (as you've seen) a bit limited by all the JCR and
SPI layering in between.

> The PHPCR team has done its best in trying to solve quite a few
> performance issues with the current HTTP API, but it would be great
> if this was really on everyone's mind.

Agreed. It would be great also to get your feedback on the protocol
bits as soon as we have something runnable. The rough roadmap I came
up with earlier [1] suggests that we should have basic HTTP-based CRUD
operations working in the 0.2 release scheduled for April.

[1] http://markmail.org/message/7dhxklytr2xaoe24

BR,

Jukka Zitting