Posted to oak-dev@jackrabbit.apache.org by Tommaso Teofili <to...@gmail.com> on 2014/08/25 16:08:50 UTC

[DISCUSS] supporting faceting in Oak query engine

Hi all,

since this has been asked every now and then [1], and since I think it's a
pretty useful and common feature for search engines nowadays, I'd like to
discuss the introduction of facets [2] in the Oak query engine.

Pros: having facets in search results usually helps to filter (drill down)
the results before browsing all of them, so the main usage would be in
client code.

Impact: probably a change / addition in both the JCR and Oak APIs to support
returning something other than "just nodes" (a NodeIterator and a Cursor
respectively).

Right now a couple of ideas on how we could do that come to my mind, both
based on the approach of having an Oak index for them:
1. a (multivalued) property index for facets, meaning we would store the
facets in the repository and run a query against that index to obtain the
facets of an originating query.
2. a dedicated QueryIndex implementation, possibly leveraging Lucene's
faceting capabilities, which could "use" the Lucene index we already have,
together with a "sidecar" index [3].
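Whichever option is picked, the core computation both would have to support is a value-to-count map per faceted property. A minimal, dependency-free Java sketch (class and method names are hypothetical, not Oak API; nodes are modelled as plain property maps for illustration):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute facet counts for one property over a
// result set, independent of whether the counts come from option 1's
// property index or option 2's Lucene sidecar index.
public class FacetSketch {

    // Each "node" is modelled as a property map for illustration only.
    public static Map<String, Integer> countFacet(
            List<Map<String, String>> nodes, String property) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, String> node : nodes) {
            String value = node.get(property);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> nodes = List.of(
                Map.of("jcr:primaryType", "nt:unstructured"),
                Map.of("jcr:primaryType", "nt:unstructured"),
                Map.of("jcr:primaryType", "nt:file"));
        // prints {nt:unstructured=2, nt:file=1}
        System.out.println(countFacet(nodes, "jcr:primaryType"));
    }
}
```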

What do you think?
Regards,
Tommaso

[1] :
http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
[2] : http://en.wikipedia.org/wiki/Faceted_search
[3] :
http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Laurie,

2014-08-25 18:43 GMT+02:00 Laurie Byrum <lb...@adobe.com>:

> Hi Tommaso,
> I am happy to see this thread!
>

;-)


>
> Questions:
> Do you expect to want to support hierarchical or pivoted facets soonish?
>

I would say 'why not' if we have a valid use case.


> If so, does that influence this decision?
>

I think so; in particular it would influence the way it may be implemented.


> Do you know how ACLs will come into play with your facet implementation?
>

Not yet; I think that's one of the open points we should take care of
(e.g. Lukas mentioned that HippoCMS used 'virtual nodes' for them); each
'term' in the facet should be properly checked, but of course doing this
kind of check at such a fine grain would be costly, so we need to come up
with a solution which is both correct from the security point of view and
performant.


> If so, does that influence this decision? :-)
>

yes, I think so :)

Any suggestions and / or feedback would be highly welcome, especially from
potential users of this feature so that we properly tackle your
requirements (if any).

Thanks and regards,
Tommaso


>
> Thanks!
> Laurie
>
>
>
> On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com> wrote:
>
> >Hi all,
> >
> >since this has been asked every now and then [1] and since I think it's a
> >pretty useful and common feature for search engine nowadays I'd like to
> >discuss introduction of facets [2] for the Oak query engine.
> >
> >Pros: having facets in search results usually helps filtering (drill down)
> >the results before browsing all of them, so the main usage would be for
> >client code.
> >
> >Impact: probably change / addition in both the JCR and Oak APIs to support
> >returning other than "just nodes" (a NodeIterator and a Cursor
> >respectively).
> >
> >Right now a couple of ideas on how we could do that come to my mind, both
> >based on the approach of having an Oak index for them:
> >1. a (multivalued) property index for facets, meaning we would store the
> >facets in the repository, so that we would run a query against it to have
> >the facets of an originating query.
> >2. a dedicated QueryIndex implementation, eventually leveraging Lucene
> >faceting capabilities, which could "use" the Lucene index we already have,
> >together with a "sidecar" index [3].
> >
> >What do you think?
> >Regards,
> >Tommaso
> >
> >[1] :
> >
> http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
> >Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
> >[2] : http://en.wikipedia.org/wiki/Faceted_search
> >[3] :
> >
> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
> >s/userguide.html
>
>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Chetan Mehrotra <ch...@gmail.com>.
This looks useful Tommaso. With OAK-2005 we should be able to support
multiple LuceneIndexes and manage them easily.

If we can abstract all this out and just expose the facet information
as a virtual node, that would simplify things for end users. Probably
we can have a read-only NodeStore impl to expose the faceted data
bound to a system path. Otherwise we would need to expose the Lucene
API and OakDirectory.
Chetan Mehrotra


On Tue, Aug 26, 2014 at 1:28 PM, Tommaso Teofili
<to...@gmail.com> wrote:
> 2014-08-25 19:02 GMT+02:00 Lukas Smith <sm...@pooteeweet.org>:
>
>> Aloha,
>>
>
> Aloha!
>
>
>>
>> you should definitely talk to the HippoCMS developers. They forked
>> Jackrabbit 2.x to add facetting as virtual nodes. They ran into some
>> performance issues but I am sure they still have value-able feedback on
>> this.
>>
>
> Cool, thanks for letting us know, if you or any other (from Hippo) would
> like to give some more insight on pros and cons of such an approach that'd
> be very good.
>
> Regards,
> Tommaso
>
>
>>
>> regards,
>> Lukas Kahwe Smith
>>
>> > On 25 Aug 2014, at 18:43, Laurie Byrum <lb...@adobe.com> wrote:
>> >
>> > Hi Tommaso,
>> > I am happy to see this thread!
>> >
>> > Questions:
>> > Do you expect to want to support hierarchical or pivoted facets soonish?
>> > If so, does that influence this decision?
>> > Do you know how ACLs will come into play with your facet implementation?
>> > If so, does that influence this decision? :-)
>> >
>> > Thanks!
>> > Laurie
>> >
>> >
>> >
>> >> On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com>
>> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> since this has been asked every now and then [1] and since I think it's
>> a
>> >> pretty useful and common feature for search engine nowadays I'd like to
>> >> discuss introduction of facets [2] for the Oak query engine.
>> >>
>> >> Pros: having facets in search results usually helps filtering (drill
>> down)
>> >> the results before browsing all of them, so the main usage would be for
>> >> client code.
>> >>
>> >> Impact: probably change / addition in both the JCR and Oak APIs to
>> support
>> >> returning other than "just nodes" (a NodeIterator and a Cursor
>> >> respectively).
>> >>
>> >> Right now a couple of ideas on how we could do that come to my mind,
>> both
>> >> based on the approach of having an Oak index for them:
>> >> 1. a (multivalued) property index for facets, meaning we would store the
>> >> facets in the repository, so that we would run a query against it to
>> have
>> >> the facets of an originating query.
>> >> 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
>> >> faceting capabilities, which could "use" the Lucene index we already
>> have,
>> >> together with a "sidecar" index [3].
>> >>
>> >> What do you think?
>> >> Regards,
>> >> Tommaso
>> >>
>> >> [1] :
>> >>
>> http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
>> >> Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
>> >> [2] : http://en.wikipedia.org/wiki/Faceted_search
>> >> [3] :
>> >>
>> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
>> >> s/userguide.html
>> >
>>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Tommaso Teofili <to...@gmail.com>.
2014-08-25 19:02 GMT+02:00 Lukas Smith <sm...@pooteeweet.org>:

> Aloha,
>

Aloha!


>
> you should definitely talk to the HippoCMS developers. They forked
> Jackrabbit 2.x to add faceting as virtual nodes. They ran into some
> performance issues but I am sure they still have valuable feedback on
> this.
>

Cool, thanks for letting us know. If you or anyone else (from Hippo) would
like to give some more insight on the pros and cons of such an approach,
that'd be very good.

Regards,
Tommaso


>
> regards,
> Lukas Kahwe Smith
>
> > On 25 Aug 2014, at 18:43, Laurie Byrum <lb...@adobe.com> wrote:
> >
> > Hi Tommaso,
> > I am happy to see this thread!
> >
> > Questions:
> > Do you expect to want to support hierarchical or pivoted facets soonish?
> > If so, does that influence this decision?
> > Do you know how ACLs will come into play with your facet implementation?
> > If so, does that influence this decision? :-)
> >
> > Thanks!
> > Laurie
> >
> >
> >
> >> On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com>
> wrote:
> >>
> >> Hi all,
> >>
> >> since this has been asked every now and then [1] and since I think it's
> a
> >> pretty useful and common feature for search engine nowadays I'd like to
> >> discuss introduction of facets [2] for the Oak query engine.
> >>
> >> Pros: having facets in search results usually helps filtering (drill
> down)
> >> the results before browsing all of them, so the main usage would be for
> >> client code.
> >>
> >> Impact: probably change / addition in both the JCR and Oak APIs to
> support
> >> returning other than "just nodes" (a NodeIterator and a Cursor
> >> respectively).
> >>
> >> Right now a couple of ideas on how we could do that come to my mind,
> both
> >> based on the approach of having an Oak index for them:
> >> 1. a (multivalued) property index for facets, meaning we would store the
> >> facets in the repository, so that we would run a query against it to
> have
> >> the facets of an originating query.
> >> 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
> >> faceting capabilities, which could "use" the Lucene index we already
> have,
> >> together with a "sidecar" index [3].
> >>
> >> What do you think?
> >> Regards,
> >> Tommaso
> >>
> >> [1] :
> >>
> http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
> >> Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
> >> [2] : http://en.wikipedia.org/wiki/Faceted_search
> >> [3] :
> >>
> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
> >> s/userguide.html
> >
>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Tommaso Teofili <to...@gmail.com>.
2014-12-08 8:15 GMT+01:00 Thomas Mueller <mu...@adobe.com>:

> Hi,
>
> I think we should do:
>
>
> > 1. conservative approach, do not touch JCR API
>
>
> > select [jcr:path], [facet(jcr:primaryType)] from [nt:base]
> > where contains([text], 'oak');
>
> The column "facet(jcr:primaryType)" would return the facet data. I think
> that's a good approach. The question is, which rows would return that
> data. I would prefer a solution where _each_ row returns the data (and not
> just the first row), because that's a bit easier to use, easier to
> document, and more closely matches the relational model. If just the first
> row returns the facet data, then we can't sort the result afterwards
> (otherwise the facet data ends up in another row, which would be weird).
>

Sure, I see this point; while the first impl Thomas and I discussed
offline didn't do that, the current PoC does exactly that (it can return
the facets via row.getColumnValue("facet(jcr:primaryType)") for each row).


>
> Another approach is to extend the API (create a new interface, for example
> OakQuery). The JDBC API (but not the JCR API) has a concept of multiple
> result sets per query (Statement.getMoreResults). We could build a
> solution that more closely matches this model. But I don't think it's
> worth the trouble right now (we could still do that later on if really
> needed).
>

I think that, for an end user to leverage facets easily, what you propose
would really make things nicer; of course there's no hurry in defining
that, at least until we have a satisfactory facets implementation.


>
> About security, I wonder what are the common configurations. I think we
> should avoid a complex (but slow, and hard to implement) solution that can
> solve 100% of all possible _theoretical_ cases, but instead go for a
> (faster, simpler) solution that covers 99% of all _practical_ cases.
>

If I think of the simplest use cases, I see:
- a publicly available website where users can search without logging in
- a website where logged-in users can search on some content

Both would require the results and facets to be filtered to the content a
logged-in user or an "anonymous" user can see.
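As a hypothetical illustration of that filtering (plain Java, no Oak API; the Node class and the path-based access check are invented), the facet counts would be recomputed from only the nodes the session can read:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: facet counts consistent with what a given user
// (logged in or "anonymous") is allowed to read.
public class AclFilteredFacets {

    public static final class Node {
        public final String path;
        public final String primaryType;
        public Node(String path, String primaryType) {
            this.path = path;
            this.primaryType = primaryType;
        }
    }

    public static Map<String, Integer> facetCounts(List<Node> results,
                                                   Predicate<String> canRead) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Node node : results) {
            if (canRead.test(node.path)) {   // the costly per-node ACL check
                counts.merge(node.primaryType, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Node> results = List.of(
                new Node("/content/public/a", "nt:file"),
                new Node("/content/private/b", "nt:file"),
                new Node("/content/public/c", "nt:unstructured"));
        // "anonymous" may only read the public subtree
        Predicate<String> anonymous = p -> p.startsWith("/content/public/");
        // prints {nt:file=1, nt:unstructured=1}
        System.out.println(facetCounts(results, anonymous));
    }
}
```

The sketch makes the cost visible: the check runs once per result node, which is exactly the fine-grained work discussed in this thread.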

Perhaps we may also have a use case where the website exposes content
crawled from the Web (e.g. Google), where there's no filtering on content,
maybe just a personalized ranking (but that's a different story that
doesn't belong here).

@Michael, Laurie: for filtering out the counts, as I said I'd prefer not to
do that because it's an interesting piece of information we would lose;
what we may do is make that inclusion/exclusion configurable, either in
the query index definition node or at runtime somehow within the query,
depending on the client's needs.

@Laurie: for option #5, that would mean we would have query indexes which
can index and query only data a configured user can see; e.g. we could have
an 'anonymous-lucene' index, a Lucene index that is only able to index
nodes the user "anonymous" can see (has jcr:read privilege on), and that is
used only for queries issued by the user "anonymous". However, as I said, I
am not sure that's a good idea, because it may not scale (if you want to
define 100 users, you would have 100 Lucene indexes dedicated to 100
different users).

Regards,
Tommaso


>
>
> Regards,
> Thomas
>
>
>
>
> On 05/12/14 12:13, "Tommaso Teofili" <to...@gmail.com> wrote:
>
> >Hi all,
> >
> >I am resurrecting this thread as I've managed to find some time to start
> >having a look at how to support faceting in Oak query engine.
> >
> >One important thing is that I agree with Ard (and I've seen it like that
> >from the beginning) that since we have Lucene and Solr Oak index
> >implementations we should rely on them for such advanced features [1][2]
> >instead of reinventing the wheel.
> >
> >Within the above assumption the implementation seems quite
> >straightforward.
> >The not so obvious bits comes when getting to:
> >- exposing facets within the JCR API
> >- correctly filtering facets depending on authorization / privileges
> >
> >For the former here is a quick list of options that came to my mind
> >(originated also when talking to people f2f about this):
> >1. conservative approach, do not touch JCR API: facets are retrieved as
> >custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
> >row.getValue("facet(jcr:primaryType)")).
> >2. Oak-only approach, do not touch JCR API but provide utilities which can
> >retrieve structured facets from the result, e.g. Iterable<Facet> facets =
> >OakQueryUtils.extractFacets(QueryResult.getRows());
> >3. not JCR compliant approach, we add methods to the API similarly to what
> >Ard and AlexK proposed
> >4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
> >where QueryResult can be adapted to different things and therefore it's
> >more extensible (but less controllable).
> >Of course other proposals are welcome on this.
> >
> >For the latter the things seem less simple as I foresee that we want the
> >facets to be consistent with the result nodes and therefore to be filtered
> >according to the privileges of the user having issued the query.
> >Here are the options I could think of so far, even though none looks
> >satisfactory to me yet:
> >
> >1. retrieve facets and then filter them afterwards seems to have an
> >inherent issue because the facets do not include information about the
> >documents (nodes) which generated them, therefore retrieving them
> >unfiltered (as the index doesn't have information about ACLs) as they are,
> >e.g. facet on jcr:primaryType:
> >
> >"jcr:primaryType" : {
> >    "nt:unstructured" : 100,
> >    "nt:file" : 20,
> >    "oak:Unstructured" : 10
> >}
> >
> >would require us to: iterate over the results and filter counts as we
> >iterate, or do N further queries to filter the counts (but then it would
> >be useless to have the facets returned from the index, as we'd be
> >retrieving them ourselves to do the ACL checks), or other such
> >brute-force methods.
> >
> >2. retrieve the facets unfiltered from the index and then return them in
> >the filtered results only if there's at least one item (node) in the
> >(filtered) results which falls under that facet. That would mean that we
> >would not return the counts of the facets, but a facet would be returned
> >if
> >there's at least one item in the results belonging to it. While it sounds
> >a bit not too nice (and a pity as we're losing some information we have
> >along the way), Amazon does exactly that (see "Show results for" column on
> >the left at [3]) :-)
> >
> >3. use a slightly different mechanism for returning facets, called result
> >grouping (or field collapsing) in Solr [4], in which results are returned
> >grouped (and counted) by a certain field. The example of point 1 would
> >look
> >like:
> >
> >"grouped":{
> >  "jcr:primaryType":{
> >    "matches": 130,
> >    "groups":[{
> >        "groupValue":"nt:unstructured",
> >        "doclist":{"numFound":100,"start":0,"docs":[
> >            {
> >              "path":"/content/a/b"
> >            }, ...
> >          ]
> >        }},
> >      {
> >        "groupValue":"nt:file",
> >        "doclist":{"numFound":20,"start":0,"docs":[
> >            {
> >              "path":"/content/d/e"
> >            }, ...
> >          ]
> >        }},
> >...
> >
> >there the facets would also contain (some or all of) the docs (nodes)
> >belonging to each group and therefore filtering the facets afterwards
> >could
> >be done without having to retrieve the paths of the nodes falling under
> >each facet.
> >
> >4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
> >and incorporate the ACLs in the index so that no further filtering has to
> >be done once the underlying query index has returned its results. However
> >this comes with a non trivial impact with regards to a) load of the
> >indexing on the repo (each time some ACL changes a bunch of index updates
> >happen)  b) complexity in encoding ACLs in the indexed documents c)
> >complexity in encoding the ACL check in the index-specific queries. Still
> >this is probably something we may evaluate regardless of facets in the
> >future as the lazy ACL check approach we have has, IIUTC, the following
> >issue: userA searching for 'jcr:title=foo', the query engine selecting the
> >Lucene property index which returns 100 docs, userA being only able to see
> >2 of them because of its ACLs, in this case we have wasted (approximately)
> >80% of the Lucene effort to match and return the documents. However this
> >is
> >most probably overkill for now...
> >
> >5. another probably crazy idea is user filtered indexes, meaning that the
> >NodeStates passed to such IndexEditors would be filtered according to what
> >a configured user (list) can see / read. The obvious disadvantage is the
> >eventual pollution of such indexes and the consequent repository growth.
> >
> >6. at query time map the user ACLs to a list of (readable) paths, since
> >both Lucene and Solr index implementations index the exact path of each
> >node, such a list may be passed as a "filter query" to be used to find the
> >subset of nodes that such a user can see, and therefore the results
> >(facets
> >included) would come already filtered. The questions here are: a) is it
> >possible to do this mapping at all? b) how slow would it be? Also the
> >implementation of that would probably require encoding the paths in a way
> >that makes them shorter, possibly as numbers, so that search would be
> >faster.
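A hypothetical sketch of point 6's "filter query" (plain Java; the `path` field name and the escaping are assumptions for illustration, not the actual Oak Solr schema):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: map the subtree roots a user may read into a
// Solr-style filter query that can be ANDed with every search, so
// results and facets come back already filtered.
public class PathFilterQuery {

    public static String toFilterQuery(List<String> readableRoots) {
        return readableRoots.stream()
                .map(root -> "path:" + escape(root) + "*") // descendant-or-self
                .collect(Collectors.joining(" OR "));
    }

    // '/' is special in Lucene 4.x query syntax (regex delimiter)
    private static String escape(String path) {
        return path.replace("/", "\\/");
    }

    public static void main(String[] args) {
        // prints path:\/content\/public* OR path:\/home\/userA*
        System.out.println(
                toFilterQuery(List.of("/content/public", "/home/userA")));
    }
}
```

The open questions from the text remain visible here: the clause grows with the number of readable subtrees, which is why a shorter (possibly numeric) path encoding is suggested.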
> >
> >Again other proposals on authorization are welcome and I'll keep thinking
> >/
> >inspecting on other approaches too.
> >
> >Thanks to the brave who could read so far, after so many words a bit of
> >code of the first PoC of facets based on the Solr index [7]: facets there
> >are not filtered by ACLs and are returned as columns in
> >QueryResult.getRows() (JCR API conservative option seemed better to
> >start).
> >A sample query to retrieve facets would be:
> >
> >select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where
> >contains([text], 'oak');
> >
> >and the result would look like:
> >
> >RowIterator rows = query.execute().getRows();
> >while (rows.hasNext()) {
> >    Row row = rows.nextRow();
> >    String facetString = row.getValue("facet(jcr:primaryType)").getString(); // -->
> >jcr:primaryType:[nt:unstructured (100), nt:file (20),
> >oak:Unstructured(10)]
> >    ...
> >}
> >
> >Looking forward to your comments on the mentioned approaches (and code,
> >eventually).
> >Regards,
> >Tommaso
> >
> >[1] :
> >
> http://lucene.apache.org/core/4_10_2/facet/org/apache/lucene/facet/package
> >-summary.html
> >[2] : https://cwiki.apache.org/confluence/display/solr/Faceting
> >[3] :
> >
> http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywor
> >ds=sony
> >[4] : https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >[5] : http://en.wikipedia.org/wiki/Database_index#Covering_index
> >[6] : http://markmail.org/message/4i5d55235oo26okl
> >[7] :
> >https://github.com/tteofili/jackrabbit-oak/compare/oak-1736a#files_bucket
> >
> >2014-09-01 13:11 GMT+02:00 Ard Schrijvers <a....@onehippo.com>:
> >
> >> Hey Alex,
> >>
> >> On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
> >> <ak...@adobe.com> wrote:
> >> > On 29.08.2014, at 03:10, Ard Schrijvers <a....@onehippo.com>
> >> wrote:
> >> >
> >> >> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
> >> >> layers any more to expose them over pure JCR spec API's. Instead, we
> >> >> would extend the jcr QueryResult to have next to getRows/getNodes/etc
> >> >> also expose for example methods on the QueryResult like
> >> >>
> >> >> public Map<String, Integer> getFacetValues(final String facet) {
> >> >>      return result.getFacetValues(facet);
> >> >> }
> >> >>
> >> >> public QueryResult drilldown(final FacetValue facetValue) {
> >> >>        // return current query result drilled down for facet value
> >> >>        return ...
> >> >> }
> >> >
> >> > We actually have a similar API in our CQ/AEM product:
> >> >
> >> > Query => represents a query [1]
> >> > SearchResult result = query.getResult();
> >> > Map<String, Facet> facets = result.getFacets();
> >> >
> >> > A facet is a list of "Buckets" [2] - same as FacetValue above, I
> >>assume
> >> - an abstraction over different values. You could have distinctive
> >>values
> >> (e.g. "red", "green", "blue"), but also ranges ("last year", "last
> >>month"
> >> etc.). Each bucket has a count, i.e. the number of times it occurs in
> >>the
> >> current result.
> >> >
> >> > Then on Query you have a method
> >> >
> >> > Query refine(Bucket bucket)
> >> >
> >> > which is the same as the drilldown above.
> >> >
> >> > So in the end it looks pretty much the same, and seems to be a good
> >>way
> >> to represent this as API. Doesn't say much about the implementation yet,
> >> though :)
> >>
> >> It looks very much the same, and I must admit that during typing my
> >> mail I didn't put too much attention to things like how to name
> >> something (I reckon that #refine is a much nicer name than the
> >> drillDown I wrote :-)
> >>
> >> >
> >> >> 2) Authorized counts....for faceting, it doesn't make sense to expose
> >> >> there are 314 results if you can only read 54 of them. Accounting for
> >> >> authorization through access manager can be way too slow.
> >> >> ...
> >> >> 3) If you support faceting through Oak, will that be competitive
> >> >> enough to what Solr and Elasticsearch offer? Customers these days
> >>have
> >> >> some expectations on search result quality and faceting capabilities,
> >> >> performance included.
> >> >> ...
> >> >> So, my take would be to invest time in easy integration with
> >> >> solr/elasticsearch and focus in Oak on the parts (hierarchy,
> >> >> authorization, merging, versioning) that aren't covered by already
> >> >> existing frameworks. Perhaps provide an extended JCR API as described
> >> >> in (1) which under the hood can delegate to a solr or es java client.
> >> >> In the end, you'll still end up having the authorized counts issue,
> >> >> but if you make the integration pluggable enough, it might be
> >>possible
> >> >> to leverage domain specific solutions to this (solr/es doesn't do
> >> >> anything with authorization either, it is a tough nut to crack)
> >> >
> >> > Good points. When facets are used, the worst case (showing facets for
> >> all your content) might actually be the very first thing you see, when
> >> something like a product search/browse page is shown, before any actual
> >> search by the user is done. Optimizing for performance right from the
> >>start
> >> is a must, I agree.
> >> >
> >> > What I can imagine though, is if you can leverage some kind of caching
> >> though. In practice, if you have a public site with content that does
> >>not
> >> change permanently, the facet values are pretty much stable, and
> >> authorization shouldn't cost much.
> >>
> >> Certainly there are many use cases where you can cache a lot, or for
> >> example have a public site user that has read access to an entire
> >> content tree. It becomes however much more difficult when you want to
> >> for example expose faceted structure of documents to an editor in a
> >> cms environment, where the editor has read access to only 1% of the
> >> documents. If at the same time, her initial query without
> >> authorization results in, say, 10 million hits, then you'll have to
> >> authorize all of them to get correct counts. The only way we could
> >> make this perform with Hippo CMS against jackrabbit was by
> >> translating our authorization model directly to lucene
> >> queries and keep caching (authorized) bitsets (slightly different in
> >> newer lucene versions) in memory for a user, see [1]. The difficulty
> >> was that even executing the authorization query (to AND with normal
> >> query) became slow because of very large queries, but fortunately due
> >> to the jackrabbit 2 index implementation, we could keep a cached
> >> bitset per indexreader, see [2]. Unfortunately, this solution can only
> >> be done for specific authorization models (which can be mapped to
> >> lucene queries) and might not be generic enough for oak.
> >>
> >> Anyway, apart from performance / authorization, I doubt whether oak
> >> will be able to keep up with what can be leveraged through ES or Solr.
> >>
> >> Regards Ard
> >>
> >> [1]
> >>
> >>
> http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/sr
> >>c/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
> >> [2]
> >>
> >>
> http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-au
> >>thorization-combined-with-searches.html
> >>
> >> >
> >> > [1]
> >>
> >>
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/sear
> >>ch/Query.html
> >> > [2]
> >>
> >>
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/sear
> >>ch/facets/Bucket.html
> >> >
> >> > Cheers,
> >> > Alex
> >>
> >>
> >>
> >> --
> >> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> >> Boston - 1 Broadway, Cambridge, MA 02142
> >>
> >> US +1 877 414 4776 (toll free)
> >> Europe +31(0)20 522 4466
> >> www.onehippo.com
> >>
>
>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Laurie Byrum <lb...@adobe.com>.
Thanks, Michael. FWIW, with the use cases I have in mind, getting back a
count that is less than the actual number (and some indication that there
is an unknown amount more) would be perfectly fine if it makes us go from
potentially unacceptable performance to acceptable performance.

Laurie
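An open-ended count of the kind Laurie describes could look like this hypothetical sketch (plain Java; names invented): ACL-check at most `limit` nodes and mark the result as a lower bound when the cap is hit:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: ACL-check at most `limit` nodes; if the result
// set is larger, return a lower bound marked as inexact ("1000+"),
// trading exactness for bounded work as discussed in this thread.
public class CappedCount {

    public static String countReadable(
            List<String> paths, Predicate<String> canRead, int limit) {
        int readable = 0;
        int checked = 0;
        for (String path : paths) {
            if (checked == limit) {
                return readable + "+";   // open-ended: more may be readable
            }
            checked++;
            if (canRead.test(path)) {
                readable++;
            }
        }
        return String.valueOf(readable); // exact: everything was checked
    }

    public static void main(String[] args) {
        Predicate<String> all = p -> true;
        // prints 2+
        System.out.println(countReadable(List.of("/a", "/b", "/c"), all, 2));
        // prints 3
        System.out.println(countReadable(List.of("/a", "/b", "/c"), all, 5));
    }
}
```

Note the count never exceeds the true number, which matters for the information-leakage concern Michael raises below.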


On 12/12/14 12:41 AM, "Michael Marth" <mm...@adobe.com> wrote:

>Hi,
>
>Davide's proposal (let users specify maximum number of entries per facet)
>is basically a generalisation of my proposal to return a facet if there
>is more than 1 entry in the facet. I think we can try either, but we
>might want to test the performance on cases with large result sets where
>only few results are readable by the user.
>AFAIR Amit and Davide have been working on a "micro scalability test
>framework" (measuring how queries scale with content). We could maybe add
>these tests there.
>
>On Ard's suggestion "possibly incorrect, fast counts": I think this is
>only feasible if "incorrect" is guaranteed to always be lower than the
>exact amount. Otherwise facets would lead to information leakage as users
>could find information about nodes they otherwise cannot read.
>
>Cheers
>Michael
>
>
>On 10 Dec 2014, at 11:12, Tommaso Teofili <to...@gmail.com>
>wrote:
>
>> 2014-12-10 10:17 GMT+01:00 Ard Schrijvers <a....@onehippo.com>:
>> 
>>> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <da...@apache.org>
>>> wrote:
>>>> On 09/12/2014 17:10, Michael Marth wrote:
>>>>> ...
>>>>> 
>>>>> The use cases problematic for counting the facets I have in mind
>>> are when a query returns millions of results. This is problematic when
>>>one
>>> wants to retrieve the exact size of the result set (taking ACLs into
>>> account, obviously). When facets are to be retrieved this will be an
>>>even
>>> harder problem (meaning when the exact number is to be calculated per
>>> facet).
>>>>> As an illustration consider a digital asset management application
>>>>>that
>>> displays mime type as facets. A query could return 1 million images
>>>and,
>>> say, 10 video.
>>>>> 
>>>>> Is there a way we could support such scenarios (while still counting
>>> results per facet) and have a performant implementation?
>>>>> 
>>>> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes.
>>>>If
>>>> we're done within it, then we can output the actual number. In case
>>>> after 1000 nodes checked we still have some left we can leave the
>>>>number
>>>> either empty or with something like "many", "+", or any other fancy
>>>>way
>>>> if we want.
>>>> 
>>>> In the end it is the same approach taken by Amazon (as Tommaso already
>>>> pointed out) or for example Google. If you run a search, their facets
>>>> (Searches related to...) never come with result counts.
>>> 
>>> I don't think Amazon and Google have customers that can demand them to
>>> show correct facet counts...our customers typically do :).
>> 
>> 
>> I see, however something along the lines of what Davide was proposing
>> doesn't sound too bad to me even for such use cases (but I may be
>>wrong).
>> 
>> 
>>> My take on
>>> this would be to have a configurable option between
>>> 
>>> 1) exact and possibly slow counts
>>> 2) unauthorized, possibly incorrect, fast counts
>>> 
>>> Obviously, the second just uses the faceted navigation counts from the
>>> backing search implementation (without node by node access manager
>>> check), whether it is the internal lucene index, solr or Elastic
>>> Search. If you opt for the second option, then, depending on your
>>> authorization model you can get fast exact authorized counts as well :
>>> When the authorization model can be translated into a search query /
>>> filter that is AND-ed with every normal search. For ES this is briefly
>>> written at [1]. Most likely the filter is internally cached so even
>>> for very large authorization queries (like we have at Hippo because of
>>> fine grained ACL model) it will just perform. Obviously it depends
>>> quite heavily on your authorization model whether it can be translated
>>> to a query. If  it relies on an external authorization check or has
>>> many hierarchical constraints, it will be very hard. If you choose to
>>> have it based on, say, nodetype, nodename, node properties and
>>> jcr:path (fake pseudo property) it can be easily translated to a
>>> query. Note that for the jcr:path hierarchical ACL (eg read everything
>>> below /foo) it is not possible to write a lucene query easily unless
>>> you index path information as well....this results in that moves of
>>> large subtree's are slow because the entire subtree needs to be
>>> re-indexed. A different authorization model might be based on groups,
>>> where every node also gets the groups (the token of the group) indexed
>>> that can read that node. Although I never looked much into the code, I
>>> suspect [2] does something like this.
>>> 
>> 
>> that's what I had in mind in my proposal #4, the hurdles there relate to
>> the fact that each index implementation aiming at providing facets would
>> have to implement such an index and search with ACLs which is not
>>trivial.
>> One possibly good thing is that this is for sure not a new issue, as you
>> pointed out Apache ManifoldCF has something like that for Solr (and I
>>think
>> for ES too). One the other hand this would differ quite a bit from the
>> approach taken so far (indexes see just node and properties, the
>> QueryEngine post filters results on ACLs, node types, etc.), so that'd
>>be a
>> significant change.
>> 
>> 
>>> 
>>> So, instead of second guessing which might be acceptable (slow
>>> queries, wrong counts, etc) for which customers/users I'd try to keep
>>> the options open, have a default of correct (slow) counts, and make it
>>> easy to flip to 'counts from the indexes without accessmanager
>>> authorization', where depending on the authorization model, the latter
>>> can return correct results.
>>> 
>> 
>> I think the best way of addressing this is by try prototyping (some of)
>>the
>> mentioned options and see where we get, I'll see what I can do there.
>> 
>> 
>>> 
>>> For those who are interested, I will be listening to [3] this
>>> afternoon (5 pm GMT).
>>> 
>> 
>> cool, thanks for the pointer!
>> 
>> Regards,
>> Tommaso
>> 
>> 
>>> 
>>> Regards Ard
>>> 
>>> [1]
>>> 
>>>http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/in
>>>dices-aliases.html#filtered
>>> [2] http://manifoldcf.apache.org/en_US/index.html
>>> [3]
>>> 
>>>http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elast
>>>icsearch/
>>> 
>>> 
>>>> 
>>>> D.
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
>>> Boston - 1 Broadway, Cambridge, MA 02142
>>> 
>>> US +1 877 414 4776 (toll free)
>>> Europe +31(0)20 522 4466
>>> www.onehippo.com
>>> 
>


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Michael Marth <mm...@adobe.com>.
Hi,

Davide’s proposal (let users specify maximum number of entries per facet) is basically a generalisation of my proposal to return a facet if there is more than 1 entry in the facet. I think we can try either, but we might want to test the performance on cases with large result sets where only few results are readable by the user.
AFAIR Amit and Davide have been working on a “micro scalability test framework” (measuring how queries scale with content). We could maybe add these tests there.

On Ard’s suggestion “possibly incorrect, fast counts”: I think this is only feasible if “incorrect” is guaranteed to always be lower than the exact amount. Otherwise facets would lead to information leakage as users could find information about nodes they otherwise cannot read.

Cheers
Michael


On 10 Dec 2014, at 11:12, Tommaso Teofili <to...@gmail.com> wrote:

> 2014-12-10 10:17 GMT+01:00 Ard Schrijvers <a....@onehippo.com>:
> 
>> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <da...@apache.org>
>> wrote:
>>> On 09/12/2014 17:10, Michael Marth wrote:
>>>> ...
>>>> 
>>>> The use cases problematic case for counting the facets I have in mind
>> are when a query returns millions of results. This is problematic when one
>> wants to retrieve the exact size of the result set (taking ACLs into
>> account, obviously). When facets are to be retrieved this will be an even
>> harder problem (meaning when the exact number is to be calculated per
>> facet).
>>>> As an illustration consider a digital asset management application that
>> displays mime type as facets. A query could return 1 million images and,
>> say, 10 video.
>>>> 
>>>> Is there a way we could support such scenarios (while still counting
>> results per facet) and have a performant implementation?
>>>> 
>>> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
>>> we're done within it, then we can output the actual number. In case
>>> after 1000 nodes checked we still have some left we can leave the number
>>> either empty or with something like "many", "+", or any other fancy way
>>> if we want.
>>> 
>>> In the end is the same approach taken by Amazon (as Tommaso already
>>> pointed) or for example google. If you run a search, their facets
>>> (Searches related to...) are never with results.
>> 
>> I don't think Amazon and Google have customers that can demand them to
>> show correct facet counts...our customers typically do :).
> 
> 
> I see, however something along the lines of what Davide was proposing
> doesn't sound too bad to me even for such use cases (but I may be wrong).
> 
> 
>> My take on
>> on this would be to have a configurable option between
>> 
>> 1) exact and possibly slow counts
>> 2) unauthorized, possibly incorrect, fast counts
>> 
>> Obviously, the second just uses the faceted navigation counts from the
>> backing search implementation (with node by node access manager
>> check), whether it is the internal lucene index, solr or Elastic
>> Search. If you opt for the second option, then, depending on your
>> authorization model you can get fast exact authorized counts as well :
>> When the authorization model can be translated into a search query /
>> filter that is AND-ed with every normal search. For ES this is briefly
>> written at [1]. Most likely the filter is internally cached so even
>> for very large authorization queries (like we have at Hippo because of
>> fine grained ACL model) it will just perform. Obviously it depends
>> quite heavily on your authorization model whether it can be translated
>> to a query. If  it relies on an external authorization check or has
>> many hierarchical constraints, it will be very hard. If you choose to
>> have it based on, say, nodetype, nodename, node properties and
>> jcr:path (fake pseudo property) it can be easily translated to a
>> query. Note that for the jcr:path hierarchical ACL (eg read everything
>> below /foo) it is not possible to write a lucene query easily unless
>> you index path information as well....this results in that moves of
>> large subtree's are slow because the entire subtree needs to be
>> re-indexed. A different authorization model might be based on groups,
>> where every node also gets the groups (the token of the group) indexed
>> that can read that node. Although I never looked much into the code, I
>> suspect [2] does something like this.
>> 
> 
> that's what I had in mind in my proposal #4, the hurdles there relate to
> the fact that each index implementation aiming at providing facets would
> have to implement such an index and search with ACLs which is not trivial.
> One possibly good thing is that this is for sure not a new issue, as you
> pointed out Apache ManifoldCF has something like that for Solr (and I think
> for ES too). One the other hand this would differ quite a bit from the
> approach taken so far (indexes see just node and properties, the
> QueryEngine post filters results on ACLs, node types, etc.), so that'd be a
> significant change.
> 
> 
>> 
>> So, instead of second guessing which might be acceptable (slow
>> queries, wrong counts, etc) for which customers/users I'd try to keep
>> the options open, have a default of correct (slow) counts, and make it
>> easy to flip to 'counts from the indexes without accessmanager
>> authorization', where depending on the authorization model, the latter
>> can return correct results.
>> 
> 
> I think the best way of addressing this is by try prototyping (some of) the
> mentioned options and see where we get, I'll see what I can do there.
> 
> 
>> 
>> For those who are interested, I will be listening to [3] this
>> afternoon (5 pm GMT).
>> 
> 
> cool, thanks for the pointer!
> 
> Regards,
> Tommaso
> 
> 
>> 
>> Regards Ard
>> 
>> [1]
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
>> [2] http://manifoldcf.apache.org/en_US/index.html
>> [3]
>> http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>> 
>> 
>>> 
>>> D.
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
>> Boston - 1 Broadway, Cambridge, MA 02142
>> 
>> US +1 877 414 4776 (toll free)
>> Europe +31(0)20 522 4466
>> www.onehippo.com
>> 


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Tommaso Teofili <to...@gmail.com>.
2014-12-10 10:17 GMT+01:00 Ard Schrijvers <a....@onehippo.com>:

> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <da...@apache.org>
> wrote:
> > On 09/12/2014 17:10, Michael Marth wrote:
> >> ...
> >>
> >> The use cases problematic case for counting the facets I have in mind
> are when a query returns millions of results. This is problematic when one
> wants to retrieve the exact size of the result set (taking ACLs into
> account, obviously). When facets are to be retrieved this will be an even
> harder problem (meaning when the exact number is to be calculated per
> facet).
> >> As an illustration consider a digital asset management application that
> displays mime type as facets. A query could return 1 million images and,
> say, 10 video.
> >>
> >> Is there a way we could support such scenarios (while still counting
> results per facet) and have a performant implementation?
> >>
> > We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
> > we're done within it, then we can output the actual number. In case
> > after 1000 nodes checked we still have some left we can leave the number
> > either empty or with something like "many", "+", or any other fancy way
> > if we want.
> >
> > In the end is the same approach taken by Amazon (as Tommaso already
> > pointed) or for example google. If you run a search, their facets
> > (Searches related to...) are never with results.
>
> I don't think Amazon and Google have customers that can demand them to
> show correct facet counts...our customers typically do :).


I see, however something along the lines of what Davide was proposing
doesn't sound too bad to me even for such use cases (but I may be wrong).


> My take on
> on this would be to have a configurable option between
>
> 1) exact and possibly slow counts
> 2) unauthorized, possibly incorrect, fast counts
>
> Obviously, the second just uses the faceted navigation counts from the
> backing search implementation (with node by node access manager
> check), whether it is the internal lucene index, solr or Elastic
> Search. If you opt for the second option, then, depending on your
> authorization model you can get fast exact authorized counts as well :
> When the authorization model can be translated into a search query /
> filter that is AND-ed with every normal search. For ES this is briefly
> written at [1]. Most likely the filter is internally cached so even
> for very large authorization queries (like we have at Hippo because of
> fine grained ACL model) it will just perform. Obviously it depends
> quite heavily on your authorization model whether it can be translated
> to a query. If  it relies on an external authorization check or has
> many hierarchical constraints, it will be very hard. If you choose to
> have it based on, say, nodetype, nodename, node properties and
> jcr:path (fake pseudo property) it can be easily translated to a
> query. Note that for the jcr:path hierarchical ACL (eg read everything
> below /foo) it is not possible to write a lucene query easily unless
> you index path information as well....this results in that moves of
> large subtree's are slow because the entire subtree needs to be
> re-indexed. A different authorization model might be based on groups,
> where every node also gets the groups (the token of the group) indexed
> that can read that node. Although I never looked much into the code, I
> suspect [2] does something like this.
>

that's what I had in mind in my proposal #4; the hurdles there relate to
the fact that each index implementation aiming to provide facets would
have to implement such ACL-aware indexing and search, which is not trivial.
One possibly good thing is that this is for sure not a new issue: as you
pointed out, Apache ManifoldCF has something like that for Solr (and I think
for ES too). On the other hand this would differ quite a bit from the
approach taken so far (indexes see just nodes and properties, the
QueryEngine post-filters results on ACLs, node types, etc.), so that'd be a
significant change.


>
> So, instead of second guessing which might be acceptable (slow
> queries, wrong counts, etc) for which customers/users I'd try to keep
> the options open, have a default of correct (slow) counts, and make it
> easy to flip to 'counts from the indexes without accessmanager
> authorization', where depending on the authorization model, the latter
> can return correct results.
>

I think the best way of addressing this is by prototyping (some of) the
mentioned options and seeing where we get; I'll see what I can do there.


>
> For those who are interested, I will be listening to [3] this
> afternoon (5 pm GMT).
>

cool, thanks for the pointer!

Regards,
Tommaso


>
> Regards Ard
>
> [1]
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
> [2] http://manifoldcf.apache.org/en_US/index.html
> [3]
> http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>
>
> >
> > D.
> >
> >
> >
> >
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Ard Schrijvers <a....@onehippo.com>.
On Wed, Dec 10, 2014 at 10:17 AM, Ard Schrijvers
<a....@onehippo.com> wrote:
> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <da...@apache.org> wrote:
>> On 09/12/2014 17:10, Michael Marth wrote:
>>> ...
>>>
>>> The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet).
>>> As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.
>>>
>>> Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?
>>>
>> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
>> we're done within it, then we can output the actual number. In case
>> after 1000 nodes checked we still have some left we can leave the number
>> either empty or with something like "many", "+", or any other fancy way
>> if we want.
>>
>> In the end is the same approach taken by Amazon (as Tommaso already
>> pointed) or for example google. If you run a search, their facets
>> (Searches related to...) are never with results.
>
> I don't think Amazon and Google have customers that can demand them to
> show correct facet counts...our customers typically do :). My take on
> on this would be to have a configurable option between
>
> 1) exact and possibly slow counts
> 2) unauthorized, possibly incorrect, fast counts
>
> Obviously, the second just uses the faceted navigation counts from the
> backing search implementation (with node by node access manager

Here of course I meant to write: '**without** node by node access manager check'

> check), whether it is the internal lucene index, solr or Elastic
> Search. If you opt for the second option, then, depending on your
> authorization model you can get fast exact authorized counts as well :
> When the authorization model can be translated into a search query /
> filter that is AND-ed with every normal search. For ES this is briefly
> written at [1]. Most likely the filter is internally cached so even
> for very large authorization queries (like we have at Hippo because of
> fine grained ACL model) it will just perform. Obviously it depends
> quite heavily on your authorization model whether it can be translated
> to a query. If  it relies on an external authorization check or has
> many hierarchical constraints, it will be very hard. If you choose to
> have it based on, say, nodetype, nodename, node properties and
> jcr:path (fake pseudo property) it can be easily translated to a
> query. Note that for the jcr:path hierarchical ACL (eg read everything
> below /foo) it is not possible to write a lucene query easily unless
> you index path information as well....this results in that moves of
> large subtree's are slow because the entire subtree needs to be
> re-indexed. A different authorization model might be based on groups,
> where every node also gets the groups (the token of the group) indexed
> that can read that node. Although I never looked much into the code, I
> suspect [2] does something like this.
>
> So, instead of second guessing which might be acceptable (slow
> queries, wrong counts, etc) for which customers/users I'd try to keep
> the options open, have a default of correct (slow) counts, and make it
> easy to flip to 'counts from the indexes without accessmanager
> authorization', where depending on the authorization model, the latter
> can return correct results.
>
> For those who are interested, I will be listening to [3] this
> afternoon (5 pm GMT).
>
> Regards Ard
>
> [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
> [2] http://manifoldcf.apache.org/en_US/index.html
> [3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>
>
>>
>> D.
>>
>>
>>
>>
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Ard Schrijvers <a....@onehippo.com>.
On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <da...@apache.org> wrote:
> On 09/12/2014 17:10, Michael Marth wrote:
>> ...
>>
>> The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet).
>> As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.
>>
>> Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?
>>
> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
> we're done within it, then we can output the actual number. In case
> after 1000 nodes checked we still have some left we can leave the number
> either empty or with something like "many", "+", or any other fancy way
> if we want.
>
> In the end is the same approach taken by Amazon (as Tommaso already
> pointed) or for example google. If you run a search, their facets
> (Searches related to...) are never with results.

I don't think Amazon and Google have customers that can demand them to
show correct facet counts...our customers typically do :). My take
on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the
backing search implementation (without node-by-node access manager
check), whether it is the internal Lucene index, Solr or Elastic
Search. If you opt for the second option, then, depending on your
authorization model you can get fast exact authorized counts as well :
When the authorization model can be translated into a search query /
filter that is AND-ed with every normal search. For ES this is briefly
written at [1]. Most likely the filter is internally cached so even
for very large authorization queries (like we have at Hippo because of
fine grained ACL model) it will just perform. Obviously it depends
quite heavily on your authorization model whether it can be translated
to a query. If  it relies on an external authorization check or has
many hierarchical constraints, it will be very hard. If you choose to
have it based on, say, nodetype, nodename, node properties and
jcr:path (fake pseudo property) it can be easily translated to a
query. Note that for the jcr:path hierarchical ACL (e.g. read everything
below /foo) it is not possible to write a Lucene query easily unless
you index path information as well... this means that moves of
large subtrees are slow because the entire subtree needs to be
re-indexed. A different authorization model might be based on groups,
where every node also gets the groups (the token of the group) indexed
that can read that node. Although I never looked much into the code, I
suspect [2] does something like this.
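A minimal sketch of that group-token idea in plain Java (not actual Lucene/ES API; all names here are made up for illustration): every document carries the tokens of the groups allowed to read it, the user's group tokens act as a filter that is effectively AND-ed with the query, and the facet counts are taken from the filtered set only, so they come out both fast and authorized:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: documents are indexed with the group tokens that
// may read them; facet counts are computed only over documents the
// user's groups can read, so no per-node access manager check is needed.
public class GroupTokenFacets {

    record Doc(String facetValue, Set<String> readTokens) {}

    // Count facet values over the hits whose read tokens intersect the
    // user's group tokens (the "authorization filter" AND-ed with the query).
    static Map<String, Integer> facetCounts(List<Doc> hits, Set<String> userGroups) {
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : hits) {
            boolean readable = d.readTokens().stream().anyMatch(userGroups::contains);
            if (readable) {
                counts.merge(d.facetValue(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> hits = List.of(
            new Doc("image/png", Set.of("everyone")),
            new Doc("image/png", Set.of("hr-only")),
            new Doc("video/mp4", Set.of("everyone", "hr-only")));
        // A user in "everyone" counts 1 png and 1 mp4; the hr-only png is excluded.
        System.out.println(facetCounts(hits, Set.of("everyone")));
    }
}
```

In a real index the token filter would of course be pushed down into the query (and cached), not applied in a post-loop as above.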

So, instead of second guessing which might be acceptable (slow
queries, wrong counts, etc) for which customers/users I'd try to keep
the options open, have a default of correct (slow) counts, and make it
easy to flip to 'counts from the indexes without accessmanager
authorization', where depending on the authorization model, the latter
can return correct results.

For those who are interested, I will be listening to [3] this
afternoon (5 pm GMT).

Regards Ard

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/


>
> D.
>
>
>
>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Davide Giannella <da...@apache.org>.
On 09/12/2014 17:10, Michael Marth wrote:
> ...
>
> The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet).
> As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.
>
> Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?
>
We can opt for ACL-checking/parsing at most X (let's say 1000) nodes. If
we're done within that limit, then we can output the actual number. In case
after 1000 checked nodes we still have some left, we can leave the number
either empty or show something like "many", "+", or any other fancy way
if we want.

In the end it is the same approach taken by Amazon (as Tommaso already
pointed out) or, for example, Google. If you run a search, their facets
("Searches related to...") are never shown with counts.
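The capped-counting idea above can be sketched in a few lines of plain Java (hypothetical names; a simple predicate stands in for the actual ACL check):

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of capped ACL-aware facet counting: check at most
// `limit` nodes; if matches remain after the limit, report an
// open-ended count instead of an exact one.
public class CappedFacetCount {

    // Returns e.g. "3" if all candidate nodes could be ACL-checked within
    // the limit, or e.g. "2+" if the limit was hit with nodes left over.
    static String count(List<String> paths, Predicate<String> aclCanRead, int limit) {
        int readable = 0;
        int checked = 0;
        for (String path : paths) {
            if (checked == limit) {
                return readable + "+"; // gave up: at least this many readable
            }
            checked++;
            if (aclCanRead.test(path)) {
                readable++;
            }
        }
        return String.valueOf(readable); // exact count
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/a", "/b", "/secret", "/c");
        Predicate<String> acl = p -> !p.startsWith("/secret");
        System.out.println(count(paths, acl, 1000)); // prints 3 (exact)
        System.out.println(count(paths, acl, 2));    // prints 2+ (capped)
    }
}
```

The "+" suffix here is just one possible rendering of "many"; the point is only that the cost of the ACL check is bounded regardless of result-set size.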

D.





Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Lukas Kahwe Smith <sm...@pooteeweet.org>.
> On 09 Dec 2014, at 18:10, Michael Marth <mm...@adobe.com> wrote:
> 
> Hi,
> 
> I agree that facets *with* counts are better than without counts, but disagree that they are worthless without counts (see the Amazon link Tommaso posted earlier on this thread). There is value in providing the information that *some* results will appear when a user selects a facet .
> 
> The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet).
> As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.
> 
> Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?
> 
> (I should note that I have not tested how long it takes to retrieve and ACL-check 1 million nodes - maybe my concern is invalid)

Yeah, such stuff can easily cause severe slowdowns. So making the count optional, or counting only up to some specified max value, is nice but complicates the API.

regards,
Lukas Kahwe Smith
smith@pooteeweet.org




Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Michael Marth <mm...@adobe.com>.
Hi,

I agree that facets *with* counts are better than without counts, but disagree that they are worthless without counts (see the Amazon link Tommaso posted earlier on this thread). There is value in providing the information that *some* results will appear when a user selects a facet .

The problematic use cases for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet).
As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 videos.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?

(I should note that I have not tested how long it takes to retrieve and ACL-check 1 million nodes - maybe my concern is invalid)

Best regards
Michael


On 09 Dec 2014, at 09:57, Thomas Mueller <mu...@adobe.com> wrote:

> Hi,
> 
>> I would like the counts.
> 
> I agree. I guess this feature doesn't make much sense without the counts.
> 
>> 1, 2, and 4 seem like
>> bad ideas
> 
>> 1 undercuts the idea that we'd use lucene/solr to get decent
>> performance. 
> 
> Sorry I don't understand... This is just about the API to retrieve the
> data. It still uses Lucene/Solr (the same as all other options). I'm not
> sure if you talk about the performance overhead of converting the facet
> data to a string and back? This performance overhead is very very small (I
> assume not measurable).
> 
> Regards,
> Thomas
> 


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>I would like the counts.

I agree. I guess this feature doesn't make much sense without the counts.

>1, 2, and 4 seem like
>bad ideas

>1 undercuts the idea that we'd use lucene/solr to get decent
>performance. 

Sorry I don't understand... This is just about the API to retrieve the
data. It still uses Lucene/Solr (the same as all other options). I'm not
sure whether you mean the performance overhead of converting the facet
data to a string and back? That performance overhead is very, very small (I
assume it is not measurable).

Regards,
Thomas


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Laurie Byrum <lb...@adobe.com>.
I guess that returning the facets without the counts really weakens the
story of facets. Yes, Amazon does it for some searches, but usually it
does not. For the use case I have in mind, I would like the counts.

Options 3 or 6 seem like decent avenues to explore. 1, 2, and 4 seem like
bad ideas (1 undercuts the idea that we'd use lucene/solr to get decent
performance. 2 drops the counts. 4 feels like something we would regret,
because of the complexity). I'll admit it: I didn't understand option 5.

Thanks,
Laurie


On 12/8/14 2:19 AM, "Michael Marth" <mm...@adobe.com> wrote:

>Hi,
>
>About security, I wonder what are the common configurations. I think we
>should avoid a complex (but slow, and hard to implement) solution that can
>solve 100% of all possible _theoretical_ cases, but instead go for a
>(faster, simpler) solution that covers 99% of all _pratical_ cases.
>
>I am not sure if you are hinting towards one of the proposed approaches
>with that statement. IMO this simplification suggested by Tommaso makes
>sense:
>
>only if there's at least one item (node) in the
>(filtered) results which falls under that facet. That would mean that we
>would not return the counts of the facets, but a facet would be returned
>if
>there's at least one item in the results belonging to it
>
>Best regards
>Michael


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Michael Marth <mm...@adobe.com>.
Hi,

About security, I wonder what are the common configurations. I think we
should avoid a complex (but slow, and hard to implement) solution that can
solve 100% of all possible _theoretical_ cases, but instead go for a
(faster, simpler) solution that covers 99% of all _practical_ cases.

I am not sure if you are hinting towards one of the proposed approaches with that statement. IMO this simplification suggested by Tommaso makes sense:

only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned if
there's at least one item in the results belonging to it

Best regards
Michael

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

I think we should do:


> 1. conservative approach, do not touch JCR API


> select [jcr:path], [facet(jcr:primaryType)] from [nt:base]
> where contains([text, 'oak']);

The column "facet(jcr:primaryType)" would return the facet data. I think
that's a good approach. The question is, which rows would return that
data. I would prefer a solution where _each_ row returns the data (and not
just the first row), because that's a bit easier to use, easier to
document, and more closely matches the relational model. If just the first
row returns the facet data, then we can't sort the result afterwards
(otherwise the facet data ends up in another row, which would be weird).

Another approach is to extend the API (create a new interface, for example
OakQuery). The JDBC API (but not the JCR API) has a concept of multiple
result sets per query (Statement.getMoreResults). We could build a
solution that more closely matches this model. But I don't think it's
worth the trouble right now (we could still do that later on if really
needed).

About security, I wonder what are the common configurations. I think we
should avoid a complex (but slow, and hard to implement) solution that can
solve 100% of all possible _theoretical_ cases, but instead go for a
(faster, simpler) solution that covers 99% of all _practical_ cases.


Regards,
Thomas




On 05/12/14 12:13, "Tommaso Teofili" <to...@gmail.com> wrote:

>Hi all,
>
>I am resurrecting this thread as I've managed to find some time to start
>having a look at how to support faceting in Oak query engine.
>
>One important thing is that I agree with Ard (and I've seen it like that
>from the beginning) that since we have Lucene and Solr Oak index
>implementations we should rely on them for such advanced features [1][2]
>instead of reinventing the wheel.
>
>Within the above assumption the implementation seems quite
>straightforward.
>The not so obvious bits come when getting to:
>- exposing facets within the JCR API
>- correctly filtering facets depending on authorization / privileges
>
>For the former, here is a quick list of options that came to my mind
>(originated also when talking to people f2f about this):
>1. conservative approach, do not touch JCR API: facets are retrieved as
>custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
>row.getValue("facet(jcr:primaryType)")).
>2. Oak-only approach, do not touch JCR API but provide utilities which can
>retrieve structured facets from the result, e.g. Iterable<Facet> facets =
>OakQueryUtils.extractFacets(QueryResult.getRows());
>3. not JCR compliant approach, we add methods to the API similarly to what
>Ard and AlexK proposed
>4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
>where QueryResult can be adapted to different things and therefore it's
>more extensible (but less controllable).
>Of course other proposals are welcome on this.
>
>For the latter, things seem less simple, as I foresee that we want the
>facets to be consistent with the result nodes and therefore to be filtered
>according to the privileges of the user who issued the query.
>Here are the options I could think of so far, even though none looks
>satisfactory to me yet:
>
>1. retrieving facets and then filtering them afterwards has an inherent
>issue: the facets do not include information about the documents (nodes)
>which generated them, so they come back unfiltered (as the index has no
>information about ACLs). For example, a facet on jcr:primaryType:
>
>"jcr:primaryType" : {
>    "nt:unstructured" : 100,
>    "nt:file" : 20,
>    "oak:Unstructured" : 10
>}
>
>Filtering these afterwards would require us to either iterate over the
>results and filter the counts as we iterate, or run N further queries to
>filter the counts; but then it would be useless to have the facets
>returned from the index, as we would effectively be retrieving them
>ourselves to do the ACL checks, or resort to other such brute-force
>methods.
>
>2. retrieve the facets unfiltered from the index and then return them in
>the filtered results only if there's at least one item (node) in the
>(filtered) results which falls under that facet. That would mean that we
>would not return the counts of the facets, but a facet would be returned
>if
>there's at least one item in the results belonging to it. While that is
>not ideal (and a pity, as we're losing some information we have along the
>way), Amazon does exactly that (see the "Show results for" column on the
>left at [3]) :-)
>
>3. use a slightly different mechanism for returning facets, called result
>grouping (or field collapsing) in Solr [4], in which results are returned
>grouped (and counted) by a certain field. The example of point 1 would
>look
>like:
>
>"grouped":{
>  "jcr:primaryType":{
>    "matches": 130,
>    "groups":[{
>        "groupValue":"nt:unstructured",
>        "doclist":{"numFound":100,"start":0,"docs":[
>            {
>              "path":"/content/a/b"
>            }, ...
>          ]
>        }},
>      {
>        "groupValue":"nt:file",
>        "doclist":{"numFound":20,"start":0,"docs":[
>            {
>              "path":"/content/d/e"
>            }, ...
>          ]
>        }},
>...
>
>there the facets would also contain (some or all of) the docs (nodes)
>belonging to each group and therefore filtering the facets afterwards
>could
>be done without having to retrieve the paths of the nodes falling under
>each facet.
>
>4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
>and incorporate the ACLs in the index so that no further filtering has to
>be done once the underlying query index has returned its results. However
>this comes with a non-trivial impact with regard to: a) load of the
>indexing on the repo (each time some ACL changes a bunch of index updates
>happen)  b) complexity in encoding ACLs in the indexed documents c)
>complexity in encoding the ACL check in the index-specific queries. Still
>this is probably something we may evaluate regardless of facets in the
>future as the lazy ACL check approach we have has, IIUTC, the following
>issue: userA searching for 'jcr:title=foo', the query engine selecting the
>Lucene property index which returns 100 docs, userA being only able to see
>2 of them because of its ACLs; in this case we have wasted (approximately)
>98% of the Lucene effort to match and return the documents. However this
>is
>most probably overkill for now...
>
>5. another probably crazy idea is user filtered indexes, meaning that the
>NodeStates passed to such IndexEditors would be filtered according to what
>a configured user (list) can see / read. The obvious disadvantage is the
>possible pollution of such indexes and the consequent repository growth.
>
>6. at query time map the user ACLs to a list of (readable) paths, since
>both Lucene and Solr index implementations index the exact path of each
>node, such a list may be passed as a "filter query" to be used to find the
>subset of nodes that such a user can see, and therefore the results
>(facets
>included) would come already filtered. The questions here are: a) is it
>possible to do this mapping at all? b) how slow would it be? Also, the
>implementation would probably require encoding the paths so that they are
>shorter (possibly as numbers) to keep searches fast.
>
>Again, other proposals on authorization are welcome, and I'll keep
>thinking about / investigating other approaches too.
>
>Thanks to the brave who could read so far, after so many words a bit of
>code of the first PoC of facets based on the Solr index [7]: facets there
>are not filtered by ACLs and are returned as columns in
>QueryResult.getRows() (JCR API conservative option seemed better to
>start).
>A sample query to retrieve facets would be:
>
>select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where
>contains([text, 'oak']);
>
>and the result would look like:
>
>RowIterator rows = query.execute().getRows();
>while (rows.hasNext()) {
>    Row row = rows.nextRow();
>    String facetString = row.getValue("facet(jcr:primaryType)").getString();
>    // --> jcr:primaryType:[nt:unstructured (100), nt:file (20),
>    // oak:Unstructured (10)]
>    ...
>}
>
>Looking forward to your comments on the mentioned approaches (and code,
>eventually).
>Regards,
>Tommaso
>
>[1] : http://lucene.apache.org/core/4_10_2/facet/org/apache/lucene/facet/package-summary.html
>[2] : https://cwiki.apache.org/confluence/display/solr/Faceting
>[3] : http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sony
>[4] : https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>[5] : http://en.wikipedia.org/wiki/Database_index#Covering_index
>[6] : http://markmail.org/message/4i5d55235oo26okl
>[7] : https://github.com/tteofili/jackrabbit-oak/compare/oak-1736a#files_bucket
>
>2014-09-01 13:11 GMT+02:00 Ard Schrijvers <a....@onehippo.com>:
>
>> Hey Alex,
>>
>> On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
>> <ak...@adobe.com> wrote:
>> > On 29.08.2014, at 03:10, Ard Schrijvers <a....@onehippo.com>
>> wrote:
>> >
>> >> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
>> >> layers any more to expose them over pure JCR spec API's. Instead, we
>> >> would extend the jcr QueryResult to have next to getRows/getNodes/etc
>> >> also expose for example methods on the QueryResult like
>> >>
>> >> public Map<String, Integer> getFacetValues(final String facet) {
>> >>      return result.getFacetValues(facet);
>> >> }
>> >>
>> >> public QueryResult drilldown(final FacetValue facetValue) {
>> >>        // return current query result drilled down for facet value
>> >>        return ...
>> >> }
>> >
>> > We actually have a similar API in our CQ/AEM product:
>> >
>> > Query => represents a query [1]
>> > SearchResult result = query.getResult();
>> > Map<String, Facet> facets = result.getFacets();
>> >
>> > A facet is a list of "Buckets" [2] - same as FacetValue above, I
>>assume
>> - an abstraction over different values. You could have distinctive
>>values
>> (e.g. "red", "green", "blue"), but also ranges ("last year", "last
>>month"
>> etc.). Each bucket has a count, i.e. the number of times it occurs in
>>the
>> current result.
>> >
>> > Then on Query you have a method
>> >
>> > Query refine(Bucket bucket)
>> >
>> > which is the same as the drilldown above.
>> >
>> > So in the end it looks pretty much the same, and seems to be a good
>>way
>> to represent this as API. Doesn't say much about the implementation yet,
>> though :)
>>
>> It looks very much the same, and I must admit that during typing my
>> mail I didn't put too much attention to things like how to name
>> something (I reckon that #refine is a much nicer name than the
>> drillDown I wrote :-)
>>
>> >
>> >> 2) Authorized counts....for faceting, it doesn't make sense to expose
>> >> there are 314 results if you can only read 54 of them. Accounting for
>> >> authorization through access manager can be way too slow.
>> >> ...
>> >> 3) If you support faceting through Oak, will that be competitive
>> >> enough to what Solr and Elasticsearch offer? Customers these days
>>have
>> >> some expectations on search result quality and faceting capabilities,
>> >> performance included.
>> >> ...
>> >> So, my take would be to invest time in easy integration with
>> >> solr/elasticsearch and focus in Oak on the parts (hierarchy,
>> >> authorization, merging, versioning) that aren't covered by already
>> >> existing frameworks. Perhaps provide an extended JCR API as described
>> >> in (1) which under the hood can delegate to a solr or es java client.
>> >> In the end, you'll still end up having the authorized counts issue,
>> >> but if you make the integration pluggable enough, it might be
>>possible
>> >> to leverage domain specific solutions to this (solr/es doesn't do
>> >> anything with authorization either, it is a tough nut to crack)
>> >
>> > Good points. When facets are used, the worst case (showing facets for
>> all your content) might actually be the very first thing you see, when
>> something like a product search/browse page is shown, before any actual
>> search by the user is done. Optimizing for performance right from the
>>start
>> is a must, I agree.
>> >
>> > What I can imagine though, is if you can leverage some kind of caching
>> though. In practice, if you have a public site with content that does
>>not
>> change permanently, the facet values are pretty much stable, and
>> authorization shouldn't cost much.
>>
>> Certainly there are many use cases where you can cache a lot, or for
>> example have a public site user that has read access to an entire
>> content tree. It becomes however much more difficult when you want to
>> for example expose faceted structure of documents to an editor in a
>> cms environment, where the editor has read access to only 1% of the
>> documents. If at the same time, her initial query without
>> authorization results in, say, 10 million hits, then you'll have to
>> authorize all of them to get correct counts. The only way we could
>> make this perform well with Hippo CMS against Jackrabbit was by
>> translating our authorization model directly to Lucene
>> queries and keep caching (authorized) bitsets (slightly different in
>> newer lucene versions) in memory for a user, see [1]. The difficulty
>> was that even executing the authorization query (to AND with normal
>> query) became slow because of very large queries, but fortunately due
>> to the jackrabbit 2 index implementation, we could keep a cached
>> bitset per indexreader, see [2]. Unfortunately, this solution can only
>> be done for specific authorization models (which can be mapped to
>> lucene queries) and might not be generic enough for oak.
>>
>> Anyway, apart from performance / authorization, I doubt whether Oak
>> will be able to keep up with what can be leveraged through ES or Solr.
>>
>> Regards Ard
>>
>> [1] http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
>> [2] http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html
>>
>> >
>> > [1] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
>> > [2] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
>> >
>> > Cheers,
>> > Alex
>>
>>
>>
>> --
>> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
>> Boston - 1 Broadway, Cambridge, MA 02142
>>
>> US +1 877 414 4776 (toll free)
>> Europe +31(0)20 522 4466
>> www.onehippo.com
>>


Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Tommaso Teofili <to...@gmail.com>.
Hi all,

I am resurrecting this thread as I've managed to find some time to start
having a look at how to support faceting in Oak query engine.

One important thing is that I agree with Ard (and I've seen it like that
from the beginning) that since we have Lucene and Solr Oak index
implementations we should rely on them for such advanced features [1][2]
instead of reinventing the wheel.

Within the above assumption the implementation seems quite straightforward.
The not so obvious bits come when getting to:
- exposing facets within the JCR API
- correctly filtering facets depending on authorization / privileges

For the former, here is a quick list of options that came to my mind
(originated also when talking to people f2f about this):
1. conservative approach, do not touch JCR API: facets are retrieved as
custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
row.getValue("facet(jcr:primaryType)")).
2. Oak-only approach, do not touch JCR API but provide utilities which can
retrieve structured facets from the result, e.g. Iterable<Facet> facets =
OakQueryUtils.extractFacets(QueryResult.getRows());
3. not JCR compliant approach, we add methods to the API similarly to what
Ard and AlexK proposed
4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
where QueryResult can be adapted to different things and therefore it's
more extensible (but less controllable).
Of course other proposals are welcome on this.

For the latter, things seem less simple, as I foresee that we want the
facets to be consistent with the result nodes and therefore to be filtered
according to the privileges of the user who issued the query.
Here are the options I could think of so far, even though none looks
satisfactory to me yet:

1. retrieving facets and then filtering them afterwards has an inherent
issue: the facets do not include information about the documents (nodes)
which generated them, so they come back unfiltered (as the index has no
information about ACLs). For example, a facet on jcr:primaryType:

"jcr:primaryType" : {
    "nt:unstructured" : 100,
    "nt:file" : 20,
    "oak:Unstructured" : 10
}

Filtering these afterwards would require us to either iterate over the
results and filter the counts as we iterate, or run N further queries to
filter the counts; but then it would be useless to have the facets
returned from the index, as we would effectively be retrieving them
ourselves to do the ACL checks, or resort to other such brute-force methods.

2. retrieve the facets unfiltered from the index and then return them in
the filtered results only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned if
there's at least one item in the results belonging to it. While that is
not ideal (and a pity, as we're losing some information we have along the
way), Amazon does exactly that (see the "Show results for" column on the
left at [3]) :-)
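A minimal sketch of option 2's post-filtering step, under the assumption that we already have the raw facet counts from the index plus the set of facet values observed while iterating the ACL-filtered result set (all names here are illustrative, not Oak API):

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Option 2 sketch: drop the counts and keep only facet values that occur
// in at least one result the current user is allowed to read.
public class BooleanFacetFilter {

    /**
     * @param unfilteredCounts raw facet counts from the index, e.g. {nt:file=20}
     * @param readableValues   facet values seen while iterating the
     *                         access-controlled (filtered) results
     * @return the facet values visible to the current user, without counts
     */
    public static Set<String> filter(Map<String, Integer> unfilteredCounts,
                                     Set<String> readableValues) {
        Set<String> visible = new LinkedHashSet<>();
        for (String value : unfilteredCounts.keySet()) {
            if (readableValues.contains(value)) {
                visible.add(value);
            }
        }
        return visible;
    }
}
```

Note that this only needs the first page of filtered results to decide which facet values to show, which is what makes the Amazon-style "no counts" trade-off cheap.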

3. use a slightly different mechanism for returning facets, called result
grouping (or field collapsing) in Solr [4], in which results are returned
grouped (and counted) by a certain field. The example of point 1 would look
like:

"grouped":{
  "jcr:primaryType":{
    "matches": 130,
    "groups":[{
        "groupValue":"nt:unstructured",
        "doclist":{"numFound":100,"start":0,"docs":[
            {
              "path":"/content/a/b"
            }, ...
          ]
        }},
      {
        "groupValue":"nt:file",
        "doclist":{"numFound":20,"start":0,"docs":[
            {
              "path":"/content/d/e"
            }, ...
          ]
        }},
...

there the facets would also contain (some or all of) the docs (nodes)
belonging to each group and therefore filtering the facets afterwards could
be done without having to retrieve the paths of the nodes falling under
each facet.

4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
and incorporate the ACLs in the index so that no further filtering has to
be done once the underlying query index has returned its results. However
this comes with a non-trivial impact with regard to: a) load of the
indexing on the repo (each time some ACL changes a bunch of index updates
happen)  b) complexity in encoding ACLs in the indexed documents c)
complexity in encoding the ACL check in the index-specific queries. Still
this is probably something we may evaluate regardless of facets in the
future as the lazy ACL check approach we have has, IIUTC, the following
issue: userA searching for 'jcr:title=foo', the query engine selecting the
Lucene property index which returns 100 docs, userA being only able to see
2 of them because of its ACLs; in this case we have wasted (approximately)
98% of the Lucene effort to match and return the documents. However this is
most probably overkill for now...

5. another probably crazy idea is user filtered indexes, meaning that the
NodeStates passed to such IndexEditors would be filtered according to what
a configured user (list) can see / read. The obvious disadvantage is the
possible pollution of such indexes and the consequent repository growth.

6. at query time map the user ACLs to a list of (readable) paths, since
both Lucene and Solr index implementations index the exact path of each
node, such a list may be passed as a "filter query" to be used to find the
subset of nodes that such a user can see, and therefore the results (facets
included) would come already filtered. The questions here are: a) is it
possible to do this mapping at all? b) how slow would it be? Also, the
implementation would probably require encoding the paths so that they are
shorter (possibly as numbers) to keep searches fast.
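A rough sketch of how option 6's readable-path list could be turned into a Solr-style filter query. The field name "path", the escaping, and the trailing-wildcard encoding for descendants are assumptions for illustration, not the actual Oak Solr index schema:

```java
import java.util.List;
import java.util.StringJoiner;

// Option 6 sketch: restrict results (and thus facets) to the subtrees a
// user may read, by OR-ing one clause per readable path into a filter query.
public class ReadablePathsFilterQuery {

    // escape characters that are query syntax in Lucene/Solr (minimal set)
    private static String escape(String s) {
        return s.replaceAll("([:/ ])", "\\\\$1");
    }

    public static String toFilterQuery(List<String> readablePaths) {
        StringJoiner fq = new StringJoiner(" OR ");
        for (String path : readablePaths) {
            // the node itself
            fq.add("path:" + escape(path));
            // any descendant of it (wildcard encoding is an assumption)
            fq.add("path:" + escape(path.endsWith("/") ? path : path + "/") + "*");
        }
        return fq.toString();
    }
}
```

The open question from the mail remains: with many readable paths this query can grow very large, which is exactly the problem Ard describes with the Hippo authorization queries.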

Again, other proposals on authorization are welcome, and I'll keep thinking
about / investigating other approaches too.

Thanks to the brave who could read so far, after so many words a bit of
code of the first PoC of facets based on the Solr index [7]: facets there
are not filtered by ACLs and are returned as columns in
QueryResult.getRows() (JCR API conservative option seemed better to start).
A sample query to retrieve facets would be:

select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where
contains([text, 'oak']);

and the result would look like:

RowIterator rows = query.execute().getRows();
while (rows.hasNext()) {
    Row row = rows.nextRow();
    String facetString = row.getValue("facet(jcr:primaryType)").getString();
    // --> jcr:primaryType:[nt:unstructured (100), nt:file (20), oak:Unstructured (10)]
    ...
}
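For illustration, client code could parse such a facet string back into counts. The serialization format below is inferred from the example output above and may well differ from what the actual PoC emits:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the facet column string shown in the sample,
// e.g. "jcr:primaryType:[nt:unstructured (100), nt:file (20)]".
public class FacetStringParser {

    // matches entries like "nt:file (20)" or "oak:Unstructured(10)"
    private static final Pattern ENTRY =
            Pattern.compile("([^,\\[\\]]+?)\\s*\\((\\d+)\\)");

    public static Map<String, Integer> parse(String facetString) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int start = facetString.indexOf('[');
        int end = facetString.lastIndexOf(']');
        String body = (start >= 0 && end > start)
                ? facetString.substring(start + 1, end)
                : facetString;
        Matcher m = ENTRY.matcher(body);
        while (m.find()) {
            counts.put(m.group(1).trim(), Integer.parseInt(m.group(2)));
        }
        return counts;
    }
}
```

This string-round-tripping is also an argument for option 2 (the Oak-only utility returning structured facets), which would spare clients the parsing.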

Looking forward to your comments on the mentioned approaches (and code,
eventually).
Regards,
Tommaso

[1] :
http://lucene.apache.org/core/4_10_2/facet/org/apache/lucene/facet/package-summary.html
[2] : https://cwiki.apache.org/confluence/display/solr/Faceting
[3] :
http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=sony
[4] : https://cwiki.apache.org/confluence/display/solr/Result+Grouping
[5] : http://en.wikipedia.org/wiki/Database_index#Covering_index
[6] : http://markmail.org/message/4i5d55235oo26okl
[7] :
https://github.com/tteofili/jackrabbit-oak/compare/oak-1736a#files_bucket

2014-09-01 13:11 GMT+02:00 Ard Schrijvers <a....@onehippo.com>:

> Hey Alex,
>
> On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
> <ak...@adobe.com> wrote:
> > On 29.08.2014, at 03:10, Ard Schrijvers <a....@onehippo.com>
> wrote:
> >
> >> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
> >> layers any more to expose them over pure JCR spec API's. Instead, we
> >> would extend the jcr QueryResult to have next to getRows/getNodes/etc
> >> also expose for example methods on the QueryResult like
> >>
> >> public Map<String, Integer> getFacetValues(final String facet) {
> >>      return result.getFacetValues(facet);
> >> }
> >>
> >> public QueryResult drilldown(final FacetValue facetValue) {
> >>        // return current query result drilled down for facet value
> >>        return ...
> >> }
> >
> > We actually have a similar API in our CQ/AEM product:
> >
> > Query => represents a query [1]
> > SearchResult result = query.getResult();
> > Map<String, Facet> facets = result.getFacets();
> >
> > A facet is a list of "Buckets" [2] - same as FacetValue above, I assume
> - an abstraction over different values. You could have distinctive values
> (e.g. "red", "green", "blue"), but also ranges ("last year", "last month"
> etc.). Each bucket has a count, i.e. the number of times it occurs in the
> current result.
> >
> > Then on Query you have a method
> >
> > Query refine(Bucket bucket)
> >
> > which is the same as the drilldown above.
> >
> > So in the end it looks pretty much the same, and seems to be a good way
> to represent this as API. Doesn't say much about the implementation yet,
> though :)
>
> It looks very much the same, and I must admit that during typing my
> mail I didn't put too much attention to things like how to name
> something (I reckon that #refine is a much nicer name than the
> drillDown I wrote :-)
>
> >
> >> 2) Authorized counts....for faceting, it doesn't make sense to expose
> >> there are 314 results if you can only read 54 of them. Accounting for
> >> authorization through access manager can be way too slow.
> >> ...
> >> 3) If you support faceting through Oak, will that be competitive
> >> enough to what Solr and Elasticsearch offer? Customers these days have
> >> some expectations on search result quality and faceting capabilities,
> >> performance included.
> >> ...
> >> So, my take would be to invest time in easy integration with
> >> solr/elasticsearch and focus in Oak on the parts (hierarchy,
> >> authorization, merging, versioning) that aren't covered by already
> >> existing frameworks. Perhaps provide an extended JCR API as described
> >> in (1) which under the hood can delegate to a solr or es java client.
> >> In the end, you'll still end up having the authorized counts issue,
> >> but if you make the integration pluggable enough, it might be possible
> >> to leverage domain specific solutions to this (solr/es doesn't do
> >> anything with authorization either, it is a tough nut to crack)
> >
> > Good points. When facets are used, the worst case (showing facets for
> all your content) might actually be the very first thing you see, when
> something like a product search/browse page is shown, before any actual
> search by the user is done. Optimizing for performance right from the start
> is a must, I agree.
> >
> > What I can imagine though, is if you can leverage some kind of caching
> though. In practice, if you have a public site with content that does not
> change permanently, the facet values are pretty much stable, and
> authorization shouldn't cost much.
>
> Certainly there are many use cases where you can cache a lot, or for
> example have a public site user that has read access to an entire
> content tree. It becomes however much more difficult when you want to
> for example expose faceted structure of documents to an editor in a
> cms environment, where the editor has read access to only 1% of the
> documents. If at the same time, her initial query without
> authorization results in, say, 10 million hits, then you'll have to
> authorize all of them to get correct counts. The only way we could
> make this perform well with Hippo CMS against Jackrabbit was by
> translating our authorization model directly to Lucene
> queries and keep caching (authorized) bitsets (slightly different in
> newer lucene versions) in memory for a user, see [1]. The difficulty
> was that even executing the authorization query (to AND with normal
> query) became slow because of very large queries, but fortunately due
> to the jackrabbit 2 index implementation, we could keep a cached
> bitset per indexreader, see [2]. Unfortunately, this solution can only
> be done for specific authorization models (which can be mapped to
> lucene queries) and might not be generic enough for oak.
>
> Anyway, apart from performance / authorization, I doubt whether Oak
> will be able to keep up with what can be leveraged through ES or Solr.
>
> Regards Ard
>
> [1]
> http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
> [2]
> http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html
>
> >
> > [1]
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
> > [2]
> http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
> >
> > Cheers,
> > Alex
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
>

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Ard Schrijvers <a....@onehippo.com>.
Hey Alex,

On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
<ak...@adobe.com> wrote:
> On 29.08.2014, at 03:10, Ard Schrijvers <a....@onehippo.com> wrote:
>
>> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
>> layers any more to expose them over pure JCR spec API's. Instead, we
>> would extend the jcr QueryResult to have next to getRows/getNodes/etc
>> also expose for example methods on the QueryResult like
>>
>> public Map<String, Integer> getFacetValues(final String facet) {
>>      return result.getFacetValues(facet);
>> }
>>
>> public QueryResult drilldown(final FacetValue facetValue) {
>>        // return current query result drilled down for facet value
>>        return ...
>> }
>
> We actually have a similar API in our CQ/AEM product:
>
> Query => represents a query [1]
> SearchResult result = query.getResult();
> Map<String, Facet> facets = result.getFacets();
>
> A facet is a list of "Buckets" [2] - same as FacetValue above, I assume - an abstraction over different values. You could have distinctive values (e.g. "red", "green", "blue"), but also ranges ("last year", "last month" etc.). Each bucket has a count, i.e. the number of times it occurs in the current result.
>
> Then on Query you have a method
>
> Query refine(Bucket bucket)
>
> which is the same as the drilldown above.
>
> So in the end it looks pretty much the same, and seems to be a good way to represent this as API. Doesn't say much about the implementation yet, though :)

It looks very much the same, and I must admit that while typing my
mail I didn't pay much attention to things like how to name
something (I reckon that #refine is a much nicer name than the
drillDown I wrote :-)

>
>> 2) Authorized counts....for faceting, it doesn't make sense to expose
>> there are 314 results if you can only read 54 of them. Accounting for
>> authorization through access manager can be way too slow.
>> ...
>> 3) If you support faceting through Oak, will that be competitive
>> enough to what Solr and Elasticsearch offer? Customers these days have
>> some expectations on search result quality and faceting capabilities,
>> performance included.
>> ...
>> So, my take would be to invest time in easy integration with
>> solr/elasticsearch and focus in Oak on the parts (hierarchy,
>> authorization, merging, versioning) that aren't covered by already
>> existing frameworks. Perhaps provide an extended JCR API as described
>> in (1) which under the hood can delegate to a solr or es java client.
>> In the end, you'll still end up having the authorized counts issue,
>> but if you make the integration pluggable enough, it might be possible
>> to leverage domain specific solutions to this (solr/es doesn't do
>> anything with authorization either, it is a tough nut to crack)
>
> Good points. When facets are used, the worst case (showing facets for all your content) might actually be the very first thing you see, when something like a product search/browse page is shown, before any actual search by the user is done. Optimizing for performance right from the start is a must, I agree.
>
> What I can imagine though, is if you can leverage some kind of caching though. In practice, if you have a public site with content that does not change permanently, the facet values are pretty much stable, and authorization shouldn't cost much.

Certainly there are many use cases where you can cache a lot, or for
example have a public site user that has read access to an entire
content tree. It becomes however much more difficult when you want to
for example expose faceted structure of documents to an editor in a
cms environment, where the editor has read access to only 1% of the
documents. If at the same time, her initial query without
authorization results in, say, 10 million hits, then you'll have to
authorize all of them to get correct counts. The only way we could
make this perform well with Hippo CMS against Jackrabbit was by
translating our authorization model directly to Lucene
queries and keep caching (authorized) bitsets (slightly different in
newer lucene versions) in memory for a user, see [1]. The difficulty
was that even executing the authorization query (to AND with normal
query) became slow because of very large queries, but fortunately due
to the jackrabbit 2 index implementation, we could keep a cached
bitset per indexreader, see [2]. Unfortunately, this solution can only
be done for specific authorization models (which can be mapped to
lucene queries) and might not be generic enough for oak.

Anyway, apart from performance / authorization, I doubt whether Oak
will be able to keep up with what can be leveraged through ES or Solr.

Regards Ard

[1] http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
[2] http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html

>
> [1] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
> [2] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html
>
> Cheers,
> Alex



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Bertrand Delacretaz <bd...@apache.org>.
On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
<ak...@adobe.com> wrote:
> ...you can leverage some kind of caching though. In practice, if you have a public site
> with content that does not change permanently, the facet values are pretty much
> stable, and authorization shouldn't cost much....

Yes, I think it's very rare to require facets to be immediately up to
date after content changes; updating them (or the related caches)
asynchronously with low priority should be good enough for the vast
majority of cases.

So maybe the facet indexes and caches can be handled differently than
"primary" queries, with more lenient update latency requirements.
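A lenient, TTL-based facet cache along these lines could be sketched as follows (plain Java, no Oak types; all names are made up for illustration): counts are recomputed at most once per TTL window, so facet values may lag content changes by up to the TTL, which is usually acceptable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a facet cache with lenient update latency:
// the loader (the real facet computation) only runs when the cached
// entry for a query is missing or older than the TTL.
class LenientFacetCache {

    private static final class Entry {
        final Map<String, Integer> counts;
        final long loadedAt;
        Entry(Map<String, Integer> counts, long loadedAt) {
            this.counts = counts;
            this.loadedAt = loadedAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    LenientFacetCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    Map<String, Integer> facets(String queryKey,
                                Supplier<Map<String, Integer>> loader) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(queryKey);
        if (e == null || now - e.loadedAt > ttlMillis) {
            e = new Entry(loader.get(), now); // recompute lazily
            cache.put(queryKey, e);
        }
        return e.counts;
    }
}
```

A background variant could refresh entries asynchronously instead of recomputing on the first stale read.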

-Bertrand

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 29.08.2014, at 03:10, Ard Schrijvers <a....@onehippo.com> wrote:

> 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
> layers any more to expose them over pure JCR spec API's. Instead, we
> would extend the jcr QueryResult to have next to getRows/getNodes/etc
> also expose for example methods on the QueryResult like
> 
> public Map<String, Integer> getFacetValues(final String facet) {
>      return result.getFacetValues(facet);
> }
> 
> public QueryResult drilldown(final FacetValue facetValue) {
>        // return current query result drilled down for facet value
>        return ...
> }

We actually have a similar API in our CQ/AEM product:

Query => represents a query [1]
SearchResult result = query.getResult();
Map<String, Facet> facets = result.getFacets();

A facet is a list of "Buckets" [2] - same as FacetValue above, I assume - an abstraction over different values. You could have distinctive values (e.g. "red", "green", "blue"), but also ranges ("last year", "last month" etc.). Each bucket has a count, i.e. the number of times it occurs in the current result.

Then on Query you have a method

Query refine(Bucket bucket)

which is the same as the drilldown above.

So in the end it looks pretty much the same, and seems to be a good way to represent this as API. Doesn't say much about the implementation yet, though :)
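The bucket abstraction described above (distinct values vs. ranges, each carrying a count) could be sketched like this in plain Java; the names are hypothetical and this is not the actual CQ/AEM implementation:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: a bucket is just a label plus a membership
// test; distinct values and ranges are two factory methods over the
// same abstraction, and each bucket counts the values that match it.
class Bucket<T> {

    private final String label;
    private final Predicate<T> matcher;
    private long count;

    private Bucket(String label, Predicate<T> matcher) {
        this.label = label;
        this.matcher = matcher;
    }

    // A distinct-value bucket, e.g. "red", "green", "blue".
    static <T> Bucket<T> value(T v) {
        return new Bucket<>(String.valueOf(v), x -> x.equals(v));
    }

    // A range bucket, e.g. "last month" over timestamps.
    static Bucket<Long> range(String label, long fromInclusive, long toExclusive) {
        return new Bucket<>(label, x -> x >= fromInclusive && x < toExclusive);
    }

    void countAll(List<T> values) {
        for (T v : values) {
            if (matcher.test(v)) {
                count++;
            }
        }
    }

    String getLabel() { return label; }
    long getCount() { return count; }
}
```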

> 2) Authorized counts.... For faceting, it doesn't make sense to expose
> that there are 314 results if you can only read 54 of them. Accounting for
> authorization through the access manager can be way too slow.
> ...
> 3) If you support faceting through Oak, will that be competitive
> enough to what Solr and Elasticsearch offer? Customers these days have
> some expectations on search result quality and faceting capabilities,
> performance included.
> ...
> So, my take would be to invest time in easy integration with
> solr/elasticsearch and focus in Oak on the parts (hierarchy,
> authorization, merging, versioning) that aren't covered by already
> existing frameworks. Perhaps provide an extended JCR API as described
> in (1) which under the hood can delegate to a solr or es java client.
> In the end, you'll still end up having the authorized counts issue,
> but if you make the integration pluggable enough, it might be possible
> to leverage domain specific solutions to this (solr/es doesn't do
> anything with authorization either, it is a tough nut to crack)

Good points. When facets are used, the worst case (showing facets for all your content) might actually be the very first thing you see, when something like a product search/browse page is shown, before any actual search by the user is done. Optimizing for performance right from the start is a must, I agree.

What I can imagine, though, is leveraging some kind of caching. In practice, if you have a public site with content that does not change constantly, the facet values are pretty much stable, and authorization shouldn't cost much.

[1] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
[2] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html

Cheers,
Alex

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello,

On Mon, Aug 25, 2014 at 7:02 PM, Lukas Smith <sm...@pooteeweet.org> wrote:
> Aloha,
>
> you should definitely talk to the Hippo CMS developers. They forked Jackrabbit 2.x to add faceting as virtual nodes. They ran into some performance issues but I am sure they still have valuable feedback on this.

Well, performance actually wasn't the biggest hurdle: exposing and
integrating virtual nodes was quite a bit tougher.

Indeed I think I might have quite some feedback, but honestly, I am
also these days full of doubts what the best approach will be. I'll
try to keep it short:

1) When exposing faceting from Jackrabbit, we wouldn't use virtual
layers any more to expose them over pure JCR spec API's. Instead, we
would extend the jcr QueryResult to have next to getRows/getNodes/etc
also expose for example methods on the QueryResult like

public Map<String, Integer> getFacetValues(final String facet) {
      return result.getFacetValues(facet);
}

public QueryResult drilldown(final FacetValue facetValue) {
        // return current query result drilled down for facet value
        return ...
}
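To make the semantics of these two methods concrete, here is a toy in-memory sketch (plain Java maps standing in for query hits; the class name is made up and none of this is Oak or Jackrabbit code): getFacetValues counts the values of one facet over the current hits, and drilldown narrows the result to hits matching one facet value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory model of the proposed API: each "hit" is a
// map of facet name -> value.
class FacetedResult {

    private final List<Map<String, String>> hits;

    FacetedResult(List<Map<String, String>> hits) {
        this.hits = hits;
    }

    // Count the occurrences of each value of the given facet.
    Map<String, Integer> getFacetValues(String facet) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(facet);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Narrow the current result to hits matching the given facet value.
    FacetedResult drilldown(String facet, String value) {
        return new FacetedResult(hits.stream()
                .filter(h -> value.equals(h.get(facet)))
                .collect(Collectors.toList()));
    }
}
```

A real implementation would of course compute the counts from the index rather than by iterating hits, but the contract of the two methods is the same.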

2) Authorized counts.... For faceting, it doesn't make sense to expose
that there are 314 results if you can only read 54 of them. Accounting
for authorization through the access manager can be way too slow. The
alternatives are to not show authorized counts, or to try to translate
the authorization model to a Lucene query, which is in general not
possible unless you restrict your authorization model severely (which
results in a domain-specific solution unusable for JR)

3) If you support faceting through Oak, will that be competitive
enough to what Solr and Elasticsearch offer? Customers these days have
some expectations on search result quality and faceting capabilities,
performance included. Oak's faceting support will be compared to
dedicated search servers and is quite unlikely to be nearly as good,
or to keep up with what is being built: aggregations are the new buzz,
a very cool superset of faceting. You really don't want to have to
support that next from Oak.

So, my take would be to invest time in easy integration with
solr/elasticsearch and focus in Oak on the parts (hierarchy,
authorization, merging, versioning) that aren't covered by already
existing frameworks. Perhaps provide an extended JCR API as described
in (1) which under the hood can delegate to a solr or es java client.
In the end, you'll still end up having the authorized counts issue,
but if you make the integration pluggable enough, it might be possible
to leverage domain specific solutions to this (solr/es doesn't do
anything with authorization either, it is a tough nut to crack)

Regards Ard

>
> regards,
> Lukas Kahwe Smith
>
>> On 25 Aug 2014, at 18:43, Laurie Byrum <lb...@adobe.com> wrote:
>>
>> Hi Tommaso,
>> I am happy to see this thread!
>>
>> Questions:
>> Do you expect to want to support hierarchical or pivoted facets soonish?
>> If so, does that influence this decision?
>> Do you know how ACLs will come into play with your facet implementation?
>> If so, does that influence this decision? :-)
>>
>> Thanks!
>> Laurie
>>
>>
>>
>>> On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> since this has been asked every now and then [1] and since I think it's a
>>> pretty useful and common feature for search engine nowadays I'd like to
>>> discuss introduction of facets [2] for the Oak query engine.
>>>
>>> Pros: having facets in search results usually helps filtering (drill down)
>>> the results before browsing all of them, so the main usage would be for
>>> client code.
>>>
>>> Impact: probably change / addition in both the JCR and Oak APIs to support
>>> returning other than "just nodes" (a NodeIterator and a Cursor
>>> respectively).
>>>
>>> Right now a couple of ideas on how we could do that come to my mind, both
>>> based on the approach of having an Oak index for them:
>>> 1. a (multivalued) property index for facets, meaning we would store the
>>> facets in the repository, so that we would run a query against it to have
>>> the facets of an originating query.
>>> 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
>>> faceting capabilities, which could "use" the Lucene index we already have,
>>> together with a "sidecar" index [3].
>>>
>>> What do you think?
>>> Regards,
>>> Tommaso
>>>
>>> [1] :
>>> http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
>>> Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
>>> [2] : http://en.wikipedia.org/wiki/Faceted_search
>>> [3] :
>>> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
>>> s/userguide.html
>>



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Lukas Smith <sm...@pooteeweet.org>.
Aloha,

you should definitely talk to the Hippo CMS developers. They forked Jackrabbit 2.x to add faceting as virtual nodes. They ran into some performance issues but I am sure they still have valuable feedback on this.

regards,
Lukas Kahwe Smith

> On 25 Aug 2014, at 18:43, Laurie Byrum <lb...@adobe.com> wrote:
> 
> Hi Tommaso,
> I am happy to see this thread!
> 
> Questions: 
> Do you expect to want to support hierarchical or pivoted facets soonish?
> If so, does that influence this decision?
> Do you know how ACLs will come into play with your facet implementation?
> If so, does that influence this decision? :-)
> 
> Thanks!
> Laurie
> 
> 
> 
>> On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> since this has been asked every now and then [1] and since I think it's a
>> pretty useful and common feature for search engine nowadays I'd like to
>> discuss introduction of facets [2] for the Oak query engine.
>> 
>> Pros: having facets in search results usually helps filtering (drill down)
>> the results before browsing all of them, so the main usage would be for
>> client code.
>> 
>> Impact: probably change / addition in both the JCR and Oak APIs to support
>> returning other than "just nodes" (a NodeIterator and a Cursor
>> respectively).
>> 
>> Right now a couple of ideas on how we could do that come to my mind, both
>> based on the approach of having an Oak index for them:
>> 1. a (multivalued) property index for facets, meaning we would store the
>> facets in the repository, so that we would run a query against it to have
>> the facets of an originating query.
>> 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
>> faceting capabilities, which could "use" the Lucene index we already have,
>> together with a "sidecar" index [3].
>> 
>> What do you think?
>> Regards,
>> Tommaso
>> 
>> [1] :
>> http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
>> Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
>> [2] : http://en.wikipedia.org/wiki/Faceted_search
>> [3] :
>> http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
>> s/userguide.html
> 

Re: [DISCUSS] supporting faceting in Oak query engine

Posted by Laurie Byrum <lb...@adobe.com>.
Hi Tommaso,
I am happy to see this thread!

Questions: 
Do you expect to want to support hierarchical or pivoted facets soonish?
If so, does that influence this decision?
Do you know how ACLs will come into play with your facet implementation?
If so, does that influence this decision? :-)

Thanks!
Laurie



On 8/25/14 7:08 AM, "Tommaso Teofili" <to...@gmail.com> wrote:

>Hi all,
>
>since this has been asked every now and then [1] and since I think it's a
>pretty useful and common feature for search engine nowadays I'd like to
>discuss introduction of facets [2] for the Oak query engine.
>
>Pros: having facets in search results usually helps filtering (drill down)
>the results before browsing all of them, so the main usage would be for
>client code.
>
>Impact: probably change / addition in both the JCR and Oak APIs to support
>returning other than "just nodes" (a NodeIterator and a Cursor
>respectively).
>
>Right now a couple of ideas on how we could do that come to my mind, both
>based on the approach of having an Oak index for them:
>1. a (multivalued) property index for facets, meaning we would store the
>facets in the repository, so that we would run a query against it to have
>the facets of an originating query.
>2. a dedicated QueryIndex implementation, eventually leveraging Lucene
>faceting capabilities, which could "use" the Lucene index we already have,
>together with a "sidecar" index [3].
>
>What do you think?
>Regards,
>Tommaso
>
>[1] :
>http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
>Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
>[2] : http://en.wikipedia.org/wiki/Faceted_search
>[3] :
>http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
>s/userguide.html