You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/05/07 13:03:03 UTC

[jira] [Created] (LUCENE-3079) Facetiing module

Facetiing module
----------------

                 Key: LUCENE-3079
                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless


Faceting is a hugely important feature, available in Solr today but
not [easily] usable by Lucene-only apps.

We should fix this, by creating a shared faceting module.

Ideally, we factor out Solr's faceting impl, and maybe poach/merge
from other impls (eg Bobo browse).

Hoss describes some important challenges we'll face in doing this
(http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:

{noformat}
To look at "faceting" as a concrete example, there are big the reasons 
faceting works so well in Solr: Solr has total control over the 
index, knows exactly when the index has changed to rebuild caches, has a 
strict schema so it can make sense of field types and 
pick faceting algos accordingly, has multi-phase distributed search 
approach to get exact counts efficiently across multiple shards, etc...
(and there are still a lot of additional enhancements and improvements 
that can be made to take even more advantage of knowledge solr has because 
it "owns" the index that we no one has had time to tackle)
{noformat}

This is a great list of the things we face in refactoring.  It's also
important because, if Solr needed to be so deeply intertwined with
caching, schema, etc., other apps that want to facet will have the
same "needs" and so we really have to address them in creating the
shared module.

I think we should get a basic faceting module started, but should not
cut Solr over at first.  We should iterate on the module, fold in
improvements, etc., and then, once we can fully verify that cutting
over doesn't hurt Solr (ie lose functionality or performance) we can
later cutover.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057599#comment-13057599 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

I opened LUCENE-3261 and LUCENE-3260 to track userguide + benchmark issues. Will update more as we go along.

bq. I think we should close this issue

+1.

We can now say Lucene has a faceting module ! Perhaps we should advertise it on the user-list?

Great job at porting to trunk Robert !

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057148#comment-13057148 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

I re-ran the edge case 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) to drill down to level 6 (d6) and to level 1 (d1). A test for the path 'root/a/b' as starting point for the facet request was added. I also checked that indexing does indeed run with 48MB for LUCENE-2369, but again this is just plain Lucene indexing.

||                             || LUCENE-3079 (d6) || LUCENE-3079 (d1) || LUCENE-3079 (d1+'deep/a/b') || LUCENE-2369 (d6) ||
| Index build time             |  752 s            |   771 s           |    -           |     238 s         |
| Memory required for indexing | 2500 MB           | 2500 MB           |   2500 MB           |      48 MB        |
| First facet request          | 3840 ms           | 1929 ms           |   1963 ms           | 147,000 ms        |
| Best of 5 requests           | 2688 ms           | 1172 ms           |   1246 ms           |    2673 ms        |
| Memory usage after faceting  |  435 MB           |  435 MB           |   435 MB           |     294 MB        |

Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' was even faster, but I guess I'll have to dig into the code to understand why it was not.

bq. It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

LUCENE-2369 uses exactly the same counting strategy (brute counting of everything). Not having explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very clearly shows that it is faster overall to skip the drill-down in the extraction phase. At least for this test (and then we're back to creating a proper test suite).

Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the docs and lowering the paths: 100M documents, 0.84 paths/doc, max depth 4 (2.5M unique paths)
||                             || LUCENE-3079 (d4) || LUCENE-3079 (d1) || LUCENE-2369 (d6) ||
| Index build time             |   28 min          |    -              |     17 min        |
| Memory required for indexing |    ? MB           |    ? MB           |      ? MB         |
| First facet request          |13933 ms           |13367 ms           | 46,000 ms         |
| Best of 5 requests           |11718 ms           |11036 ms           |   2989 ms         |
| Memory usage after faceting  |  240 MB           |  240 MB           |    475 MB         |

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toke Eskildsen updated LUCENE-3079:
-----------------------------------

    Attachment: TestPerformanceHack.java

A quick hack to get an idea of performance characteristics for hierarchical faceting.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055443#comment-13055443 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

Thanks Toke for the pointer. I think it's very interesting. We've actually explored in the past storing just the category/leaf, instead of the entire hierarchy, in the document. The search response time was much slower than what I reported above (nearly 2x slowdown). While storing the entire hierarchy indeed consumes more space, it is more performing at search time, and we figure that space today is cheap, and usually search apps are more interested in faster search response times and are willing to spend some more time at indexing and analysis stages.

Nevertheless, the link you provided proposes an interesting way to manage the hierarchy, and I think it's worth exploring at some point. Could be that it will perform better than how we managed it when we indexed just the leaf category for each document. We'd also need to see how to update the taxonomy on the go. For example, it describes that for A/B/C you know that its level is 3 (that's easy) and that the previous category/tag that matches (P) is A. But what if at some point A/B is added to a document? What happens to the data indexed for the doc w/ A/B/C, which now its previous matching category is A/B? It's not clear to me, but could be that I've missed the description in the proposal.

I am very close to uploading the patch. Hopefully I'll upload it by the end of my day.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056427#comment-13056427 ] 

Simon Willnauer commented on LUCENE-3079:
-----------------------------------------

Shai, MASSIVE PATCH! phew I don't want to review this entire thing in a patch really but at a first glance this looks very good. Lots of tests, documentation etc. and obviously this has been used in production so its seen some cpu cycles :) I think roberts proposed way is good!

{quote}
+1

My suggestion:

commit to branch 3.x with @experimental.
next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs.
finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.
{quote}

here is my +1

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056972#comment-13056972 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

bq. I fully support the idea of a facet benchmarking issue - perhaps with an associated wiki page?

Yes. And on the Wiki page you should clearly describe the scenario that is tested, along with results for 'default config' + 'optimized config'. That way, a user coming to the page can pick the scenario that best matches his app and config the facets package (whatever we end up with) accordingly.

bq. Seen from the other side, there were an average of 1M*3.5/1.4M ~= 2.5 documents/tag

That's indeed an extreme case. We've seen it when an analytics module extracted facets automatically from documents (such as places, people etc.), and in that case the taxonomy was very 'flat' and wide.

bq. I am a bit confused about your protest on depth=5

I did not protest :). Actually, the common scenario is to count the immediate children and fetch the top-K. And I thought that that's what you do in LUCENE-2369. But counting all the way down is a valid scenario - and shows another reason why we should have a benchmark page with clear description.

bq. Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.

It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

bq. All this confusion supports the need for at coordinated effort to get some test cases with clear goals and realistic data.

Indeed. And we shouldn't pursue only 'realistic data', but edge cases too. As long as everything is clearly documented, it should be easy to interpret results.

I think that setting up facet benchmarking is more important than working on improving any implementation. Mostly because it will allow measuring the how much the improvements really improved. I'll open an issue for that.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056390#comment-13056390 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

bq. Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

It's not included. We wrote a proprietary (not extending Benchmark) test and would like to extend Lucene's benchmark for benchmarking facets. I think it will be interesting to define some facets on the Wikipedia collection, and benchmark it.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Chris Male (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055526#comment-13055526 ] 

Chris Male commented on LUCENE-3079:
------------------------------------

Great contribution Shai.  What about putting it into a branch? I think it really does need a thorough review before we put it into trunk.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055584#comment-13055584 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

{quote}
My suggestion:

1. commit to branch 3.x with @experimental.
2. next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs.
3. finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.
{quote}

I agree. I'll give it a day or two before I commit, unless everyone agree it can be committed today, in which case I'll happily press the button :).

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057148#comment-13057148 ] 

Toke Eskildsen edited comment on LUCENE-3079 at 6/29/11 11:02 AM:
------------------------------------------------------------------

I re-ran the edge case 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) to drill down to level 6 (d6) and to level 1 (d1). A test for the path 'root/a/b' as starting point for the facet request was added. I also checked that indexing does indeed run with 48MB for LUCENE-2369, but again this is just plain Lucene indexing.

||                             || LUCENE-3079 (d6) || LUCENE-3079 (d1) || LUCENE-3079 (d1+'deep/a/b') || LUCENE-2369 (d6) ||
| Index build time             |  752 s            |   771 s           |    -           |     347 s         |
| Memory required for indexing | 2500 MB           | 2500 MB           |   2500 MB           |      48 MB        |
| First facet request          | 3840 ms           | 1929 ms           |   1963 ms           | 147,000 ms        |
| Best of 5 requests           | 2688 ms           | 1172 ms           |   1246 ms           |    2673 ms        |
| Memory usage after faceting  |  435 MB           |  435 MB           |   435 MB           |     294 MB        |

Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' was even faster, but I guess I'll have to dig into the code to understand why it was not.

bq. It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

LUCENE-2369 uses exactly the same counting strategy (brute counting of everything). Not having explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very clearly shows that it is faster overall to skip the drill-down in the extraction phase. At least for this test (and then we're back to creating a proper test suite).

Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the docs and lowering the paths: 100M documents, 0.84 paths/doc, max depth 4 (2.5M unique paths)
||                             || LUCENE-3079 (d4) || LUCENE-3079 (d1) || LUCENE-2369 (d6) ||
| Index build time             |   28 min          |    -              |     17 min        |
| Memory required for indexing |    ? MB           |    ? MB           |      ? MB         |
| First facet request          |13933 ms           |13367 ms           | 46,000 ms         |
| Best of 5 requests           |11718 ms           |11036 ms           |   2989 ms         |
| Memory usage after faceting  |  240 MB           |  240 MB           |    475 MB         |

      was (Author: toke):
    I re-ran the edge case 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) to drill down to level 6 (d6) and to level 1 (d1). A test for the path 'root/a/b' as starting point for the facet request was added. I also checked that indexing does indeed run with 48MB for LUCENE-2369, but again this is just plain Lucene indexing.

||                             || LUCENE-3079 (d6) || LUCENE-3079 (d1) || LUCENE-3079 (d1+'deep/a/b') || LUCENE-2369 (d6) ||
| Index build time             |  752 s            |   771 s           |    -           |     238 s         |
| Memory required for indexing | 2500 MB           | 2500 MB           |   2500 MB           |      48 MB        |
| First facet request          | 3840 ms           | 1929 ms           |   1963 ms           | 147,000 ms        |
| Best of 5 requests           | 2688 ms           | 1172 ms           |   1246 ms           |    2673 ms        |
| Memory usage after faceting  |  435 MB           |  435 MB           |   435 MB           |     294 MB        |

Going to depth 1 helped a lot. I would have expected that requesting from 'deep/a/b' was even faster, but I guess I'll have to dig into the code to understand why it was not.

bq. It actually speeds up counting overall. If you think about it, when we encounter category ordinals, we just increment the count by 1 in the respective location in the count array. No need to ask whether this is an ordinal the user asked to count at all. Later when we compute the top-K, we know more efficiently while root ordinal the user requested to count, and its children, so it's just a matter of putting everything into a heap and returning the top-K.

LUCENE-2369 uses exactly the same counting strategy (brute counting of everything). Not having explicit duplicates speeds this phase up and saves memory, but the numbers for d6 and d1 very clearly shows that it is faster overall to skip the drill-down in the extraction phase. At least for this test (and then we're back to creating a proper test suite).

Just for kicks, I tried guesstimating the layout of a corpus with addresses by upping the docs and lowering the paths: 100M documents, 0.84 paths/doc, max depth 4 (2.5M unique paths)
||                             || LUCENE-3079 (d4) || LUCENE-3079 (d1) || LUCENE-2369 (d6) ||
| Index build time             |   28 min          |    -              |     17 min        |
| Memory required for indexing |    ? MB           |    ? MB           |      ? MB         |
| First facet request          |13933 ms           |13367 ms           | 46,000 ms         |
| Best of 5 requests           |11718 ms           |11036 ms           |   2989 ms         |
| Memory usage after faceting  |  240 MB           |  240 MB           |    475 MB         |
  
> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3079:
-------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])
    Fix Version/s: 4.0
                   3.4
         Assignee: Shai Erera
          Summary: Faceting module  (was: Facetiing module)

Edited the issue title (it had an extra 'i' in 'faceting'), as well as added fix versions etc. I intend to commit the patch to the 3x branch either later today or tomorrow.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055532#comment-13055532 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease off the review or refactoring process?

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055570#comment-13055570 ] 

Robert Muir commented on LUCENE-3079:
-------------------------------------

{quote}
I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.
{quote}

+1

My suggestion:
# commit to branch 3.x with @experimental.
# next, do a "fast" port to trunk, this doesnt mean heavy refactoring to take advantage of things like docvalues, just get it working correctly on trunk's APIs.
# finally, close this issue and do improvements as normal, backporting whichever ones are easy and make sense, like any other issue.


> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057445#comment-13057445 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

Thanks guys for doing this port so quickly. The patch looks good. I suggest that we change the 'nocommit' to TODO (Facet): and commit it (under modules/). Then we can iterate on the TODOs and resolve them one by one, in followup issues. Makes sense?

Robert, would you like to do the honors? :)

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056553#comment-13056553 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

You write LUCENE-3097 which is about "post group faceting", while this issue is LUCENE-3079. I assume you meant the latter, but want to confirm :). You also write LUCENE-2309 which is about decoupling IW from Analyzers. Are you perhaps referring to a Solr issue, or a different Lucene issue? If so can you please let me know which one?

This is a great test, and it matches more or less the test we've been running. Is it in 'benchmark' form? Can you post it on this issue so I can try the same?

What do you mean by "top 5 facets/tags"? If I were to speak of dimensions, where a dimensions is like "tags", "authors", "date", then do you mean you've requested to count 5 dimensions, or you indexed just one dimension (i.e. one "root") and requested to fetch the top-5 results for it? I assume it's the latter, but again, confirming my understanding.

So assuming I understood correctly the terminology and test setup, you execute one query which matches 50% of the documents and ask to count the top-5 facets under a single "root"/"dimension", and record the time as 'first facet request'. And then you execute it 4-5 additional times, and record 'best of 5 requests'. Do I understand it correctly?

One difference between the two approaches, assuming you're referring to a faceting approach that uses the FieldCache is that by default, the faceting approach here reads everything from disk. So it would be interesting to run w/ the facets-in-memory feature.

I don't know how to relate to the memory usage -- on the last test it consumed 50% less than the other approach, on the first it consumed nearly the same and on the second test it consumed 150% more. This is odd. Do you trust this measurement?

The 'first facet request' result is not surprising, because it takes time to warm up the FieldCache (assuming that's what you use).

I am interested in the memory observed for indexing because that too seems fluctuating? I.e., in the second test the difference is nearly x20 more, which is weird.

Also, the difference in indexing time is interesting too, as it too is not very consistent. And I find the x2 factor suspicious - would like to understand it better. Since trunk reports to improve indexing speed by a large factor (nearly 200%), I think it will be wise if we wait with this comparison until I bring the patch up w/ trunk.

I like it that you test the default behavior. I think it's very important that we have the greatest out-of-the-box experience. Since the two approaches read from disk/memory, I first would like to test the in-memory facets using this approach, so we can at least compare the same thing. I know that trunk plays some role here (definitely at indexing time), so we can focus on search time for now.

This is great stuff Toke !

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3079:
-------------------------------

    Attachment: facet-userguide.pdf

Committed revision 1141060 (3x). Will start the port to trunk.

I'm also attaching a userguide we wrote which should help newcomers get up to speed w/ the package. It is not meant to be an end-to-end cover of all the functionality and API, but rather as a complementary asset to the Javadocs, example code and source code itself.

I think it will be good if we check it in with the source, e.g. under contrib/facet/docs or something, in ODT (+PDF?) format, and that it will be included in the release binaries (i.e. along with the .jar).

What do you think?

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056696#comment-13056696 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

Oh Toke ... I just reviewed your test and there is a problem in it:

{code}
    CountFacetRequest facetRequest = new CountFacetRequest(
        new CategoryPath(HIERARCHICAL), num);
    facetRequest.setDepth(5);
{code}

You create a CountFacetRequest, requesting to count HIERARCHICAL (which is the root), and fetch the top <num>, which is 5. BUT, you set the depth of the request to 5, which means it will compute the top-5 categories at each level !

This is a nice feature of the package, which lets you get not only the top-N child nodes of the immediate "root", but also the top-N of their child nodes and it's applied recursively until 'depth'. This is a nice feature, but not very performance friendly :).

Can you please rerun the test then, commenting out that line? I can run it on my laptop, but I don't have the env. setup w/ the patch from LUCENE-2369, so I cannot compare.

In general, this looks like a very useful test. I think we can commit it too, but rename it so that it doesn't run regularly w/ our tests, but rather selectively.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057599#comment-13057599 ] 

Shai Erera edited comment on LUCENE-3079 at 6/30/11 5:55 AM:
-------------------------------------------------------------

I opened LUCENE-3261 and LUCENE-3262 to track userguide + benchmark issues. Will update more as we go along.

bq. I think we should close this issue

+1.

We can now say Lucene has a faceting module ! Perhaps we should advertise it on the user-list?

Great job at porting to trunk Robert !

      was (Author: shaie):
    I opened LUCENE-3261 and LUCENE-3260 to track userguide + benchmark issues. Will update more as we go along.

bq. I think we should close this issue

+1.

We can now say Lucene has a faceting module ! Perhaps we should advertise it on the user-list?

Great job at porting to trunk Robert !
  
> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera resolved LUCENE-3079.
--------------------------------

    Resolution: Fixed

Faceting module in 3.x and trunk, tests pass, opened follow up issues. I think we can close this.

Thanks for everyone for helping get this in so quickly !

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053886#comment-13053886 ] 

Jan Høydahl commented on LUCENE-3079:
-------------------------------------

Bravo Shai & IBM!

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057399#comment-13057399 ] 

Robert Muir commented on LUCENE-3079:
-------------------------------------

the previous fail is somehow a bug in memorycodec (the seed randomly selected it):
ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtests.codec=Memory


> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056687#comment-13056687 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

About the TaxonomyWriterCache, looking at its API now, I think that an FST-based TWC might be a good fit here? FST is known for its performance and low RAM consumption, and TWC maps from a CategoryPath (or String) to an integer, which sounds like a typical usage for FST. So we can have both the keep-all-in-RAM-TWC and LRU use FST and consume less memory.

I'll open an issue for that.

One small correction - TWC is used only at indexing time, mapping from category->ordinal. For labeling ordinals you use TaxonomyReader which maintains its own int->String cache (LRU). Can FST aid in that case as well? I assume it will consume less space than an Integer->String hash map.

-------------

Back to performance -- Toke, did you verify that you get the same top-5 categories from both implementations? Also, can you try running the test asking for top-5 categories of a node below the root? I.e., if the paths are in the form /a/b/c/d and you request to count "/a", then I'm interested in how this performs if you ask to count "/a/b".

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3079:
-------------------------------

    Attachment: LUCENE-3079-dev-tools.patch

Thanks Robert for the fix. This indeed looks better than patching LTC !

Patch for dev-tools only, this time w/ Maven 
support too. I hope it works well :).

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Facetiing module

Posted by "Stefan Trcek (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Trcek updated LUCENE-3079:
---------------------------------

    Attachment: LUCENE-3079.patch

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053940#comment-13053940 ] 

Ryan McKinley commented on LUCENE-3079:
---------------------------------------

bq. Bravo Shai & IBM!

+1!  This sounds awesome, and I hope will prove how modules will help lucene *and* solr

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056585#comment-13056585 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

Shai: I completely messed up the JIRA-numbers, clearly I need to go home and cool my brain. It is fixed now. Sorry for the inconvenience.

Yes, I only created a single root (one dimension) and requested the top-5 facets. You understood the timing measurements correctly. I am sorry that my table was confusing with regards to memory. The first numbers was the Xmx required for index build (binary search until I got bored), while the second if what the JVM reported after the faceting calls were finished. For LUCENE-2369 in the middle test, the faceted search required more memory (aka higher Xmx) than index build (which could probably have gotten by with even less).

I do not use field cance for LUCENE-2369. It holds a compressed list of ordinals for tags for the documents in memory, with a few levels of indirections to handle doublettes. The startup time is basically due to doublette elimination.

Regarding the memory difference, LUCENE-2369 does not operate at index-time. This means that is it plain Lucene indexing of terms like hierarchy:a/b/c/d. Actually I am surprised that it took 128MB for the larger test and I should probably re-run that with a lower allocation.

My guesstimage, based purely on observation, is that LUCENE-3079 requires heap relative to the taxonomy size at indexing time. At least with the (assumedly default) settings I used. Thus the 22M unique values in test #2 is the cause for the large memory requirement. Looking at the number of unique tags vs. index memory requirements for case #1 and #2, the factor seems nearly linear. It seems to fit your recommendation of splitting on large taxonomies?

I'll upload my test class for LUCENE-3079 now. I apologize for its hackish nature - this was just meant as explorative work.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Chris Male (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053587#comment-13053587 ] 

Chris Male commented on LUCENE-3079:
------------------------------------

I don't think any Facet module needs to be concerned with Schemas.  Instead the module can expose an API which asks for the information it needs to make the best choices.  Solr can then provide that information based on its Schema, pure Lucene users can do it however they want. 

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055532#comment-13055532 ] 

Shai Erera edited comment on LUCENE-3079 at 6/27/11 1:24 PM:
-------------------------------------------------------------

We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease off the review or refactoring process?

I guess what I'm aiming for is for our users to get this feature soon. And I'm afraid that putting it in a branch will only delay it.

      was (Author: shaie):
    We can put it in a branch for trunk, in case we plan to refactor the code right away (at first I just thought to get it to compile against trunk). I thought that at first people would like to get hands on experience with it, before we discuss changes and refactoring. I mean, this code can really be released with Lucene's next 3x release. And since everything is @lucene.experimental, and is in its own separate contrib/module, I don't think a branch will ease off the review or refactoring process?
  
> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3079:
--------------------------------

    Attachment: LUCENE-3079.patch

updated patch: all tests pass.

I changed the nocommits to TODO (Facet):'s, and added verbage for the reason to each one.

we also have two TODOs for two bugs (The MTE.seekExact and SepCodec hasPayload bug) that we should fix, but currently we have workarounds in place (when we fix these bugs we can then remove the workarounds).

I'll svn move to modules, and doublecheck things like javadoc warnings, and commit later today.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053847#comment-13053847 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

I would like to contribute IBM's faceted search package (been wanting to do that for quite a while). The package supports the following features/capabilities (at a high-level):

* Taxonomy index -- manages trees of 'categories'. You can view example of taxonomies at e.g. the Open Directory Project.
** It's a Lucene index managed alongside the content index.
** Builds the taxonomy on-the-fly (i.e. as categories are discovered).
** In general it maps a category hierarchy to ordinals (integers). For example, the category /date/2011/06/24 will create the following entry in the taxonomy index:
*** /date, ordinal=1
*** /date/2011, ordinal=2
*** /date/2011/06, ordinal=3
*** /date/2011/06/24, ordinal=4

* FacetsDocumentBuilder which receives a list of categories that are associated w/ the document (can be of several dimensions) and:
** Fetches the ordinals of the category components from the taxonomy index (adding them to it on-the-fly).
** Indexes them in a (compressed) payload for the document (so for the above category example, 4 payloads will be indexed for the document).
** FDB can be used to augment a Document with other fields for indexing (it adds its own Field objects).

* FacetsCollector receives a handle to the taxonomy, a list of facet 'roots' to count and returns the top-K categories for each requested facet:
** The root can denote any node in the category tree (e.g., 'count all facets under /date/2011')
** top-K can be returned for the top most K immediate children of root, or any top-K in the sub-tree of root.

* Counting algorithm (at a high-level):
** Fetch the payload for every matching document.
** Increment by 1 the count of every ordinal that was encountered (even for facets that were not requested by the user)
** After all ordinals are counted, compute the top-K on the ones the user requested
** Label the result facets

* Miscellaneous features:
** *Sampling* algorithm allows for more efficient facets counting/accumulation, while still returning exact counts for the top-K facets.
** *Complements* algorithm allows for more efficient facets counting/accumulation, when the number of results is > 50% of the docs in the index (we keep a total count of facets, count facets on the docs that did not match the query and subtract).
*** Complements can be used to count facets that do not appear in any of the matching documents (of this result set). This does not exist in the package though ... yet.
** *Facets partitioning* -- if the taxonomy is huge (i.e. millions of categories), it is better to partition them at indexing time, so that search time is faster and consumes less memory. Note that this is required because of the approach of counting all (allocating a count array) and then keeping only the results of interest.
** *Category enhancements* allow storing 'metadata' with categories in the index, so that more than simple counting can be implemented:
*** *weighted facets* (built on top of enhancements) allows associating a weight w/ each category, and use smarter counting techniques at runtime. For example, if facets are generated by an analytics component, the confidence level can be set as the category's weight. If tags are indexed as facets (for e.g. generating a tag cloud for the result set), the number of times the document was tagged by the tag can be set as the tag's weight.
** That that facets are indexed in the payloads of documents allows managing very large taxonomies and indexes, without blowing up the RAM at runtime (but incur some search performance overhead). However, the payloads can be loaded up to RAM (like in FieldCache) in which case runtime becomes much faster.
*** However facets are stored is abstracted though by a CountingList API, so we can definitely explore other means of storing the facet ordinals. Actually, the CountingList API allows us to read the ordinals from disk or RAM w/o affecting the rest of the algorithm at all.
** I did not want to dive too deep on the API here, but the runtime API is very extensible and allows one to use FacetsCollector for the simple cases, or lower-level API to get more control on the process. You can look at: FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, FacetsAccumulator, FacetsAggregator for a more extensive set of API to use.

* The package comes with example code which shows how to use the different features I've mentioned. There are also unit tests for ensuring the example code works :).

* The package comes with a very extensive tests suite and is in use by many of our products for a long time, so I can state that it's very stable.

* Some rough performance numbers:
** Collection of 1M documents, few hierarchical facets per document (we call it the 'Amazon' case)
** Query matches 49.8% of the docs (~500K)
** No sampling / complements were used.
** Facets read from disk
** Takes 0.4 seconds to execute the query and count facets.

It will take me a few days to prepare a patch (lots of code) - will upload it as soon as it's ready.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057468#comment-13057468 ] 

Robert Muir commented on LUCENE-3079:
-------------------------------------

Committed revision 1141246.

I think we should close this issue soon, and open followup issues?
Maybe just start with a separate issue for the documentation guide?

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053591#comment-13053591 ] 

Jason Rutherglen commented on LUCENE-3079:
------------------------------------------

bq. I don't think any Facet module needs to be concerned with Schemas

Right, they should be field type aware.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056785#comment-13056785 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

I must admit that my choices of tests were more aimed at probing edge cases than simulating real taxonomies. Nevertheless, it is good to hear that LUCENE-3079 can be tweaked to handle it. I fully support the idea of a facet benchmarking issue -  perhaps with an associated wiki page?

As for the 1M test case, the number of tags/documents were random with the average being the stated 3.5 tags/document. Seen from the other side, there were an average of 1M*3.5/1.4M ~= 2.5 documents/tag.

I did verify the facet-results and they did fit my expectations. I will try requesting from further down the tree later - hopefully tomorrow. I am a bit confused about your protest on depth=5, but I suspect that we have different ideas of what is relevant when issuing a hierarchical request. The API states that specifying a depth of 5 will count all sub-tags until depth 5. I used the number 5 to effectively count all the way to the bottom (whoops! It should be 6 for the second case. That might explain why LUCENE-3079 was faster than LUCENE-2369 in that one as LUCENE-2369 counted to the bottom). The reason for the complete counting was that I implicitly found this to be the "correct" behavior, internally visioning a taxonomy of species or something similar, with the wish to get the number of unique elements at the finest level.

Thinking about this, I now have a better understanding of the duplication of data by indexing all levels of the paths. This speeds up shallow counting tremendously.

All this confusion supports the need for at coordinated effort to get some test cases with clear goals and realistic data.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055402#comment-13055402 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

This is quite another design than the quarter-baked one I've proposed with SOLR-2412 (which is really just a thin wrapper around LUCENE-2369). While maintaining a sidecar index makes the workflow more complicated, I would expect that it is beneficial for re-open speed and scalability.

Technical note: For hierarchical faceting, I find that it is possible to avoid storing all levels in the hierarchy. By maintaining two numbers for each tag, denoting the tag-level and the level for the previous tag that matches, only the relevant tags needs to be indexed (full explanation at https://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/).

Kudos for contributing solid code. I am looking forward to seeing the patch.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056517#comment-13056517 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags was requested. Hopefully this is somewhat similar to your test.

||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  52 s |  23 s |
| Memory required for indexing | 192 MB | 48 MB |
| First facet request | 432 ms | 12,000 ms |
| Best of 5 requests | 228 ms | 159 ms |
| Memory usage after faceting (after gc()) | 21 MB | 22 MB |

Upping the ante to 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) resulted in

||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  752 s |  238 s |
| Memory required for indexing | 2500 MB | 128 MB |
| First facet request | 3370 ms | 147,000 ms |
| Best of 5 requests | 2400 ms | 2673 ms |
| Memory usage after faceting | 435 MB | 294 MB |

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in
||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  5317 ms |  2563 ms |
| Memory required for indexing | 48 MB | 32 MB |
| First facet request | 245 ms | 1425 ms |
| Best of 5 requests | 15 ms | 8 ms |
| Memory usage after faceting | 1 MB | 2 MB |

Some observations: It seems clear that some trade-offs are very different for the two solutions. LUCENE-3097 has brilliant startup time and slows analyzing time a bit through the whole indexing process. LUCENE-2309 is dog slow at startup but does not impact indexing. They seem similar with regards to search-time speed and memory usage.

Now, LUCENE-2309 patches some semi-random Lucene-4, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing etc. When LUCENE-3097 patches trunk, a proper comparison can be made.

I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3097 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers and a quick profile told me that it was mainly used for int-arrays. Does that sound right?

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056673#comment-13056673 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

bq. resulting in 1.4M unique paths

AND

bq. LUCENE-3079 requires heap relative to the taxonomy size at indexing time

Ok now I see what's happening. The taxonomy index maintains a category cache, which maps a category to its ordinal. It's maintained at both indexing and search time. At indexing time, it's used to quickly determine if an incoming category already exists, and if so returns its ordinal. At search time its used to quickly label ordinals, as well as determining hierarchy information.

The default settings of LuceneTaxonomyWriter uses Cl2oTaxonomyWriterCache, which maintains a mapping of *all* categories to their ordinal, although the mapping is maintained compact. There is another cache LruTaxonomyWriterCache which as its javadocs states "good choice for huge taxonomies".

The taxonomy you create is HUGE by all standards (and I'm not speaking on the 22M case :)). The largest 'normal' taxonomy we've seen was in the order of few 100K nodes (when people names were maintained, and social tags), while the largest 'abnormal' taxonomy we've seen contained ~5M nodes, and that is considered a very extreme case for taxonomies.

Just to be clear, I'm not trying to make excuses. Your perf test is still great in that it tests an extreme case. But I'd expect an extreme case to require some mods to the defaults, and using LruTWC is one of them (w/ a cacheSize of let's say 100/200K). Another thing I've mentioned this package can do is the partitions -- because at runtime we allocate an array the size of the taxonomy, in your case (1.4M nodes), we'll create an array that is ~6MB size for every query. While if you use partitions, and partition the categories into 10/100K-categories buckets, you'd allocate less RAM, but might incur some search performance overhead.

Out of curiosity, w/ 1.4M unique paths, and 1M docs, how many categories are assigned to each document and how many documents are associated w/ each category? When we ran our test, we used a Zipf distribution for categories. If in this test we end up associating only a couple of documents per category, then this is not a too realistic scenario. And while the package can handle it, by not using defaults, perhaps we should define a scenario that makes sense (a common one that is) and run with it?

I don't think there can be one "right" faceted search solution, but rather a collection of tools that can match different scenarios. And if it turns out that for one case one implementation is better than another, then our job will be to create a faceted search layer which allows the user to choose what's best for him, and leave the rest of his app code unmodified.

What do you think? Perhaps we should open a separate issue, let's call it "facet benchmarking", where we define some scenarios, work on extending the benchmark package (we have done some preliminary work there) and then compare few approaches?

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055480#comment-13055480 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

SOLR-2412/LUCENE-2369 were created with the trade-offs (relatively) long startup, low memory, high performance: When the index is (re)opened, the hierarchy is analyzed by iterating the terms (it could be offloaded to index-time, but it is still iterate-the-entire-term-list after each change). This does not play well with real-time, but should be a nice fit for large indexes with low update rate.

As for speed, my theory is that the sparser hierarchy (only the concrete paths) wins due to less counting, but without another solution to compare against it has so far remained a theory. There are some measurements at https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ but I find that for hierarchical faceting, small changes to test-setups can easily have vast implications on performance, so they are not comparable to your million-document test.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Facetiing module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3079:
--------------------------------

    Attachment: LUCENE-3079.patch

just some trivial test modifications so the tests work with an unmodified LuceneTestCase:

* in some cases, if an assertion failed it would print the seed... but LTC does this.
* in other tests, the test wanted to repeat a random sequence, but instead of exposing LTC internals, the test just grabs random.nextLong, makes a new Random from this, and then resets it with .setSeed.


> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Stefan Trcek (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053361#comment-13053361 ] 

Stefan Trcek commented on LUCENE-3079:
--------------------------------------

This patch was generated by git and tested to apply with
patch -p0 -i LUCENE-3079.patch --dry-run
Be patient if anything went wrong.

Review starting points may be
- FacetSearcherTest.testSimpleFacetWithIndexSearcher() or
- FacetSearcher.facetCollectSearch()

Functions.java may be  dismissed in favor of Guava.
If you are willing to keep it I'll strip it down to the required parts.

----------------------------------------------------------------------

The implementation relies on field cache only, no index scheme, no 
cached filters etc. It supports
- single valued facets (Facet.java)
- multi valued facets (Facet.MultiValued.java)
- facet filters (see FacetSearcher.java)
- evaluation of facet values that would dismiss due to other facet 
filters (Yonik says Solr calls this "multi-select faceting").
(realized by FacetSearcher.fillFacetsForGuiMode())

Let me explain the last point: For the user a facet query
  (color==green) AND (shape==circle OR shape==square)
may look like

Facet color
[ ] (3) red
[x] (5) green
[ ] (7) blue

Facet shape
[x] (9) circle
[ ] (4) line
[x] (2) square

The red/blue/line facet values will display even though the 
corresponding documents are not in the result set. Also there is 
support for filtered facet values with zero results, so users 
understand why they do not get results.


> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3079:
--------------------------------

    Attachment: LUCENE-3079_4x.patch

updated patch, one of the fails was caused by my use of seekExact, i think at least for MultiTermsEnum, if you seekExact, and even if it finds your term, its not positioned correctly, so if you then call next() its unsafe.

another one of the fails, is calling getPayload() before hasPayload() will not work with SepCodec.

thanks to mike for helping track some of these down. 

i also added to build.xml the logic to depend on the analyzers module, now all tests pass (some of the time).

but i have at least one random fail:NOTE: reproduce with: ant test -Dtestcase=FacetsPayloadProcessorProviderTest -Dtestmethod=testTaxonomyMergeUtils -Dtests.seed=3732021887561370529:1102439953879128238


> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053580#comment-13053580 ] 

Yonik Seeley commented on LUCENE-3079:
--------------------------------------

bq. if Solr needed to be so deeply intertwined with caching, schema, etc., other apps that want to facet will have the same "needs"

Sort of an aside, but not really.... specific applications are much easier.  A lot more indirection is required in Solr and a schema is needed for pretty much everything.  Without the schema, a client would specify "sort=foo desc" and Solr would have no idea how to do that.  A specific application just does it because they have the knowledge of what all the fields are.  It's why people have gotten along just fine without a schema in Lucene thus far.  If you're building another Solr... yes, you need something like a schema.



> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3079:
--------------------------------

    Attachment: LUCENE-3079_4x_broken.patch

here's a hack patch, tried to quickly move this thing to trunk apis... 3 out of the 250 tests fail though, I'm 99% positive i jacked something up in the taxonomywriter (this is the most complicated one to convert), so maybe that one should just be started over.

but maybe some of the patch (except taxonomywriter, again i think its broken) would be useful in getting it ported to 4.x

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Faceting module

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057428#comment-13057428 ] 

Michael McCandless commented on LUCENE-3079:
--------------------------------------------

bq. the previous fail is somehow a bug in memorycodec (the seed randomly selected it)

I just committed a fix for this; it was because .getPayload() in MemoryCodec was (incorrectly) assuming caller did not change the .bytes of the returned BytesRef between calls.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079_4x.patch, LUCENE-3079_4x_broken.patch, TestPerformanceHack.java, facet-userguide.pdf
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3079) Faceting module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056517#comment-13056517 ] 

Toke Eskildsen edited comment on LUCENE-3079 at 6/28/11 3:17 PM:
-----------------------------------------------------------------

Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags was requested. Hopefully this is somewhat similar to your test.

||            || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  52 s |  23 s |
| Memory required for indexing | 192 MB | 48 MB |
| First facet request | 432 ms | 12,000 ms |
| Best of 5 requests | 228 ms | 159 ms |
| Memory usage after faceting (after gc()) | 21 MB | 22 MB |

Upping the ante to 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) resulted in

||            || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  752 s |  238 s |
| Memory required for indexing | 2500 MB | 128 MB |
| First facet request | 3370 ms | 147,000 ms |
| Best of 5 requests | 2400 ms | 2673 ms |
| Memory usage after faceting | 435 MB | 294 MB |

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in
||            || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  5317 ms |  2563 ms |
| Memory required for indexing | 48 MB | 32 MB |
| First facet request | 245 ms | 1425 ms |
| Best of 5 requests | 15 ms | 8 ms |
| Memory usage after faceting | 1 MB | 2 MB |

Some observations: It seems clear that some trade-offs are very different for the two solutions. LUCENE-3079 has brilliant startup time and slows analyzing time a bit through the whole indexing process. LUCENE-2369 is dog slow at startup but does not impact indexing. They seem similar with regards to search-time speed and memory usage.

Now, LUCENE-2369 patches some semi-random Lucene-4, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing etc. When LUCENE-3079 patches trunk, a proper comparison can be made.

I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3079 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers and a quick profile told me that it was mainly used for int-arrays. Does that sound right?

      was (Author: toke):
    Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags was requested. Hopefully this is somewhat similar to your test.

||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  52 s |  23 s |
| Memory required for indexing | 192 MB | 48 MB |
| First facet request | 432 ms | 12,000 ms |
| Best of 5 requests | 228 ms | 159 ms |
| Memory usage after faceting (after gc()) | 21 MB | 22 MB |

Upping the ante to 5M documents, 6.8 paths/docs, max depth 6 (22M unique paths) resulted in

||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  752 s |  238 s |
| Memory required for indexing | 2500 MB | 128 MB |
| First facet request | 3370 ms | 147,000 ms |
| Best of 5 requests | 2400 ms | 2673 ms |
| Memory usage after faceting | 435 MB | 294 MB |

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in
||            || LUCENE-3097 || LUCENE-2309 ||
| Index build time |  5317 ms |  2563 ms |
| Memory required for indexing | 48 MB | 32 MB |
| First facet request | 245 ms | 1425 ms |
| Best of 5 requests | 15 ms | 8 ms |
| Memory usage after faceting | 1 MB | 2 MB |

Some observations: It seems clear that some trade-offs are very different for the two solutions. LUCENE-3097 has brilliant startup time and slows analyzing time a bit through the whole indexing process. LUCENE-2309 is dog slow at startup but does not impact indexing. They seem similar with regards to search-time speed and memory usage.

Now, LUCENE-2309 patches some semi-random Lucene-4, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing etc. When LUCENE-3097 patches trunk, a proper comparison can be made.

I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3097 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers and a quick profile told me that it was mainly used for int-arrays. Does that sound right?
  
> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Jason Rutherglen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053581#comment-13053581 ] 

Jason Rutherglen commented on LUCENE-3079:
------------------------------------------

Schemas should probably be a module that makes use of embedding the field types per-segment, this is something the faceting module could/should use.  I think is what LUCENE-2308 is aiming for?  Though I thought there was another Jira issue created by Simon for this as well.

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Faceting module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3079:
-------------------------------

    Component/s: modules/facet

Created component modules/facet

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3079) Facetiing module

Posted by "Shai Erera (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-3079:
-------------------------------

    Attachment: LUCENE-3079.patch

Attached patch includes the faceted search module. It currently compiles 
against 3x, so I've put it under lucene/contrib, but after we port it to 
trunk, it should be under modules/.

There isn't a short way to describe the content of the patch (as you can 
see, it's huge), so instead I'll give a brief overview of some key 
packages: 

* src/examples: contains code of different capabilities of the facets module. I'd start w/ examples/simple.
* src/test: contains many tests, great place to start too
* o.a.l.facet.taxonomy contains the taxonomy index management code. There are two interfaces TaxonomyWriter/Reader with a LuceneTW/TR impl
* o.a.l.facet.index contains the indexing code of the different capabilities (e.g. simpl, enhancements etc.)
* o.a.l.facet.search contains the respective search code

Few points:
* I've put the ASL on all files.
* Marked all code as @lucene.experimental.
* After you apply the patch you can run 'ant eclipse' and it will build the facets code in eclipse (no maven integration yet - will need to look into it)
* Under o.a.l and o.a.l.util there are several utility classes that are not specific to facetted search, however the facets code uses them. I've kept them there so that we can review and decide whether we want to move them to lucene-core at some point.

TODOs:
* After it's on trunk, I think we should explore replacing the payloads w/ DocValues.
* Leverage Lucene's superb random testing framework !
* There are few TODOs in the code which I think can be addressed following this issue.

I will open follow-on issues for those.

Given the amount of code, I am wondering if perhaps we should commit it 
as-is, and do more thorough reviews afterwards. The code does not modify 
any existing code (aside from a tiny change to LuceneTestCase), so I 
think there's no risk in doing so. I am also not sure that it's sane to 
review that amount of code (nearly 40K lines) in patch form. What do you 
think?

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3079) Facetiing module

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056377#comment-13056377 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

The patch compiles neatly against a 3x stable checkout and the examples worked fine. Unfortunately I could not locate the million document test you mentioned above. Is it part of the patch?

> Facetiing module
> ----------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org