You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2011/06/30 05:23:28 UTC

[jira] [Created] (LUCENE-3262) Facet benchmarking

Facet benchmarking
------------------

                 Key: LUCENE-3262
                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
             Project: Lucene - Java
          Issue Type: New Feature
          Components: modules/benchmark, modules/facet
            Reporter: Shai Erera


A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).

Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.

We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.

We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123714#comment-13123714 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. I reduced those to 1-20 per document with depth of 1-3 and got results I could live with.

I agree, tried this too now and the comparison is more reasonable. 
Perhaps what are reasonable numbers (for #facets/doc and their depth) is debatable, but I agree that 200 facets per document is too many.

Changing the defaults to 20/3 and preparing to commit. 
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122598#comment-13122598 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Actually, since the doc is created at setup() it is sufficient to make the doc protected (was private). Also that with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but modified its default to true.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123742#comment-13123742 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

After manually removing benchmark/{work,temp} reuters collection was correctly extracted and in trunk the alg runs same as in 3x.
Committed to trunk: r1180674.
Keeping issue open to handle faceted search benchmarking.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123134#comment-13123134 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. Someone can use AddFacetedDocTask w/ and w/o facets? What for? 

It is useful for specifying the property like this:

{code}
with.facets=facets:true:false
...
{ "MAddDocs" AddFacetedDoc > : 400
{code}

and then getting in the report something like this:

{noformat}
------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 42)
Operation    round facets   runCnt   recsPerRun        rec/s  elapsedSec
MAddDocs_400     0   true        1          400       246.61        1.62
MAddDocs_400 -   1  false -  -   1 -  -  -  400 -   1,801.80 -  -   0.22
MAddDocs_400     2   true        1          400       412.80        0.97
MAddDocs_400 -   3  false -  -   1 -  -  -  400 -   2,139.04 -  -   0.19
{noformat}
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120003#comment-13120003 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

I am working on a patch for this, much in the lines of the Solr benchmark patch in SOLR-2646.
Currently the direction is:

- Add to PerfRunData:
-- Taxonomy Directory
-- Taxonomy Writer
-- Taxonomy Reader

- Add tasks for manipulating facets and taxonomies:
-- create/open/commit/close Taxonomy Index
-- open/close Taxonomy Reader
-- AddDocWith facets

- FacetDocMaker will also build the categories into the document
- FacetSource will bring back categories to be added to current doc

- ReadTask will be extended to also support faceted search.
  This is different from the Solr benchmark approach, where a SolrSearchTask is not extending ReadTask but rather extending PerfTask.
  Not sure yet if this is the way to go - still work to be done here.

Should have a start patch in a day or two.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Assigned] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen reassigned LUCENE-3262:
-----------------------------------

    Assignee: Doron Cohen
    
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123735#comment-13123735 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Committed to 3x in r1180637, thanks Gilad!
Now porting to trunk, it is more involved than anticipated, because of contrib/modules differences.
Managed to make the tests pass, and the benchmark alg of choice to run.
However I noticed that in 3x that alg - when indexing reuters - added the entire collection, that is 21578 docs, while in trunk it only added about 400 docs.  Might be something in my set-up, digging...
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:
--------------------------------

    Attachment: LUCENE-3262.patch

Updated patch according to Shai's comments and with AddFacetedDoc task.

                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Gilad Barkai (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123650#comment-13123650 ] 

Gilad Barkai commented on LUCENE-3262:
--------------------------------------

Doron, great patch!

I ran it and was somewhat surprised at the large overhead of the facet indexing. Digging deeper, I found the number of random facets to be 1-120 per document, with depth of 1-8. I believe those are overkill requirements. I reduced those to 1-*20* per document with depth of 1-*3* and got results I could live with.
Those number are scenario dependent but I think most cases I encountered are closer to my proposed numbers. What do you think?

Also, I changed the alg to consume the entire content source.

I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.*depth* and maxFacetDepth. Depth seems more appropriate. 

Other than that - I'm thrilled to have a working benchmark with facets - thanks!
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122580#comment-13122580 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

bq. changes entry
Right, I always forget to include it in the patch, and add it only afterwords, should change that...

Also, I am not comfortable with the use of a config property in AddDocTask to tell that facets should be added. Seems too implicit to me, all of the sudden... So I think it would be better to refactor the doc creation in AddDoc into a method, and add AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super, and then add the facets into it.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123654#comment-13123654 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

bq. Those number are scenario dependent but I think most cases I encountered are closer to my proposed numbers

I think we should maybe have a Wiki page or something with several .alg files that test different scenarios. I don't think that 1-120 is an example we shouldn't test. Rather, we should describe, either in a Wiki or a JIRA issue, the performance results for each scenario. And if the results are suspicious for a particular scenario, dig deeper and understand why.

So given that you know the numbers from above were run with that many facets per document, do the numbers make sense? Or you still think they're high?

bq. I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth

+1.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:
--------------------------------

    Attachment: LUCENE-3262.patch

Updated patch with a test, more javadocs, and a comment as Shai suggested.

I think this is ready to commit.

More tests are needed, and also Search with facets is missing, but that can go in a separate issue.

                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3262) Facet benchmarking

Posted by "Toke Eskildsen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toke Eskildsen updated LUCENE-3262:
-----------------------------------

    Attachment: TestPerformanceHack.java
                CorpusGenerator.java

I've attached a second shot at faceting performance testing. It separates the taxonomy generation into a CorpusGenerator (maybe similar to the RandomTaxonomyWriter that Robert calls for in LUCENE-3264?).

Proper setup of faceting tweaks for the new faceting module is not done at all and not something I find myself qualified for.

> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>         Attachments: CorpusGenerator.java, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122770#comment-13122770 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

bq. Also that with.facets property is useful for comparisons

What do you mean? Someone can use AddFacetedDocTask w/ and w/o facets? What for? (sorry, but you didn't post a patch, so I cannot see what this is about)
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3262:
--------------------------------

    Attachment: LUCENE-3262.patch

Patch (3x)  with working facets indexing benchmark.
It follows the outline above, except that: 
- there is no FacetDocMaker - only FacetSource
- there is no AddDocWithFacet - instead, AddDoc takes an additional config param: with.facet

'ant run-task -Dtask.alg=conf/facets.alg' will run an algorithm that indexes facets.

Not ready to commit yet - need some testing and docs. Also, only covers indexing for now, though perhaps search with facets can go in a separate issue.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121728#comment-13121728 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

Patch looks good ! I have a couple of initial comments:

* facets.alg: as I often find these .alg files as examples, I think it would be good if it declares facet.source (to random) explicitly.

* OpenTaxonomyReaderTask: I see that since PerfRunData incRef() the incoming taxonomy, you decRef(). I also see that setIndexReader behaves the same way. But I find it confusing. Personally, since this is not an application, I don't think we should 'hold a reference to IR/LTR just in case the one who set it closes it'. But if we do that, can we at least document on setIR/LTR that this is the case? I can certainly see myself opening IR/LTR, setting on PerfRunData without decRef()/close(). It would not occur to me that I should ...

* The abstraction of ItemSource is nice. But it's jdocs still contain content.source.*. Since we're not committed to backwards compatibility in benchmark, and in the interest of clarity, perhaps we should rename them to item.source.*?

* ItemSource.resetInputs has a @SuppressWarnings("unused") -- is it a leftover from when it was private?

* In PerfRunData ctor you do a Class.forName using the String name of RandomFacetSource. Why not use RandomFacetSource.class.getName()?

Looks very good. Now with FacetSource we can generate facets per the case we want to test (dense hierarchies, Zipf'ian ...)
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121852#comment-13121852 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

bq. ItemSource.resetInputs

I don't have that warning turned on in Eclipse. I disabled it for exactly this reason :).

bq. ItemSource rename

The new name is ok, and the properties better fit it. BTW, if you wanted to have the .algs out there to not silently fail, you could add some code to setConfig that checks for these outdated properties, and throw a proper exception. But I'm ok with the solution you chose.

bq. PFD.readers.incRef()

The javadocs are good. I'd also add "<b>NOTE:</b> if you no longer need that IndexReader/TaxoReader, you should decRef()/close() after calling this method". Otherwise, the IR/TR will just stay open ...
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123205#comment-13123205 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

Ahh, forgot about iterations. It does indeed look useful then.

Perhaps mention facet.source in AddFacetedDocTask?

I'm +1 for committing the current progress, but keep this issue open for the search side (to complete the framework).
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Shai Erera (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122134#comment-13122134 ] 

Shai Erera commented on LUCENE-3262:
------------------------------------

bq. I think this is ready to commit.

+1. Perhaps just add a CHANGES entry?

bq. but that can go in a separate issue

I think it's better if we resolve it in that issue, and maybe rename the issue to "Facet benchmarking framework". You can still commit the current progress because it is 'whole' - covering the indexing side. I've worked on issues before that had several commits, so this will not be the first one.

We should also run some benchmark tests, describing clearly the data sets, but this can be done under a separate issue.
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3262) Facet benchmarking

Posted by "Doron Cohen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121844#comment-13121844 ] 

Doron Cohen commented on LUCENE-3262:
-------------------------------------

Thanks for reviewing Shai!

* facets.alg: I agree, and added that specific facet source setting. With also modified the algorithm to have several rounds, some with facets, some without, so the effect of adding facets on performance is measured.

* PFD.readers.incRef() - once the task (e.g. OpenReaderTask) opened the reader that task is gone and the reader really is maintained by the PFD, therefore IMO it is important that the PFD maintains its ref count as it does today. I agree that more documentation for this can help and added something in this spirit to both setReader()'s:

{code}
  /**
   * Set the taxonomy reader. Takes ownership of that taxonomy reader, that is,
   * internally performs taxoReader.incRef(). 
   * @param indexReader The indexReader to set.
   */
  public synchronized void setTaxonomyReader(TaxonomyReader taxoReader) throws IOException {

  }

  /**
   * Set the index reader. Takes ownership of that index reader, that is,
   * internally performs indexReader.incRef(). 
   * @param indexReader The indexReader to set.
   */
  public synchronized void setIndexReader(IndexReader indexReader) throws IOException {
  }
{code}

* ItemSource and config props "source.xyz" - I am afraid there might be algorithms out there that will silently stop working, as the props names are not safe types or something like that. So I would rather to not alter these property names. Instead, following your comment, I renamed *ItemSource* to *ContentItemsSource* which is still extended by both ContentSource and FacetSource. So it is not that bad now that the property names start with "content." (these properties apply to all content-item-sources - that is both doc-content and facets). 

* ItemSource.resetInputs @SuppressWarnings("unused") - this is for the IOException which is not really thrown by that code (though it might be thrown by subclasses). Perhaps just a misconfiguration of my IDE? (I assume you have that warning disabled or stg?)

* PerfRunData Class.forName on String rather than class.getName() - this is in fact the code pattern for all reflection actions in PerfRunData, so I just repeated for the new code whatever was there for the existing code. However, in fact, I thought about the same thing when adding this code. I think the difference is that if config contains a specification of another class, the default class would not even be loaded when using strings, but it will be loaded (though not used) when using Class.getName(). It is not a big performance issue, but it does seem "cleaner" to not load classes that are not going to be used. In any case, if we would like to change that, should revisit all the .forName(String) of default impls in the benchmark (and I think there are a few) and it seems to not belong to this issue.

Thanks for the review, working on a new patch - there are several copy/paste errors in the code where a CloseTaxonomyReader by mistake sets the PFD IndexReader to null...
                
> Facet benchmarking
> ------------------
>
>                 Key: LUCENE-3262
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3262
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/benchmark, modules/facet
>            Reporter: Shai Erera
>            Assignee: Doron Cohen
>         Attachments: CorpusGenerator.java, LUCENE-3262.patch, TestPerformanceHack.java
>
>
> A spin off from LUCENE-3079. We should define few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implement FST-based caches etc.).
> Toke attached a preliminary test case to LUCENE-3079, so I'll attach it here as a starting point.
> We've also done some preliminary job for extending Benchmark for faceting, so I'll attach it here as well.
> We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org