Posted to dev@lucene.apache.org by "Terrance A. Snyder (JIRA)" <ji...@apache.org> on 2013/07/02 05:15:23 UTC
[jira] [Comment Edited] (SOLR-2242) Get distinct count of names for a facet field

    [ https://issues.apache.org/jira/browse/SOLR-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697454#comment-13697454 ] 

Terrance A. Snyder edited comment on SOLR-2242 at 7/2/13 3:13 AM:
------------------------------------------------------------------

[~otis] I got the email. I'll give some background, since we've enhanced and combined several things, but I should be able to put together a patch in the coming week. There is an old version on GitHub that I need to update to trunk, and I'll spend time doing this. Most of this work was enhancing two existing JIRA items, which are wonderful.

Core Work:
https://issues.apache.org/jira/browse/SOLR-2894
https://issues.apache.org/jira/browse/SOLR-3583

Newer features:

+ Some of the issues that have been discussed around distributed counting have already been solved in larger installations (counting billions of items). I work in the advertising space, where we count, slice, and dice things. Sending highly unique facet values such as session ID or cookie ID between shards, across 90+ billion documents, is hugely wasteful and doesn't scale.

+ The ad industry is great at counting stuff "at scale": sessions, web events, etc. We take the stance that counts can be "roughly" right once we get to billions; an error rate of plus or minus 0-1.5% is OK when the response time goes from minutes to milliseconds. As such, an optional "estimated count" parameter has been added, which leverages a HyperLogLog implementation to give a roughly 98.5% accurate response. This is turned on by default for us on a large installation (multiple billions of POS transactions).

*HyperLogLog*

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
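
To illustrate why a sketch helps with distributed distinct counts, here is a minimal, self-contained HyperLogLog in Python. It is only a sketch under stated assumptions: it is not stream-lib's implementation or API, and the class name, the `p` parameter, and the SHA-1 based hash are illustrative choices. The point it shows is that each shard only needs to ship a fixed-size register array, and the coordinator merges arrays by taking the element-wise maximum instead of exchanging every distinct term.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical class, not stream-lib)."""

    def __init__(self, p=14):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches is an element-wise max of their registers.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical shards whose session IDs partially overlap.
shard_a, shard_b = HyperLogLog(), HyperLogLog()
for i in range(0, 30000):
    shard_a.add(f"session-{i}")
for i in range(20000, 50000):
    shard_b.add(f"session-{i}")

merged = HyperLogLog()
merged.merge(shard_a)
merged.merge(shard_b)
print("estimated distinct sessions:", round(merged.count()))
```

With `p=14` the sketch holds 16384 small registers regardless of how many values are added, and the expected standard error is about 1.04 / sqrt(16384), or roughly 0.8%. Adding the two per-shard estimates instead of merging the sketches would double-count the overlapping IDs.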

*Questions as I'd like to actually do this right*

+ Rather than reinvent the wheel, I use stream-lib (https://github.com/clearspring/stream-lib). It is Apache-licensed and includes HyperLogLog, HyperLogLogPlus, BloomFilters, TopK, QDigest, etc. Is depending on it an issue?

+ Test cases - I've got 82% code coverage - is this good enough?

+ Documentation - I've got markdown documents that cover the commands and syntax - is this the right format?

+ SOLR-2894, SOLR-3583 - It makes logical sense that these start to be joined together. When using all of these together, Solr starts to smell like an analytics engine to me (and it's a very nice one when combined with probabilistic data structures).

If someone can answer the above questions while I sync to /trunk please let me know.

Old version, kept for posterity until I get around to updating to the latest trunk. It has only minor updates and does not include the HyperLogLog sketching.
https://github.com/terrancesnyder/solr-analytics/blob/master/solr/core/src/java/org/apache/solr/handler/component/PivotFacetHelper.java
                
> Get distinct count of names for a facet field
> ---------------------------------------------
>
>                 Key: SOLR-2242
>                 URL: https://issues.apache.org/jira/browse/SOLR-2242
>             Project: Solr
>          Issue Type: New Feature
>          Components: Response Writers
>    Affects Versions: 4.0-ALPHA
>            Reporter: Bill Bell
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: SOLR-2242-3x_5_tests.patch, SOLR-2242-3x.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.patch, SOLR-2242.shard.withtests.patch, SOLR-2242.solr3.1-fix.patch, SOLR-2242.solr3.1.patch, SOLR.2242.solr3.1.patch, SOLR-2242-solr40-3.patch
>
>
> When returning facet.field=<name of field> you will get a list of matches for distinct values. This is normal behavior. This patch tells you how many distinct values you have (# of rows). Use with limit=-1 and mincount=1.
> The feature is called "namedistinct". Here is an example:
> Parameters:
> facet.numTerms or f.<field>.facet.numTerms = true (default is false) - turn on distinct counting of terms
> facet.field - the field whose terms are counted
> It creates a new section in the facet section...
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=false&facet.limit=-1&facet.field=price
> http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=*:*&facet=true&facet.mincount=1&facet.numTerms=true&facet.limit=-1&facet.field=price
> This currently only works on facet.field.
> {code}
> <lst name="facet_counts">
> <lst name="facet_queries"/>
> <lst name="facet_fields">...</lst>
> <lst name="facet_numTerms">
> <lst name="localhost:8983/solr/">
> <int name="price">14</int>
> </lst>
> <lst name="localhost:8080/solr/">
> <int name="price">14</int>
> </lst>
> </lst>
> <lst name="facet_dates"/>
> <lst name="facet_ranges"/>
> </lst>
> OR with no sharding-
> <lst name="facet_numTerms">
> <int name="price">14</int>
> </lst>
> {code} 
> Several people use this to get the group.field count (the # of groups).
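
As a rough illustration of consuming the `facet_numTerms` section quoted above, the following Python sketch parses both the sharded and the unsharded form. The function name and the `"(unsharded)"` key are hypothetical, and the XML is a trimmed copy of the example in the description rather than output captured from a live server.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sharded example from the issue description.
SHARDED = """
<lst name="facet_numTerms">
  <lst name="localhost:8983/solr/">
    <int name="price">14</int>
  </lst>
  <lst name="localhost:8080/solr/">
    <int name="price">14</int>
  </lst>
</lst>
"""

# Trimmed copy of the unsharded example.
UNSHARDED = """
<lst name="facet_numTerms">
  <int name="price">14</int>
</lst>
"""


def parse_num_terms(xml_text):
    """Return {shard name: {field: distinct term count}} (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    counts = {}
    # Unsharded form: <int> elements sit directly under facet_numTerms.
    direct = {e.get("name"): int(e.text) for e in root.findall("int")}
    if direct:
        counts["(unsharded)"] = direct
    # Sharded form: one nested <lst> per shard.
    for shard in root.findall("lst"):
        counts[shard.get("name")] = {
            e.get("name"): int(e.text) for e in shard.findall("int")
        }
    return counts


print(parse_num_terms(SHARDED))
print(parse_num_terms(UNSHARDED))
```

Note that the per-shard numbers cannot simply be added to get a global distinct count: the same `price` value may exist on both shards, so in the sharded example the true total is somewhere between 14 and 28.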
