You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/05/05 19:16:00 UTC

[jira] [Commented] (SOLR-7461) StatsComponent, calcdistinct, ability to disable distinctValues while keeping countDistinct

    [ https://issues.apache.org/jira/browse/SOLR-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14528822#comment-14528822 ] 

Hoss Man commented on SOLR-7461:
--------------------------------

James: sorry, i overlooked your comments until now...

bq. Regarding the hyperloglog approach:

please raise HLL related discussions in the comments of SOLR-6968 - been making lots of progress there, would love your opinions & help testing, but i don't want to have the conversation fractured in two diff places for other people watching that issue.

----

bq. Unfortunately, for me, being able to get exact (or very nearly exact) distinct counts is important for the application I'm building (it's an analytics / stats tool so correct numbers are a must have feature)

I understand ... the reason i popped back up in this issue today was to comment that: 

1) as I work more with HLL and really think about the internals, i realized that for people who care a lot about absolutely accurate precision, especially in low cardinality situations, i think you are correct -- fixing this issue (countdistinct w/o distinctValues) would be a big win -- even once HLL support is added.  The simple reason being that HLL is fundamentally based on counting unique *hashed* values, so even if someone tries to tune it to be as accurate as possible, there's still the risk that two distinct values in a users's set will have the same hash value and prevent 100% accurate cardinality results.
2) as i worked on refactoring the tests in SOLR-6968 it occurred to me that what you're requesting here would actually be fairly easy to implement and support in a backcompat way -- in particular because of the changes already made in SOLR-6349 (distinguishing per-shard dependent stats) and the fact that the current param is "calcdistinct" but the stats currently returned are "countDistinct" and "distinctValues"

In a nutshell, what i think we should do is...

* replace {{calcdistinct(true)}} value in the {{Stat}} enum with {{distinctvals(true)}} and {{countdistinct(false,distinctvals)}} (so countdistinct indicates that in a distributed setup it requires distinctvals to be returned by each shard)
* remove all the calcDistinct logic from {{StatsValuesFactory}} and instead treat {{countdistinct}} and {{distinctvals}} as independently _requested_ stats that have a dependent distributed computation relationship (just like "mean" vs "sum / count")
* update the existing special syntactic sugar parsing for the top level "calcdistinct" param to also include syntatic sugar parsing for the calcdistinct local param - in both cases they become sugar for {{countdistinct=true distinctvals=true}}

the end result being that users who want the {{countDistinct}} response info, w/o the list of all {{distinctVals}} would just specify {{countdistinct=true}} as a localparam (no other new top level params needed)

How does that sound?

these changes should already be fairly trivial to make if you want to take a stab at a patch -- but there may be some patch collision in the code with the work i'm doing in SOLR-6968. There will definitely be patch collisions in the existing test cases, but once SOLR-6968 is commited the test changes should actually be _much_ easier ... and i'm happy to try and tackle this as soon as i'm done with that one (if nothing else, it will help me do a better apples to apples benchmark of HLL vs countDistinct w/o the added distincatVals in the final result)


> StatsComponent, calcdistinct, ability to disable distinctValues while keeping countDistinct
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7461
>                 URL: https://issues.apache.org/jira/browse/SOLR-7461
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: James Andres
>              Labels: statscomponent
>
> When using calcdistinct with large amounts of data the distinctValues field can be extremely large. In cases where the countDistinct is only required it would be helpful if the server did not return distinctValues in the response.
> I'm no expert, but here are some ideas for how the syntax could look.
> {code}
> # Both countDistinct and distinctValues are returned, along with all other stats
> stats.calcdistinct=true&stats.field=myfield
> # Only countDistinct and distinctValues are returned
> stats.calcdistinct=true&stats.field={!countDistinct=true distinctValues=true}myfield
> # Only countDistinct is returned
> stats.calcdistinct=true&stats.field={!countDistinct=true}myfield
> # Only distinctValues is returned
> stats.calcdistinct=true&stats.field={!distinctValues=true}myfield
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org