Posted to solr-user@lucene.apache.org by Jonathan Rochkind <ro...@jhu.edu> on 2010/07/27 22:44:32 UTC
min/max, StatsComponent, performance
I thought I asked a variation of this before, but I don't see it on the
list; apologies if this is a duplicate, but I have new questions.
So I need to find the min and max value of a result set. Which can be
several million documents. One way to do this is the StatsComponent.
One problem is that I'm having performance problems with StatsComponent
across so many documents: adding the stats component on the field I'm
interested in adds 10 seconds to my query response time.
So one question is if there's any way to increase StatsComponent
performance. Does it use any caches, or does it operate without caches?
My Solr is running near the top of its heap size, although I'm not
currently getting any OOM errors, perhaps not enough free memory is
somehow hurting StatsComponent performance. Or any other ideas for
increasing StatsComponent performance?
But it also occurs to me that the StatsComponent is doing a lot more
than I need. I just need min/max. And the cardinality of this field is a
couple orders of magnitude lower than the total number of documents. But
StatsComponent is also doing a bunch of other things, like sum, mean,
etc. Perhaps if there were a way to _just_ get min/max, it would be
faster. Is there any way to get min/max values in a result set other
than StatsComponent?
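For reference, the kind of request described above looks roughly like this, with my_field standing in for the actual field being queried:

```
q=*:*&stats=true&stats.field=my_field
```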
Jonathan
Re: min/max, StatsComponent, performance
Posted by Jonathan Rochkind <ro...@jhu.edu>.
Chris Hostetter wrote:
> Honestly: if you have a really small cardinality for these numeric
> values (ie: small enough to return every value on every request) perhaps
> you should use faceting to find the min/max values (with facet.mincount=1)
> instead of stats?
>
Thanks for the tips and info.
I can't figure out any way to use faceting to find min/max values. If I
do a facet.sort=index, and facet.limit=1, then the facet value returned
would be the min value... but how could I get the max value? There is
no facet.sort=rindex or what have you. Ah, you say small enough to
return every value on every request. Nope, it's not THAT small. I've
got about 3 million documents, and 2-10k unique integers in a field, and
I want to find the min/max.
I guess, if I both index and store the field (which I guess I have to do
anyway), I can find min and max via two separate queries. Sort by
my_field asc, sort by my_field desc, with rows=1 both times, get out the
stored field, that's my min/max.
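Sketched as request parameters (my_field is a placeholder for the actual field, and the original q and fq would be carried over unchanged), the two extra queries would be roughly:

```
q=...&sort=my_field asc&rows=1&fl=my_field
q=...&sort=my_field desc&rows=1&fl=my_field
```

The single my_field value returned by each query is the min and the max, respectively.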
That might be what I resort to. But it's a shame, StatsComponent can
give me the info "included" in the query I'm already making, as opposed
to requiring two additional queries on top of that -- which you'd think
would be _slower_, but in fact doesn't seem to be.
> I don't think so .. i believe Ryan considered this when he first added
> StatsComponent, but he decided it wasn't really worth the trouble -- all
> of the stats are computed in a single pass, and the majority of the time
> is spent getting the value of every doc in the set -- adding each value to
> a running total (for the sum and ultimately computing the mean) is a
> really cheap operation compared to the actual iteration over the set.
>
Yeah, it's really kind of a mystery to me why StatsComponent is being so
slow. StatsComponent is slower than faceting on the field, and is even
slower than the total time of: 1) First making the initial query,
filling all caches, 2) Then making two additional queries with the same
q/fq, but with different sorts to get min and max from the result set in
#1.
From what you say, there's no good reason for StatsComponent to be
slower than these alternatives, but it is, by an order of magnitude (1-2
seconds vs 10-15 seconds).
I guess I'd have to get into Java profiling/debugging to figure it out,
maybe a weird bug or mis-design somewhere I'm tripping.
Jonathan
Re: min/max, StatsComponent, performance
Posted by Chris Hostetter <ho...@fucit.org>.
: So one question is if there's any way to increase StatsComponent performance.
: Does it use any caches, or does it operate without caches? My Solr is running
I believe it uses the field cache to allow fast lookup of numeric values
for documents as it iterates through the document set -- there's not
really any sort of caching it can use that it isn't already using.
: But it also occurs to me that the StatsComponent is doing a lot more than I
: need. I just need min/max. And the cardinality of this field is a couple
: orders of magnitude lower than the total number of documents. But
the cardinality of the values isn't really relevant -- it still has to
check every doc in your set to see what value it has.
In things like faceting, term frequency can come into play because we can
make optimizations to see if a given term's index-wide frequency is less
than our cutoff, and if it is we can skip it completely w/o checking how
many docs in our set contain that value -- that type of optimization isn't
possible for min/max (although i suppose there is room for a possible
improvement of checking whether the min we've found so far is the "global"
min for that field, and if so not bothering to check any more docs ... that
seems like a really niche special-case optimization, but if you want to
submit a patch it might be useful).
Honestly: if you have a really small cardinality for these numeric
values (ie: small enough to return every value on every request) perhaps
you should use faceting to find the min/max values (with facet.mincount=1)
instead of stats?
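Roughly, such a request might look like this (my_field is a placeholder; facet.limit=-1 asks for all values, and facet.sort=index returns them in index order, so the first and last entries in the response are the min and max):

```
q=...&facet=true&facet.field=my_field&facet.sort=index&facet.mincount=1&facet.limit=-1
```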
: StatsComponent is also doing a bunch of other things, like sum, mean, etc.
: Perhaps if there were a way to _just_ get min/max, it would be faster. Is
: there any way to get min/max values in a result set other than StatsComponent?
I don't think so .. i believe Ryan considered this when he first added
StatsComponent, but he decided it wasn't really worth the trouble -- all
of the stats are computed in a single pass, and the majority of the time
is spent getting the value of every doc in the set -- adding each value to
a running total (for the sum and ultimately computing the mean) is a
really cheap operation compared to the actual iteration over the set.
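The single-pass idea can be sketched in plain Python (an illustration only, not Solr's actual implementation): the per-value arithmetic for sum and mean is trivial next to the cost of visiting every document's value in the first place.

```python
def stats_single_pass(values):
    """Compute min/max/sum/count/mean in one pass over the values.

    Illustrates why dropping sum/mean would save little: the loop over
    every value dominates; the extra arithmetic per value is cheap.
    """
    count = 0
    total = 0
    lo = hi = None
    for v in values:        # the expensive part: touching every doc's value
        count += 1
        total += v          # cheap extra work for sum (and mean at the end)
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    mean = total / count if count else None
    return {"min": lo, "max": hi, "sum": total, "count": count, "mean": mean}
```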
That said: if you wanna work on a patch and can demonstrate that making
these things configurable has performance improvements in the special
case w/o hurting performance in the default case, i don't think anyone
will argue against it.
-Hoss