Posted to solr-user@lucene.apache.org by Martin Grotzke <ma...@javakaffee.de> on 2007/07/21 21:29:56 UTC

How to read values of a field efficiently

Hi,

I have a custom Facet implementation that extends SimpleFacets
and overrides getTermCounts( String field ).

For the price field I calculate the available ranges, and for this I
have to read the values of this field. Right now it looks like
this:

    public NamedList getTermCounts( final String field ) throws IOException {
        SchemaField sf = searcher.getSchema().getField( field );
        FieldType ft = sf.getType();
        // per-document values for the field, read through its ValueSource
        final DocValues docValues = ft.getValueSource( sf ).getValues( searcher.getReader() );
        // docs holds the documents matched by the current request
        final DocIterator iter = docs.iterator();
        final TIntArrayList prices = new TIntArrayList( docs.size() );
        while (iter.hasNext()) {
            float value = docValues.floatVal(iter.next());
            prices.add( (int)value );
        }
        // calculate ranges and return the result
    }

This part (reading field values) takes fairly long compared
to the other fields (that use getFacetTermEnumCounts or
getFieldCacheCounts as implemented in SimpleFacets), so
I assume that there is potential for optimization.

Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
for the second request, while reading prices takes ~600 ms.

Is there a better way (in terms of performance) to determine
the values for the found docs?

Thanx in advance,
cheers,
Martin

Re: How to read values of a field efficiently

Posted by Martin Grotzke <ma...@javakaffee.de>.

On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> : Is it possible to get the values from the ValueSource (or from
> : getFieldCacheCounts) sorted by its natural order (from lowest to
> : highest values)?
> 
> well, an inverted term index is already a data structure listing terms
> from lowest to highest and the associated documents -- so if you want to
> iterate from low to high between a range and find matching docs you should
> just use the TermEnum
Ok. Unfortunately I don't see how I can get a TermEnum for a specific
field (e.g. "price")... I tried

TermEnum te = searcher.getReader().terms(new Term(field, ""));

but this also returns terms for several other fields.
Is it possible at all to get a TermEnum for a specific field?

Then if I had this TermEnum, how can I check if a Term is in my
DocSet? In other words, I would like to read Terms for a specific
field from my DocSet - so that I could determine all price terms
for my DocSet.

Is there a way to achieve this?
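
A rough sketch of the idea behind this question, using a toy model rather
than the Lucene API: if terms are kept sorted by field and then by text, an
enumeration that starts at (field, "") reaches that field's first term and
then simply keeps going into later fields, so the caller has to stop at the
first term whose field differs. The class FieldTermWalk, its TreeMap-based
dictionary, the '\0' key separator and the BitSet standing in for the DocSet
are all assumptions made for illustration. The zero-padded price strings are
an assumption too, chosen so that string order matches numeric order.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a sorted term dictionary; it is not the Lucene API. */
final class FieldTermWalk {

    /** Key is field + '\0' + text, so a field's terms are contiguous and sorted. */
    private final TreeMap<String, int[]> postings = new TreeMap<>();

    /** Registers a term of a field together with the ids of the docs containing it. */
    void add(String field, String text, int... docIds) {
        postings.put(field + '\0' + text, docIds);
    }

    /**
     * Walks the terms of one field in sorted order and, for each term,
     * counts the documents that are also in the found set.
     * Returns "text=count" entries for terms with at least one match.
     */
    List<String> countsFor(String field, BitSet foundDocs) {
        List<String> result = new ArrayList<>();
        String prefix = field + '\0';
        // Seek to the first term at or after (field, "").
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix, true).entrySet()) {
            // The walk would continue into later fields, so stop at the first foreign term.
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            int count = 0;
            for (int doc : e.getValue()) {
                if (foundDocs.get(doc)) {
                    count++;
                }
            }
            if (count > 0) {
                result.add(e.getKey().substring(prefix.length()) + "=" + count);
            }
        }
        return result;
    }
}
```

Because the walk follows term order, the matching prices come out already
sorted. The cost is one pass over the postings of every term in the field,
which may or may not be cheaper than reading one value per found document.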

Thanx in advance,
cheers,
Martin


>  -- the whole point of the FieldCache (and
> FieldCacheSource) is to have a "reverse inverted index" so you can quickly
> fetch the indexed value if you know the docId.
> 
> perhaps you should elaborate a little more on what it is you are trying to
> do so we can help you figure out how to do it more efficiently ... i know
> you mentioned computing price ranges in your first message ... but you
> also didn't post any clear code about that part of your problem, just that
> the *other* part of your code that iterated over every doc was too slow
> ... perhaps you shouldn't be iterating over every doc to figure out your
> ranges .. perhaps you can iterate over the terms themselves?
> 
> 
> hang on ... rereading your first message i just noticed something i
> definitely didn't spot before...
> 
> >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> >> for the second request, while reading prices takes ~600 ms.
> 
> ...i clearly missed this, and fixated on your assertion that your reading
> of field values took longer than the stock methods -- but you're not just
> comparing the time needed by different methods, you're also timing
> different fields.
> 
> this actually makes a lot of sense since there are probably a lot fewer
> unique values for the cat field, so there are a lot fewer discrete values
> to deal with when computing counts.
> 
> 
> 
> 
> -Hoss
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

Re: How to read values of a field efficiently

Posted by Martin Grotzke <ma...@javakaffee.de>.

On Mon, 2007-07-30 at 00:30 -0700, Chris Hostetter wrote:
> : Is it possible to get the values from the ValueSource (or from
> : getFieldCacheCounts) sorted by its natural order (from lowest to
> : highest values)?
> 
> well, an inverted term index is already a data structure listing terms
> from lowest to highest and the associated documents -- so if you want to
> iterate from low to high between a range and find matching docs you should
> just use the TermEnum -- the whole point of the FieldCache (and
> FieldCacheSource) is to have a "reverse inverted index" so you can quickly
> fetch the indexed value if you know the docId.
Ok, I will have a look at the TermEnum and try this.

> 
> perhaps you should elaborate a little more on what it is you are trying to
> do so we can help you figure out how to do it more efficiently ...
I want to read all values of the price field of the found docs,
and calculate the mean value and the standard deviation.
Based on the min value (mean - deviation), the max value (mean +
deviation) and the number of prices I calculate price ranges.
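
A minimal sketch of that calculation, assuming the prices are already
available as an int[]: the mean and the standard deviation can be computed
in one pass (Welford's method), and the two bounds follow directly. The
class name PriceStats and the choice of the population (not sample)
standard deviation are assumptions; how the window is then split into
ranges is not shown here.

```java
/** Sketch: mean and standard deviation of prices in one pass (Welford's method). */
final class PriceStats {
    final double mean;
    final double stdDev;

    PriceStats(int[] prices) {
        double m = 0.0; // running mean
        double s = 0.0; // running sum of squared deviations from the mean
        int n = 0;
        for (int p : prices) {
            n++;
            double delta = p - m;
            m += delta / n;
            s += delta * (p - m);
        }
        this.mean = m;
        // Population standard deviation (an assumption); empty input yields 0.
        this.stdDev = n > 0 ? Math.sqrt(s / n) : 0.0;
    }

    /** Lower bound of the range window: mean - deviation. */
    double lower() {
        return mean - stdDev;
    }

    /** Upper bound of the range window: mean + deviation. */
    double upper() {
        return mean + stdDev;
    }
}
```

Neither statistic depends on the order of the values, so this step does
not need sorted input.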

Then I iterate over the sorted array of prices and count how many
prices go into the current range.

This sorting (Arrays.sort) takes a lot of time, that's why I asked if
it's possible to read the values in sorted order.

But reading this, I think it would also be possible to skip sorting and
check for each price into which bucket it would go and increment the
counter for this bucket - this should also be a possibility for
optimization.
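
The bucket idea could be sketched roughly as follows: each price is mapped
straight to the index of its range, which takes one pass over the values
instead of a sort followed by a scan. The class PriceBuckets, the
equal-width split of [lower, upper] and the clamping of outliers into the
first or last bucket are assumptions made for illustration, not the
actual range calculation.

```java
/** Sketch: count prices per range in one pass, without sorting them first. */
final class PriceBuckets {

    /**
     * Splits [lower, upper] into bucketCount equal-width ranges (an assumption)
     * and returns how many prices fall into each. Prices outside the window
     * are clamped into the first or last bucket.
     */
    static int[] count(int[] prices, double lower, double upper, int bucketCount) {
        if (bucketCount <= 0 || upper <= lower) {
            throw new IllegalArgumentException("need bucketCount > 0 and upper > lower");
        }
        int[] counts = new int[bucketCount];
        double width = (upper - lower) / bucketCount;
        for (int price : prices) {
            // Index of the range this price belongs to.
            int idx = (int) Math.floor((price - lower) / width);
            if (idx < 0) {
                idx = 0;
            }
            if (idx >= bucketCount) {
                idx = bucketCount - 1;
            }
            counts[idx]++;
        }
        return counts;
    }
}
```

For example, PriceBuckets.count(prices, 0, 30, 3) counts into the ranges
[0,10), [10,20) and [20,30], with anything above 30 landing in the last one.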

> ... perhaps you shouldn't be iterating over every doc to figure out your
> ranges .. perhaps you can iterate over the terms themselves?
Are you referring to TermEnum with this?

Thanx && cheers,
Martin


> 
> 
> hang on ... rereading your first message i just noticed something i
> definitely didn't spot before...
> 
> >> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> >> for the second request, while reading prices takes ~600 ms.
> 
> ...i clearly missed this, and fixated on your assertion that your reading
> of field values took longer than the stock methods -- but you're not just
> comparing the time needed by different methods, you're also timing
> different fields.
> 
> this actually makes a lot of sense since there are probably a lot fewer
> unique values for the cat field, so there are a lot fewer discrete values
> to deal with when computing counts.
> 
> 
> 
> 
> -Hoss
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

Re: How to read values of a field efficiently

Posted by Chris Hostetter <ho...@fucit.org>.

: Is it possible to get the values from the ValueSource (or from
: getFieldCacheCounts) sorted by its natural order (from lowest to
: highest values)?

well, an inverted term index is already a data structure listing terms
from lowest to highest and the associated documents -- so if you want to
iterate from low to high between a range and find matching docs you should
just use the TermEnum -- the whole point of the FieldCache (and
FieldCacheSource) is to have a "reverse inverted index" so you can quickly
fetch the indexed value if you know the docId.
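
To illustrate the contrast, here is a toy model (not the Lucene or Solr
classes): the inverted index maps each term to the docs containing it, and
"un-inverting" it yields an array indexed by docId, so that a lookup by
docId is a single array read. The class ForwardArraySketch, the Integer
term keys and the "missing" default value are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Toy model contrasting an inverted index with a per-document value array. */
final class ForwardArraySketch {

    /**
     * Un-inverts term -> docIds postings into an array indexed by docId.
     * Documents without a value keep the given default.
     */
    static int[] uninvert(TreeMap<Integer, int[]> postings, int maxDoc, int missing) {
        int[] valueByDoc = new int[maxDoc];
        Arrays.fill(valueByDoc, missing);
        // Terms are visited from lowest to highest; each doc gets its term's value.
        for (Map.Entry<Integer, int[]> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                valueByDoc[doc] = e.getKey();
            }
        }
        return valueByDoc;
    }
}
```

Building the array touches every term once, which is why it pays off only
when it is cached and reused across requests.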

perhaps you should elaborate a little more on what it is you are trying to
do so we can help you figure out how to do it more efficiently ... i know
you mentioned computing price ranges in your first message ... but you
also didn't post any clear code about that part of your problem, just that
the *other* part of your code that iterated over every doc was too slow
... perhaps you shouldn't be iterating over every doc to figure out your
ranges .. perhaps you can iterate over the terms themselves?


hang on ... rereading your first message i just noticed something i
definitely didn't spot before...

>> Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
>> for the second request, while reading prices takes ~600 ms.

...i clearly missed this, and fixated on your assertion that your reading
of field values took longer than the stock methods -- but you're not just
comparing the time needed by different methods, you're also timing
different fields.

this actually makes a lot of sense since there are probably a lot fewer
unique values for the cat field, so there are a lot fewer discrete values
to deal with when computing counts.




-Hoss

Re: How to read values of a field efficiently

Posted by Martin Grotzke <ma...@javakaffee.de>.

On Mon, 2007-07-23 at 23:32 -0700, Chris Hostetter wrote:
> : This part (reading field values) takes fairly long compared
> : to the other fields (that use getFacetTermEnumCounts or
> : getFieldCacheCounts as implemented in SimpleFacets), so that
> : I assume that there is potential for optimization.
> :
> : Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
> : for the second request, while reading prices takes ~600 ms.
> 
> using the ValueSource from the field should be roughly as fast as using
> the FieldCache since it's backed by the fieldcache ... 
Ok, I didn't know that.

> although a few
> things jump out at me as kind of odd...
> 
> 1) why ask for the float value and then cast it to int?
You're right, I just changed this, unfortunately it doesn't save time...

> 2) what is a TIntArrayList and what kind of overhead does calling add have?
A list implementation for primitive int values, but I tested this before
and there's no measurable overhead.
I already changed this to a primitive int[] array, also no time savings.

> 3) if getFieldCacheCounts is fast enough, why not base your code on that
>    instead of getting the ValueSource?
I just tried this, and mostly (but not always) it's faster than going
through the ValueSource. To use it I have to adjust my range calculation
and then check whether the whole thing is faster or slower.

Is it possible to get the values from the ValueSource (or from
getFieldCacheCounts) sorted by its natural order (from lowest to
highest values)?

If this would not take much more time, I wouldn't need to sort
the values myself (which I do for the range calculation). The sorting
also takes a fair amount of time (mostly as much as getting the values
from the ValueSource, and sometimes even more).

Thanx for your help,
cheers,
Martin


> 
> 
> 
> 
> 
> -Hoss
>

Re: How to read values of a field efficiently

Posted by Chris Hostetter <ho...@fucit.org>.

: This part (reading field values) takes fairly long compared
: to the other fields (that use getFacetTermEnumCounts or
: getFieldCacheCounts as implemented in SimpleFacets), so that
: I assume that there is potential for optimization.
:
: Fairly long: getFieldCacheCounts for the cat field takes ~70 ms
: for the second request, while reading prices takes ~600 ms.

using the ValueSource from the field should be roughly as fast as using
the FieldCache since it's backed by the fieldcache ... although a few
things jump out at me as kind of odd...

1) why ask for the float value and then cast it to int?
2) what is a TIntArrayList and what kind of overhead does calling add have?
3) if getFieldCacheCounts is fast enough, why not base your code on that
   instead of getting the ValueSource?





-Hoss