Posted to solr-user@lucene.apache.org by Chet Vora <ch...@gmail.com> on 2013/08/06 17:57:47 UTC

TermRangeTermsEnum usage and performance

Hi

I have an index in which each document has a double value (within a known
range) and an associated tag. I am trying to find all the docs that match
a certain tag (or combination of tags) and fall within a certain range.

I'm trying to use the TermRangeTermsEnum from the Flex API as part of a
custom parser. This is how I'm using it (in the getDocIdSet() method).


        Terms myField = fields.terms("Count"); // the field I'm interested in for the range enum
        termsEnum = myField.iterator(termsEnum);
        BytesRef lowerBound = new BytesRef();
        NumericUtils.longToPrefixCodedBytes(NumericUtils.doubleToSortableLong(lower), 0, lowerBound);
        BytesRef upperBound = new BytesRef();
        NumericUtils.longToPrefixCodedBytes(NumericUtils.doubleToSortableLong(upper), 0, upperBound);
        TermRangeTermsEnum termRangeTermsEnum =
                new TermRangeTermsEnum(termsEnum, lowerBound, upperBound, true, true);

        DocsEnum docs = null;
        FixedBitSet rangeFilter = new FixedBitSet(reader.maxDoc());
        // Create a bitset of all docs that pass the range filter
        while (termRangeTermsEnum.next() != null) {
            // no freqs since we don't need them
            docs = termRangeTermsEnum.docs(startResults, docs, DocsEnum.FLAG_NONE);
            while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                rangeFilter.set(docs.docID());
            }
        }

        Terms tagField = fields.terms("Tag"); // the other field I want to filter by
        termsEnum = tagField.iterator(termsEnum);
        // filter by docs that match the tag
        // (tags is a field of the enclosing class: private String[] tags;)
        Set<Integer> myIds = new HashSet<Integer>();

        for (String s : tags) {
            ref = new BytesRef(s);
            // don't use the cache since we could easily pollute it here
            if (termsEnum.seekExact(ref, false)) {
                // no freqs since we don't need them
                docs = termsEnum.docs(rangeFilter, docs, DocsEnum.FLAG_NONE);
                while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    myIds.add(docs.docID());
                }
            }
        }

This does return the results I want, but it doesn't perform very well. By
comparison, using a plain TermsEnum and checking the range by hand performs
much better: it is an order of magnitude faster for a small number of
records (<1000) and about 3-4 times faster for larger sets.

        Terms tagField = fields.terms("Tag");
        termsEnum = tagField.iterator(termsEnum);
        Set<Integer> myIds = new HashSet<Integer>();
        double value;
        for (String s : tags) {
            ref = new BytesRef(s);
            // don't use the cache since we could easily pollute it here
            if (termsEnum.seekExact(ref, false)) {
                // no freqs since we don't need them
                docs = termsEnum.docs(initialSet, docs, DocsEnum.FLAG_NONE);
                while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    value = cache.get(docs.docID());
                    // check the range (here the bounds are plain doubles)
                    if (value >= lowerBound && value <= upperBound) {
                        myIds.add(docs.docID());
                    }
                }
            }
        }
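
To make the difference between the two listings concrete, here is a minimal
stand-alone sketch that models both strategies with plain java.util
collections and counts how many postings each one touches. It does not use
Lucene at all: the class RangeFilterCostSketch, its maps, and the sample data
are all hypothetical stand-ins, and the counter is only a rough proxy for
work done, not a measurement of the real code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: every name here is hypothetical, nothing is a Lucene API.
class RangeFilterCostSketch {
    final TreeMap<Double, List<Integer>> valuePostings = new TreeMap<>(); // "Count" term -> docs
    final Map<String, List<Integer>> tagPostings = new HashMap<>();       // "Tag" term -> docs
    final Map<Integer, Double> valueByDoc = new HashMap<>();              // per-doc value cache
    int maxDoc;
    long visited; // postings touched by the most recent query

    void add(int doc, double value, String tag) {
        valuePostings.computeIfAbsent(value, k -> new ArrayList<>()).add(doc);
        tagPostings.computeIfAbsent(tag, k -> new ArrayList<>()).add(doc);
        valueByDoc.put(doc, value);
        maxDoc = Math.max(maxDoc, doc + 1);
    }

    // Strategy 1: walk every value term in [lower, upper], mark its docs in a
    // bitset, then walk the tag postings and keep docs present in the bitset.
    Set<Integer> rangeFirst(double lower, double upper, String... tags) {
        visited = 0;
        BitSet inRange = new BitSet(maxDoc);
        for (List<Integer> docs : valuePostings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                inRange.set(doc);
                visited++;
            }
        }
        Set<Integer> hits = new HashSet<>();
        for (String tag : tags) {
            for (int doc : tagPostings.getOrDefault(tag, Collections.<Integer>emptyList())) {
                visited++;
                if (inRange.get(doc)) hits.add(doc);
            }
        }
        return hits;
    }

    // Strategy 2: walk only the tag postings and check each doc's cached value.
    Set<Integer> tagFirst(double lower, double upper, String... tags) {
        visited = 0;
        Set<Integer> hits = new HashSet<>();
        for (String tag : tags) {
            for (int doc : tagPostings.getOrDefault(tag, Collections.<Integer>emptyList())) {
                visited++;
                double value = valueByDoc.get(doc);
                if (value >= lower && value <= upper) hits.add(doc);
            }
        }
        return hits;
    }

    // Made-up sample data: 1000 docs with distinct values and one rare tag.
    static RangeFilterCostSketch sample() {
        RangeFilterCostSketch idx = new RangeFilterCostSketch();
        for (int i = 0; i < 1000; i++) {
            idx.add(i, i * 0.5, i % 100 == 0 ? "a" : "b");
        }
        return idx;
    }

    public static void main(String[] args) {
        RangeFilterCostSketch idx = sample();
        Set<Integer> first = idx.rangeFirst(0.0, 400.0, "a");
        System.out.println("range first: hits=" + first.size() + " visited=" + idx.visited);
        Set<Integer> second = idx.tagFirst(0.0, 400.0, "a");
        System.out.println("tag first:   hits=" + second.size() + " visited=" + idx.visited);
    }
}
```

In this toy model both strategies return the same doc ids, but the
range-first version touches one posting per doc in the range before it even
looks at the tags, while the tag-first version touches only the docs for the
requested tags. Whether that accounts for the timings I'm seeing in the real
index is an assumption on my part.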


Is this the expected usage of TermRangeTermsEnum? Is this the expected
performance also? Any pointers or helpful references are welcome.

Thanks,
CV