Posted to dev@lucene.apache.org by Chet Vora <ch...@gmail.com> on 2013/08/07 15:17:36 UTC
TermRangeTermsEnum
Hi
Posting this to dev as well, since it relates to Lucene internals.
I have an index consisting of a double value that can range between certain
values and an associated tag. I am trying to find all the docs which match
a certain tag (or combination of tags) and a certain range. I'm trying to
use the TermRangeTermsEnum from the Flex API as part of a custom parser.
This is how I'm using it (in the getDocIdSet() method).
Terms myField = fields.terms("Count"); // this is the field I'm interested in for range enum
termsEnum = myField.iterator(termsEnum);
BytesRef lowerBound = new BytesRef();
NumericUtils.longToPrefixCodedBytes(NumericUtils.doubleToSortableLong(lower), 0, lowerBound);
BytesRef upperBound = new BytesRef();
NumericUtils.longToPrefixCodedBytes(NumericUtils.doubleToSortableLong(upper), 0, upperBound);
TermRangeTermsEnum termRangeTermsEnum = new TermRangeTermsEnum(termsEnum, lowerBound, upperBound, true, true);
DocsEnum docs = null;
FixedBitSet rangeFilter = new FixedBitSet(reader.maxDoc());
// Create a bitset of all docs that pass range filter
while (termRangeTermsEnum.next() != null) {
    docs = termRangeTermsEnum.docs(startResults, docs, DocsEnum.FLAG_NONE); // no freq since we don't need them
    while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        rangeFilter.set(docs.docID());
    }
}
Terms tagField = fields.terms("Tag"); // the other field I want to filter by
termsEnum = tagField.iterator(termsEnum);
// filter by docs who match the tag
// (tags is a field of the enclosing class: private String[] tags;)
Set<Integer> myIds = new HashSet<Integer>();
for (String s : tags) {
    ref = new BytesRef(s);
    if (termsEnum.seekExact(ref, false)) { // don't use cache since we could pollute the cache here easily
        docs = termsEnum.docs(rangeFilter, docs, DocsEnum.FLAG_NONE); // no freq since we don't need them
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            myIds.add(docs.docID());
        }
    }
}
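As a side note on the bounds above: they are encoded with doubleToSortableLong before being prefix-coded, so that term order matches numeric order. The following is a minimal stand-alone sketch of that idea in plain Java, not Lucene's own code. The class name SortableDoubleSketch and the method toSortableLong are made up for illustration.

```java
public class SortableDoubleSketch {

    // Hypothetical helper illustrating the idea behind a "sortable long":
    // take the IEEE 754 bits of the double and, for negative values, flip
    // the 63 non-sign bits so that signed long order matches double order.
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    public static void main(String[] args) {
        // Values listed in ascending numeric order.
        double[] values = {-10.5, -0.25, 0.0, 0.25, 3.0, 1e9};
        for (int i = 1; i < values.length; i++) {
            long previous = toSortableLong(values[i - 1]);
            long current = toSortableLong(values[i]);
            // Each pair should keep its order after encoding.
            System.out.println(values[i - 1] + " < " + values[i]
                    + " -> encoded order preserved: " + (previous < current));
        }
    }
}
```

Because the encoding keeps the order, an inclusive range over doubles can be expressed as an inclusive range over the encoded terms, which is what the lower and upper bounds above rely on.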
This returns the results I want but doesn't perform very well. By
comparison, using a plain TermsEnum and checking the range by hand performs
much better: it is an order of magnitude faster for fewer than 1000 records
and about 3-4 times faster for more.
Terms tagField = fields.terms("Tag");
termsEnum = tagField.iterator(termsEnum);
Set<Integer> myIds = new HashSet<Integer>();
double value;
for (String s : tags) {
    ref = new BytesRef(s);
    if (termsEnum.seekExact(ref, false)) { // don't use cache since we could pollute the cache here easily
        docs = termsEnum.docs(initialSet, docs, DocsEnum.FLAG_NONE); // no freq since we don't need them
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            value = cache.get(docs.docID());
            if (value >= lowerBound && value <= upperBound) // check for the range
                myIds.add(docs.docID());
        }
    }
}
Is this the expected usage of TermRangeTermsEnum? Is this the expected
performance as well? Any pointers or helpful references for doing this in a
more performant way are welcome.
Regards,
CV
RE: TermRangeTermsEnum
Posted by Uwe Schindler <uw...@thetaphi.de>.
Why don’t you use NumericRangeQuery’s enum? If the field is indexed as NumericField this should work.
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de <http://www.thetaphi.de/>
eMail: uwe@thetaphi.de
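To illustrate why an enum built for numeric fields may touch far fewer terms than a scan over full-precision terms, here is a rough stand-alone sketch in plain Java. It is a simplified model of splitting a range into prefix sub-ranges at several precisions, not Lucene's implementation. The names TrieRangeSketch, split and countPrefixTerms are invented for illustration, the precision step of 4 is an assumption, and the sketch only handles non-negative bounds.

```java
import java.util.ArrayList;
import java.util.List;

public class TrieRangeSketch {

    /** A run of prefix terms: values min..max with the low `shift` bits dropped. */
    static final class SubRange {
        final long min;
        final long max;
        final int shift;

        SubRange(long min, long max, int shift) {
            this.min = min;
            this.max = max;
            this.shift = shift;
        }

        /** Number of distinct prefix terms this sub-range can contain. */
        long termCount() {
            return ((max - min) >>> shift) + 1;
        }
    }

    /** Splits [minBound, maxBound] into sub-ranges, using coarser prefixes in the middle. */
    static List<SubRange> split(int precisionStep, long minBound, long maxBound) {
        if (precisionStep < 1 || minBound < 0 || minBound > maxBound) {
            throw new IllegalArgumentException("sketch expects 0 <= min <= max and step >= 1");
        }
        List<SubRange> out = new ArrayList<>();
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            boolean hasLower = (minBound & mask) != 0L;   // ragged edge at the bottom
            boolean hasUpper = (maxBound & mask) != mask; // ragged edge at the top
            long nextMin = (hasLower ? minBound + diff : minBound) & ~mask;
            long nextMax = (hasUpper ? maxBound - diff : maxBound) & ~mask;
            boolean wrapped = nextMin < minBound || nextMax > maxBound;
            if (shift + precisionStep >= 64 || nextMin > nextMax || wrapped) {
                // No coarser level fits inside the range: emit what is left and stop.
                out.add(new SubRange(minBound, maxBound, shift));
                break;
            }
            if (hasLower) {
                out.add(new SubRange(minBound, minBound | mask, shift));
            }
            if (hasUpper) {
                out.add(new SubRange(maxBound & ~mask, maxBound, shift));
            }
            minBound = nextMin;
            maxBound = nextMax;
        }
        return out;
    }

    /** Upper bound on the number of prefix terms needed to cover the range. */
    static long countPrefixTerms(int precisionStep, long minBound, long maxBound) {
        long total = 0;
        for (SubRange r : split(precisionStep, minBound, maxBound)) {
            total += r.termCount();
        }
        return total;
    }

    public static void main(String[] args) {
        long min = 3, max = 1000;
        System.out.println("full-precision values in range: " + (max - min + 1));
        System.out.println("prefix terms in this model:     " + countPrefixTerms(4, min, max));
    }
}
```

In this model the range 3..1000 with a step of 4 is covered by at most 53 prefix terms, whereas a scan over full-precision terms may have to visit up to 998 distinct values. A real index only holds terms for values that actually occur, so both numbers are upper bounds.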