Posted to java-user@lucene.apache.org by "Hasenberger, Josef" <Jo...@zetcom.com> on 2016/03/23 10:24:51 UTC

What is the proper replacement for Filters working on DocValue fields?

Hello,

I am migrating a rather large application from Lucene 4.10 to Lucene 5.5.0.
Since Filters are deprecated in Lucene 5, I am looking for an efficient replacement in our code.

We use many Filters that calculate the DocIdSet by doing a lookup of numeric DocValues in some collection.
Everything is based on "long" types and results could be large.
The pseudo code in our Filter class looks like this:

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        AtomicReader reader = context.reader();
        OpenBitSet docSet = new OpenBitSet(reader.maxDoc());
        NumericDocValues docValues = reader.getNumericDocValues(filterKeyName);
        if (docValues == null) { // field has no doc values in this segment
            return null;
        }
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (acceptDocs != null && !acceptDocs.get(doc)) {
                continue; // skip deleted or already filtered-out docs
            }
            long value = docValues.get(doc); // doc value for the current doc
            if (isMatch(value)) {            // check value against some condition
                docSet.set(doc);             // set bit for matching doc
            }
        }
        return docSet;
    }


What would be the proper and efficient replacement for such filtering?

Should I convert my matching value set into a TermsQuery and wrap it with a ConstantScoreQuery? A rough sketch of what I mean follows the list below.
I could do this, but then I am concerned about:

*         Efficiency:
The set of values matched by the isMatch() method above could be very large. I would need to create a large collection of Terms rather than the memory-efficient DocIdSet.


*         More efficiency:
From my current understanding, I would need to create a Term from the String representation of my long value. Isn't that inefficient again?

I would really appreciate any recommendations on this.

Thanks a lot and best regards,
Josef


Re: What is the proper replacement for Filters working on DocValue fields?

Posted by Sheng <sh...@gmail.com>.
One possible workaround I can think of is to use a CustomScoreQuery to do a
posteriori scoring: let documents that do not match your criteria get a score
of 0, and use a PositiveScoresOnlyCollector to harvest the search results. The
problem with using CustomScoreQuery is that FieldCache is deprecated too, but
you should be able to use UninvertingReader instead.
