Posted to java-user@lucene.apache.org by Grdan Eenc <er...@googlemail.com> on 2018/03/13 08:58:11 UTC

Payload TFIDF Similarity in Lucene 7.1.0

Hej there,

I want to extend the TFIDFSimilarity class so that the term frequency is
ignored and the value in the payload is used instead. To do that, I basically
do this:

    @Override
    public float tf(float freq) {
        return 1f;
    }

    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload != null) {
            return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
        } else {
            return 1f;
        }
    }
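
For readers following along, here is a minimal sketch of the float round trip
that scorePayload relies on. It is a plain-Java stand-in, not Lucene's
PayloadHelper itself: the class name FloatPayloadSketch and its methods are
hypothetical, and the 4-byte big-endian layout is an assumption about how the
payload was written at index time.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the float payload round trip; not Lucene's PayloadHelper. */
public class FloatPayloadSketch {

    /** Encodes a float as 4 big-endian bytes (the payload layout assumed here). */
    public static byte[] encodeFloat(float value) {
        return ByteBuffer.allocate(4).putFloat(value).array();
    }

    /** Decodes a float from the 4 big-endian bytes starting at offset. */
    public static float decodeFloat(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).getFloat();
    }

    /** Mirrors the null check in scorePayload: a missing payload yields a neutral factor of 1. */
    public static float payloadFactor(byte[] bytes, int offset) {
        return bytes == null ? 1f : decodeFloat(bytes, offset);
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(0.75f);
        System.out.println(payloadFactor(payload, 0)); // prints 0.75
        System.out.println(payloadFactor(null, 0));    // prints 1.0
    }
}
```

If the bytes written at index time use a different layout, the decoded value
will be wrong even when scorePayload is called, so it is worth checking both
sides.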

Complete class can be found here:

https://gist.github.com/nadre/66be2a2a32214f2c5ec1ec1f6edcef08

Unfortunately, scorePayload never gets called and I end up with the wrong
scores. I know that scorePayload is deprecated in Lucene 7.2.1, but it should
still work in 7.1.0, or am I missing something?

I implemented the same thing by directly extending the basic Similarity
class and iterating through doc terms using the LeafReaderContext, based on
the code in this repo:

https://github.com/sdauletau/elasticsearch-position-similarity

This works but is horribly slow, which is why I would prefer the first approach.

Any idea why scorePayload doesn't get called? I really couldn't find any
resources on the net.

Best, Erdan.

Re: Payload TFIDF Similarity in Lucene 7.1.0

Posted by Michael Sokolov <ms...@gmail.com>.
Yes that (LUCENE-7854) was what I was referring to, and you are right that
it stores values as integers. This doesn't necessarily have to be a
blocker; you could scale your values by some factor, I guess.
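
A rough sketch of the scaling idea, in plain Java with no Lucene dependency.
The class TermFreqScaling, its method names, and the scale factor of 1000 are
all hypothetical; the "term|freq" token shape assumes the filter's default
'|' delimiter, so check that against your analyzer configuration.

```java
/** Hypothetical helper: maps float weights onto integer term frequencies. */
public class TermFreqScaling {

    /** Scales a weight to an integer frequency, clamped to at least 1 since a term frequency must be positive. */
    public static int toTermFreq(float weight, int scale) {
        return Math.max(1, Math.round(weight * scale));
    }

    /** Recovers the approximate weight from a stored frequency, e.g. inside tf(float freq). */
    public static float toWeight(float freq, int scale) {
        return freq / scale;
    }

    /** Builds a "term|freq" token (delimiter format assumed, see above). */
    public static String toToken(String term, float weight, int scale) {
        return term + "|" + toTermFreq(weight, scale);
    }

    public static void main(String[] args) {
        int scale = 1000;
        System.out.println(toToken("lucene", 0.75f, scale)); // prints lucene|750
        System.out.println(toWeight(750f, scale));           // prints 0.75
    }
}
```

The scale factor bounds the precision: with 1000, weights are kept to three
decimal places, and anything that rounds to 0 is clamped up to 1.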


Re: Payload TFIDF Similarity in Lucene 7.1.0

Posted by Erdan Genc <er...@googlemail.com>.
@Erik: I didn't know that. How can I figure out which query types support
payload scoring? The class I described is wrapped in an Elasticsearch
plugin, so I don't have full control over this. Currently I'm using
SpanTermQuery; maybe another available query type will do, so that I don't
need to implement a custom query parser as well. Thank you!

@Michael: This was my first thought as well, but I couldn't find any
resources when I first searched for it. I just discovered LUCENE-7854
<https://issues.apache.org/jira/browse/LUCENE-7854>, the
DelimitedTermFrequencyTokenFilter, but it can't handle floating-point values,
right? Thanks!


Re: Payload TFIDF Similarity in Lucene 7.1.0

Posted by Michael Sokolov <ms...@gmail.com>.
Also, if you are no longer using the term frequency at all, you might
consider storing your score (the one you are currently writing into payloads)
in place of the term frequency.


Re: Payload TFIDF Similarity in Lucene 7.1.0

Posted by Erik Hatcher <er...@gmail.com>.
Payloads are only scored by certain query types. What query are you executing?


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org