Posted to dev@lucene.apache.org by Romaric Pighetti <ro...@francelabs.com> on 2019/05/20 10:04:55 UTC

Distinct terms within a document for new Similarity class

Hi,

I am currently implementing a new similarity class in Lucene which is 
based on a language model with absolute discount.
I am basing my work on what has already been done in 
LMDirichletSimilarity and LMJelinekMercerSimilarity, which are really close.
However, to finish my implementation I need the number of unique 
terms present in the document, and this information seems to be 
unavailable natively from within the score function.

The computeNorm function in the Similarity class seems to be 
the right place to compute (or read) and store this statistic, but I am 
not sure.
So I am reaching out to ask whether I am on the right track, and whether 
you have any advice on how I could access this statistic from the 
computeNorm function, if that is possible.

I would like the implementation to be as clean as possible with regard 
to Lucene's code standards, so that I can submit it for integration 
once it is done.

Thanks for your help,
Regards.

-- 
Romaric Pighetti
R&D - FranceLabs


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Distinct terms within a document for new Similarity class

Posted by Romaric Pighetti <ro...@francelabs.com>.

Yes, I meant the frequency of the term inside the document, sorry.

That is what I was afraid of.
Thank you for your help and advice. I will look into implementing it as 
a query, because I need to extract that value for other processes and 
doing it in Solr / Lucene is more convenient for me.

Regards,
Romaric

On 22/05/2019 at 09:07, Adrien Grand wrote:
> Did you mean termFreq rather than docFreq?
>
> I'm afraid that this scoring function can't be implemented as a
> Similarity given Lucene's new requirement that scores must
>   - be non-negative
>   - not increase when the norm increases.
>
> We had to remove a number of DFR similarities that we used to support
> for this reason, and hacked a couple other similarities to make sure
> their score could never be negative. This one seems to be another one
> that we cannot support unfortunately.
>
> If you really need this, there are workarounds, such as implementing
> it via a query rather than a similarity, but it's a bit more tedious
> and you'd need to populate the unique term count manually in the index
> via a docvalue field.
>
-- 
Romaric Pighetti
R&D - FranceLabs



Re: Distinct terms within a document for new Similarity class

Posted by Adrien Grand <jp...@gmail.com>.

Did you mean termFreq rather than docFreq?

I'm afraid that this scoring function can't be implemented as a
Similarity given Lucene's new requirement that scores must
 - be non-negative
 - not increase when the norm increases.

We had to remove a number of DFR similarities that we used to support
for this reason, and hacked a couple other similarities to make sure
their score could never be negative. This one seems to be another one
that we cannot support unfortunately.
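
The two requirements above can be checked numerically against the formula quoted in this message. The sketch below is only an illustration: the class name, the method and the parameter values are assumptions chosen for the example, and the frequency argument is the within-document term frequency. With these values the score comes out negative for a term that is common in the corpus, and it grows when the unique term count grows while the document length stays fixed, which would be a problem if the norm encoded the unique term count.

```java
/** Hypothetical helper for checking the quoted formula; not part of Lucene. */
final class SimilarityConstraintCheck {
    private SimilarityConstraintCheck() {}

    /** log(1 + max(freq - delta, 0) / (delta * d_u * p)) + log(delta * d_u / |d|) */
    static double score(double freq, long uniqueTerms, long docLength,
                        double collectionProb, double delta) {
        double discounted = Math.max(freq - delta, 0.0);
        double backoff = delta * uniqueTerms * collectionProb;
        return Math.log(1.0 + discounted / backoff)
                + Math.log(delta * uniqueTerms / docLength);
    }

    public static void main(String[] args) {
        // A term that is very common in the corpus and occurs once in the document.
        double common = score(1, 50, 100, 0.5, 0.7);
        System.out.println("common term score: " + common);

        // Same term and document length, different unique term counts.
        double fewUnique = score(3, 20, 100, 0.001, 0.7);
        double manyUnique = score(3, 80, 100, 0.001, 0.7);
        System.out.println("d_u=20: " + fewUnique + ", d_u=80: " + manyUnique);
    }
}
```

The second term, log(delta * d_u / |d|), is never positive when delta is below 1 and d_u is at most |d|, so the total can only be non-negative when the first term is large enough to compensate.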

If you really need this, there are workarounds, such as implementing
it via a query rather than a similarity, but it's a bit more tedious
and you'd need to populate the unique term count manually in the index
via a docvalue field.
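
As a rough sketch of what "populating the unique term count manually" could involve on the application side, the count can be computed from the analyzed tokens before the document is indexed. The class and method names below are hypothetical, and the sketch assumes the tokens were produced by the same analysis chain as the indexed field. Storing the result in a doc value field and reading it back from a query is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical indexing-side helper; not part of Lucene. */
final class UniqueTermCounter {
    private UniqueTermCounter() {}

    /** Returns the number of distinct tokens in an already analyzed field. */
    static int uniqueTermCount(List<String> tokens) {
        Set<String> distinct = new HashSet<>(tokens);
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        // Document length is tokens.size(); the unique count ignores repeats.
        System.out.println(uniqueTermCount(tokens) + " unique of " + tokens.size());
    }
}
```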

On Wed, May 22, 2019 at 8:28 AM Romaric Pighetti
<ro...@francelabs.com> wrote:
>
> Hi Adrien,
>
> I thought about merging the two into one value that I could use in the scoring function but failed to find a way to do so.
> The scoring function is:
>
> log(1+(max(docFreq-delta,0)) / (delta * d_u * p(w|c) ) ) + log( delta * d_u / |d|)
>
> (also included as image below)
>
> Where:
> d_u is the unique term count
> |d| is the document length.
> p(w|c) is a language model computed on the corpus that gives the probability that the corpus generates the word w
> delta is a parameter (whose value is between 0 and 1).
>
> My problem here is that I need the unique term count once alone and once divided by the document length, so I can't really craft a single value combining the two that i can use in the scoring function.
>
> I am trying to work around the problem by thinking about how i can change the function while keeping the idea of what the different terms express but couldn't find a satisfying solution yet.
>
> At least I know i should not search how to create a doc value anymore ! :)
>
> Thanks a lot for your help !
> Romaric
>



-- 
Adrien


Re: Distinct terms within a document for new Similarity class

Posted by Romaric Pighetti <ro...@francelabs.com>.

Hi Adrien,

I thought about merging the two into one value that I could use in the 
scoring function but failed to find a way to do so.
The scoring function is:

log(1+(max(docFreq-delta,0)) / (delta * d_u * p(w|c) ) ) + log( delta * 
d_u / |d|)

(also included as image below)

\log\left(1+\frac{\max\left(\mathrm{docFreq}-\delta,0\right)}{\delta\cdot d_{u}\cdot p\left(w\mid c\right)}\right)+\log\left(\delta\cdot\frac{d_{u}}{\left|d\right|}\right)

Where:
d_u is the unique term count
|d| is the document length.
p(w|c) is a language model computed on the corpus that gives the 
probability that the corpus generates the word w
delta is a parameter (whose value is between 0 and 1).
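
For readers who prefer code to notation, the formula above can be written as a small standalone Java method. This is a minimal sketch under stated assumptions: the class and parameter names are hypothetical, nothing here is a Lucene API, and the frequency argument stands for the value written docFreq in the formula.

```java
/** Hypothetical standalone version of the scoring formula; not part of Lucene. */
final class AbsoluteDiscountScore {
    private AbsoluteDiscountScore() {}

    /**
     * @param freq           the value written docFreq in the formula
     * @param uniqueTerms    d_u, the unique term count of the document
     * @param docLength      |d|, the document length
     * @param collectionProb p(w|c), assumed to be in (0, 1]
     * @param delta          the discount parameter, assumed to be in (0, 1)
     */
    static double score(double freq, long uniqueTerms, long docLength,
                        double collectionProb, double delta) {
        double discounted = Math.max(freq - delta, 0.0);
        double backoff = delta * uniqueTerms * collectionProb;
        return Math.log(1.0 + discounted / backoff)
                + Math.log(delta * uniqueTerms / docLength);
    }

    public static void main(String[] args) {
        // Example values chosen only for illustration.
        System.out.println(score(3, 50, 100, 0.001, 0.7));
    }
}
```

Written this way, it is visible that d_u enters once on its own (in the denominator of the first term) and once divided by |d| (in the second term).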

My problem here is that I need the unique term count once alone and once 
divided by the document length, so I can't really craft a single value 
combining the two that I can use in the scoring function.

I am trying to work around the problem by thinking about how I can 
change the function while keeping the idea of what the different terms 
express, but I couldn't find a satisfying solution yet.

At least I know I should no longer look for a way to create a doc value! :)

Thanks a lot for your help !
Romaric

On 21/05/2019 at 22:23, Adrien Grand wrote:
> Hi Romaric,
>
> Indeed similarities are not expected to create doc value fields, they
> should only populate norms. The similarity API has been changed in 8.0
> and similarities no longer have access to the reader context, they are
> now expected to work with only term frequency and a length
> normalization factor as per-document contributions to the score.
>
> One challenge is that Lucene now mandates that scores do not increase
> when the norm value increases, which makes it hard to record both the
> unique term count and the total term count in the norm. What does the
> scoring function look like, maybe there are ways that we could record
> both the unique term count and total term count in a single long
> depending on how the scoring formula merges them (maybe not!).
>
-- 
Romaric Pighetti
R&D - FranceLabs

Re: Distinct terms within a document for new Similarity class

Posted by Adrien Grand <jp...@gmail.com>.

Hi Romaric,

Indeed, similarities are not expected to create doc value fields; they
should only populate norms. The similarity API was changed in 8.0
and similarities no longer have access to the reader context. They are
now expected to work with only term frequency and a length
normalization factor as per-document contributions to the score.

One challenge is that Lucene now mandates that scores do not increase
when the norm value increases, which makes it hard to record both the
unique term count and the total term count in the norm. What does the
scoring function look like? Maybe there are ways that we could record
both the unique term count and total term count in a single long,
depending on how the scoring formula merges them (maybe not!).
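
Mechanically, putting two counts into a single long is simple bit packing. The sketch below is only an illustration of that mechanical part, with hypothetical names, and it assumes both counts are non-negative ints. It puts the total term count in the high 32 bits so that the packed value grows with document length. Whether a score computed from such a value can stay non-increasing as the norm increases depends on the scoring formula, which is the open question here.

```java
/** Hypothetical packing of two per-document counts into one long; not part of Lucene. */
final class PackedCounts {
    private PackedCounts() {}

    /** Total term count in the high 32 bits, unique term count in the low 32 bits. */
    static long pack(int uniqueTerms, int totalTerms) {
        return ((long) totalTerms << 32) | (uniqueTerms & 0xFFFFFFFFL);
    }

    static int totalTerms(long packed) {
        return (int) (packed >>> 32);
    }

    static int uniqueTerms(long packed) {
        return (int) packed;
    }

    public static void main(String[] args) {
        long packed = pack(50, 100);
        System.out.println(uniqueTerms(packed) + " unique, " + totalTerms(packed) + " total");
    }
}
```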

On Tue, May 21, 2019 at 3:28 PM Romaric Pighetti
<ro...@francelabs.com> wrote:
>
> Hi,
>
> Thanks Adrien for the quick and accurate answer.
> Digging into the implementation I saw that the document length is
> already stored there and as I need both the unique term count and the
> length, I can't just replace one with the other.
> The Similarity class documentation states that it is possible to store
> additional values using NumericDocValuesField that could be accessed at
> query time using a LeafReader.
>
>  From my understanding, using the LeafReaderContext when building the
> SimScorer should allow me to get access to the NumericDocValuesField.
>
> The problem is I don't get how to create and store a new
> NumericDocValuesField from the Similarity. My guess is that it should
> happen within the computeNorm function again as it is the only function
> called at indexing time. However I am unable to understand how to create
> and store this information from that function.
>
> If you have any advice that would be really helpful.
>
> Thanks.
> Romaric
>


-- 
Adrien


Re: Distinct terms within a document for new Similarity class

Posted by Romaric Pighetti <ro...@francelabs.com>.

Hi,

Thanks Adrien for the quick and accurate answer.
Digging into the implementation I saw that the document length is 
already stored there and as I need both the unique term count and the 
length, I can't just replace one with the other.
The Similarity class documentation states that it is possible to store 
additional values using NumericDocValuesField that could be accessed at 
query time using a LeafReader.

From my understanding, using the LeafReaderContext when building the 
SimScorer should allow me to get access to the NumericDocValuesField.

The problem is I don't get how to create and store a new 
NumericDocValuesField from the Similarity. My guess is that it should 
happen within the computeNorm function again as it is the only function 
called at indexing time. However I am unable to understand how to create 
and store this information from that function.

If you have any advice that would be really helpful.

Thanks.
Romaric

On 20/05/2019 at 12:16, Adrien Grand wrote:
> Hi Romaric,
>
> You are right, computeNorm is the right place to compute and record
> the number of unique terms of a document. Your computeNorm function
> would look something like this:
>
> @Override
> public final long computeNorm(FieldInvertState state) {
>    return SmallFloat.intToByte4(state.getUniqueTermCount());
> }
>
> And then in your scorer you could convert the norm back to the unique
> term count by doing SmallFloat.byte4ToInt on it. The SmallFloat
> methods are useful to encode this count on one byte, which trades some
> accuracy but is usually the right trade-off.
>
-- 
Romaric Pighetti
R&D - FranceLabs



Re: Distinct terms within a document for new Similarity class

Posted by Adrien Grand <jp...@gmail.com>.

Hi Romaric,

You are right, computeNorm is the right place to compute and record
the number of unique terms of a document. Your computeNorm function
would look something like this:

@Override
public final long computeNorm(FieldInvertState state) {
  return SmallFloat.intToByte4(state.getUniqueTermCount());
}

And then in your scorer you could convert the norm back to the unique
term count by doing SmallFloat.byte4ToInt on it. The SmallFloat
methods are useful to encode this count on one byte, which trades some
accuracy but is usually the right trade-off.
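
To make the accuracy trade-off concrete, here is a deliberately simple one-byte encoding written in plain Java. It is an assumption-laden illustration and not the encoding SmallFloat uses: it only keeps the bit length of the count, so every value decodes to the largest power of two that does not exceed it. The class and method names are hypothetical.

```java
/** Illustrative lossy one-byte encoding of a count; NOT SmallFloat's scheme. */
final class OneByteCount {
    private OneByteCount() {}

    /** Encodes a non-negative count as its bit length (0 for 0, at most 31). */
    static byte encode(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        return (byte) (32 - Integer.numberOfLeadingZeros(count));
    }

    /** Decodes to the smallest count that has the encoded bit length. */
    static int decode(byte code) {
        return code == 0 ? 0 : 1 << (code - 1);
    }

    public static void main(String[] args) {
        int original = 1000;
        int roundTripped = decode(encode(original));
        System.out.println(original + " -> " + roundTripped);
    }
}
```

Small counts survive the round trip exactly while larger ones lose precision, and the encoding preserves ordering, which is the general shape of the trade-off described above.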

On Mon, May 20, 2019 at 12:05 PM Romaric Pighetti
<ro...@francelabs.com> wrote:
>
> Hi,
>
> I am currently implementing a new similarity class into lucene which is
> based on a language model with absolute discount.
> I am basing my work on the work already done in the
> LMDirichletSimilarity and LMJelinekMercerSimilarity which are really close.
> However to end my implementation I need to get the number of unique
> terms present in the document, and this information seems to be
> unavailable natively from within the score function.
>
> The computeNorm function which is in the Similarity class seems to be
> the right place to compute (or read) and store this statistic but I am
> not sure.
> So I am reaching you to know if I am on the right track and if you have
> any advice on how I could access this statistic from the computeNorm
> function if possible ?
>
> I would like the implementation to be as clean as possible with regards
> to Lucene's code expectation to be able to submit it for integration
> once it is done.
>
> Thanks for your help,
> Regards.
>


-- 
Adrien
