
Posted to dev@lucene.apache.org by Koji Sekiguchi <ko...@rondhuit.com> on 2017/08/22 08:30:14 UTC

is omitNorms still valid?

Hi,

After LUCENE-6819 was committed, I thought omitNorms had been removed, but it seems it is still there.

Deprecate index-time boosts?
https://issues.apache.org/jira/browse/LUCENE-6819

Is it still valid or is there a ticket to delete it?

Thanks,

Koji

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: is omitNorms still valid?

Posted by Koji Sekiguchi <ko...@rondhuit.com>.

Hi Adrien,

Thank you for the great explanation!

Koji




Re: is omitNorms still valid?

Posted by Adrien Grand <jp...@gmail.com>.

Yes, LUCENE-7730 is the issue.

On Tue, Aug 22, 2017 at 12:00, Koji Sekiguchi <ko...@rondhuit.com> wrote:

> I thought LUCENE-6819 had removed the single-byte float as well, because
> when you described the background of the ticket, you mentioned its poor
> precision. So I assumed (from the context) that the ticket had solved it.
>
> So the field length is still stored in a single byte, and the precision of
> that float is still not good? And the point of LUCENE-6819 is that we can
> set a more precise boost value if we want, because it no longer depends on
> the poor-precision single byte used for the field length?
>

We still use a single byte to store the norm. The difference is that we
used to store ${index-boost} * ${length-norm}. Because index boosts could
take any positive value, we could not make any assumptions about this
quantity that would have helped make storage more efficient. More
concretely, length-norm was always between 0 and 1, so if you did not use
index boosts, like most Lucene users, the final normalization factor would
be between 0 and 1 as well. Yet only 125 of the 256 byte values in the
SmallFloat encoding that we used represent values between 0 and 1. So this
feature was sacrificing accuracy of the length normalization factor for the
sake of a feature that was used only by a minority and could easily be
replaced by a doc-value field.
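
To make the 125-of-256 figure concrete, here is a minimal, self-contained
sketch of a one-byte float encoding of the kind described above. It is an
approximation written for illustration, not Lucene's source: the class and
method names are hypothetical, the bit layout (3 low bits, zero exponent 15)
is an assumption about the old SmallFloat scheme, and the length norm is
assumed to be 1/sqrt(length) with no index-time boost.

```java
public class OldNormSketch {
    // Assumed zero point of the byte encoding: 3 low bits, zero exponent 15.
    private static final int FZERO = (63 - 15) << 3;

    /** Lossy float-to-byte encoding; it truncates rather than rounds. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= FZERO) {
            // Zero and negatives map to 0; positive underflow maps to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1; // overflow maps to the largest code
        }
        return (byte) (small - FZERO);
    }

    /** Inverse of encode, returning the lower bound of the encoded bucket. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Assumed boost-free length norm: 1 / sqrt(length). */
    static byte encodeLength(int length) {
        return encode((float) (1.0 / Math.sqrt(length)));
    }

    /** Counts how many of the 256 codes decode to a value in [0, 1]. */
    static int countCodesInUnitInterval() {
        int n = 0;
        for (int b = 0; b < 256; b++) {
            float v = decode((byte) b);
            if (v >= 0f && v <= 1f) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("codes in [0, 1]: " + countCodesInUnitInterval());
        for (int len = 1; len <= 5; len++) {
            System.out.println("length " + len + " -> " + decode(encodeLength(len)));
        }
    }
}
```

Under these assumptions, lengths 3 and 4 both decode to 0.5, so the two
lengths cannot be told apart once the norm has been written.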

We actually went a bit further and started storing the document length
rather than the precomputed length-normalization factor in the norms field.
It is easier to reason about, since we know that all values are positive
integers and that we want better accuracy for lower values. This allowed us
to encode lengths accurately up to 40, whereas the previous encoding treated
lengths 3 and 4 as the same, for instance. Beyond that, accuracy degrades
progressively, as you can see on the LUCENE-7730 ticket.
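
The idea of storing the length itself in one byte can be sketched as
follows. This is an illustrative reconstruction, not the code from
LUCENE-7730: the class and method names are hypothetical, and the layout
(a 4-bit significand with an implicit leading bit, plus a block of small
values stored verbatim) is an assumption chosen to reproduce the behaviour
described above.

```java
public class LengthNormSketch {
    /**
     * Encodes a non-negative value with 4 significant bits: values below 8
     * are stored as-is, larger values keep their top 4 bits (leading bit
     * implicit) together with a shift count.
     */
    static int encodeMagnitude(long i) {
        if (i < 0) {
            throw new IllegalArgumentException("negative value: " + i);
        }
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i; // small values are stored exactly
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift); // keep the 4 most significant bits
        encoded &= 0x07;                   // drop the implicit leading bit
        encoded |= (shift + 1) << 3;       // shift 0 is reserved for small values
        return encoded;
    }

    // Largest code needed for any int, and the codes left over in a byte.
    static final int MAX_CODE = encodeMagnitude(Integer.MAX_VALUE);
    static final int NUM_FREE_VALUES = 255 - MAX_CODE;

    /** Encodes a field length into one byte, spending the spare codes on small lengths. */
    static byte encodeLength(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (length < NUM_FREE_VALUES) {
            return (byte) length;
        }
        return (byte) (NUM_FREE_VALUES + encodeMagnitude(length - NUM_FREE_VALUES));
    }

    /** Smallest length n >= 1 whose code equals the code of n + 1. */
    static int firstCollision() {
        int n = 1;
        while (encodeLength(n) != encodeLength(n + 1)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("spare codes: " + NUM_FREE_VALUES);
        System.out.println("first collision at length: " + firstCollision());
    }
}
```

With this layout every length from 1 to 40 gets its own code, and 41 is the
first length to share a code with its predecessor; from there on the buckets
grow progressively wider.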

Re: is omitNorms still valid?

Posted by Koji Sekiguchi <ko...@rondhuit.com>.

Did LUCENE-7730 solve this?




-- 
Latest blog post: Experience learning to rank with Solr
http://lucene.jugem.jp/?eid=484
==========================================
RONDHUIT Co., Ltd.
Koji Sekiguchi
1-18-6 Nishi-Shimbashi, Minato-ku, Tokyo 105-0003
Cross Office Uchisaiwaicho, 11th floor
TEL 03-5288-5927
FAX 03-5288-5928
http://www.rondhuit.com/
Blog: http://lucene.jugem.jp/


Re: is omitNorms still valid?

Posted by Koji Sekiguchi <ko...@rondhuit.com>.

Hi Adrien,

Thank you for your explanation!

I thought LUCENE-6819 had removed the single-byte float as well, because when you described the
background of the ticket, you mentioned its poor precision. So I assumed (from the context) that
the ticket had solved it.

So the field length is still stored in a single byte, and the precision of that float is still not
good? And the point of LUCENE-6819 is that we can set a more precise boost value if we want,
because it no longer depends on the poor-precision single byte used for the field length?

Thanks,

Koji




Re: is omitNorms still valid?

Posted by Adrien Grand <jp...@gmail.com>.

Hi Koji,

omitNorms is still valid. Norms used to store a scoring factor that depended
on both the field length and an index-time boost. The only difference that
LUCENE-6819 made is that norms now store only a number that depends on the
field length, since index-time boosts have been removed.
