Posted to java-user@lucene.apache.org by wangdong <hr...@gmail.com> on 2015/03/04 04:48:54 UTC

understanding the norm encode and decode

I read the following in the scoring section of the Lucene documentation:

Encoding and decoding of the resulted float norm in a single byte are
done by the static methods of the class Similarity: encodeNorm()
<http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#encodeNorm%28float%29> and decodeNorm()
<http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#decodeNorm%28byte%29>.
Due to loss of precision, it is not guaranteed that decode(encode(x)) =
x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm
is brought into the score of document as *norm(t, d)*, as shown by the
formula in Similarity
<http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html>.

I cannot understand the example decode(encode(0.89)) = 0.75.
How do I get 0.75 from the left-hand side?

Can anyone help me?
Thanks in advance!

andrew

Re: understanding the norm encode and decode

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.


Hi András,

That's a good catch! Do you want to correct that javadoc mistake and create a patch?
https://wiki.apache.org/lucene-java/HowToContribute

If you don't have a JIRA account, anyone can create one.
https://issues.apache.org/jira/browse/lucene

Ahmet


On Thursday, March 5, 2015 11:15 AM, András Péteri <ap...@b2international.com> wrote:
Sorry, I also got it wrong in the previous message. :) It goes 0.89f
-> 123 -> 0.875f.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: understanding the norm encode and decode

Posted by wangdong <hr...@gmail.com>.

Thank you for your detailed answer. I get it now.
Since the document I read is official material, I was not sure whether it was correct,
so I asked the question.

Thank you again!

andrew

On 2015/3/5 17:14, András Péteri wrote:
> Sorry, I also got it wrong in the previous message. :) It goes 0.89f
> -> 123 -> 0.875f.

Re: understanding the norm encode and decode

Posted by András Péteri <ap...@b2international.com>.

Sorry, I also got it wrong in the previous message. :) It goes 0.89f
-> 123 -> 0.875f.
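
The arithmetic behind 0.89f -> 123 -> 0.875f can be reproduced with a small
standalone sketch. It assumes the bit manipulation follows the SmallFloat
methods linked as [1] and [2] in this thread; the names NormRoundTrip, encode
and decode are made up for illustration and are not Lucene APIs.

```java
// Local re-implementation for illustration only; not the Lucene class itself.
class NormRoundTrip {
    // Offset that re-bases the float exponent (assumed from the linked source).
    static final int FZERO = (63 - 15) << 3; // 384

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);                 // discard the low 21 bits
        if (small <= FZERO) {
            // zero and negative values map to 0, tiny positive values to 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return -1;                                // overflow maps to the largest code
        }
        return (byte) (small - FZERO);
    }

    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);            // move the byte back into place
        bits += (63 - 15) << 24;                      // restore the exponent offset
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte b = encode(0.89f);
        System.out.println(b + " -> " + decode(b));   // prints "123 -> 0.875"
    }
}
```

The shift by 21 throws away the low 21 bits of the float, which is where
0.89f loses everything above 0.875.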

On Thu, Mar 5, 2015 at 10:08 AM, András Péteri
<ap...@b2international.com> wrote:
> Hi Andrew,
>
> If you are using Lucene 3.6.1, you can take a look at the method which
> creates a single byte value out of the received float using bit
> manipulation at [1]. There is also a 256-element decoder table in
> Similarity, where each byte corresponds to a decoded float value
> computed by [2].
>
> The first method encodes 0.89f to byte 123. 123 is decoded to 0.85f
> via the second method, so it seems that the documentation is incorrect
> in this regard.
>
> [1] https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L75
> [2] https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L88

-- 
Péteri András


Re: understanding the norm encode and decode

Posted by András Péteri <ap...@b2international.com>.

Hi Andrew,

If you are using Lucene 3.6.1, you can take a look at the method which
creates a single byte value out of the received float using bit
manipulation at [1]. There is also a 256-element decoder table in
Similarity, where each byte corresponds to a decoded float value
computed by [2].

The first method encodes 0.89f to byte 123. 123 is decoded to 0.85f
via the second method, so it seems that the documentation is incorrect
in this regard.

[1] https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L75
[2] https://github.com/apache/lucene-solr/blob/lucene_solr_3_6_1/lucene/core/src/java/org/apache/lucene/util/SmallFloat.java#L88
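
The 256-element decoder table can be pictured roughly as follows. This is a
sketch under the assumption that each entry is filled by the decode step at
[2]; NormDecoderTable, TABLE and lookup are hypothetical names, not the fields
Similarity actually uses. In this sketch byte 123 comes out as 0.875f, which
agrees with the correction in the follow-up message rather than the 0.85f
written above.

```java
// Illustrative stand-in for a precomputed byte-to-float decoder table.
class NormDecoderTable {
    // One decoded float per possible byte value, filled once up front.
    static final float[] TABLE = new float[256];

    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = decode((byte) i);
        }
    }

    // Assumed decode step: shift the byte back into float position and
    // restore the exponent offset that encoding removed.
    static float decode(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // A byte is signed in Java, so mask it before using it as an index.
    static float lookup(byte b) {
        return TABLE[b & 0xff];
    }

    public static void main(String[] args) {
        // Show the table entries around the byte discussed in this thread.
        for (int code = 122; code <= 124; code++) {
            System.out.println(code + " -> " + lookup((byte) code));
        }
    }
}
```

With these assumptions the three neighbouring entries are 0.75, 0.875 and
1.0, so a lookup replaces the bit arithmetic at search time.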

On Thu, Mar 5, 2015 at 3:45 AM, wangdong <hr...@gmail.com> wrote:
> Thank you for the discussion.
>
> I am a junior user of Lucene, so I am not familiar with some of the deeper
> concepts you mentioned.
> My question is simple. I just want to know how the official documentation
> gets 0.75 from decode(encode(0.89)).
>
> Why not 0.875? (0.875 = 0.5 + 0.25 + 0.125)
>
> thanks
> andrew

-- 
András


Re: understanding the norm encode and decode

Posted by wangdong <hr...@gmail.com>.

Thank you for the discussion.

I am a junior user of Lucene, so I am not familiar with some of the deeper
concepts you mentioned.
My question is simple. I just want to know how the official documentation
gets 0.75 from decode(encode(0.89)).

Why not 0.875? (0.875 = 0.5 + 0.25 + 0.125)

thanks
andrew

On 2015/3/4 22:54, Adrien Grand wrote:
> Norms and doc values are indeed using the same API. However
> implementations differ a bit (eg. norms are stored in memory and use
> different compression schemes).
>
> The precision loss is up to the similarity. You could write a
> similarity impl which keeps full float precision, but scoring being
> fuzzy anyway this would multiply your memory needs for norms by 4
> while not really improving the quality of the scores of your
> documents. This precision loss is the right trade-off for most
> use-cases.

Re: understanding the norm encode and decode

Posted by Adrien Grand <jp...@gmail.com>.

Norms and doc values are indeed using the same API. However
implementations differ a bit (eg. norms are stored in memory and use
different compression schemes).

The precision loss is up to the similarity. You could write a
similarity impl which keeps full float precision, but scoring being
fuzzy anyway this would multiply your memory needs for norms by 4
while not really improving the quality of the scores of your
documents. This precision loss is the right trade-off for most
use-cases.
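
The factor of 4 is plain arithmetic: a float takes 4 bytes where the encoded
norm takes 1. A rough sketch, assuming one norm value per document for each
field that has norms; the document and field counts are invented for
illustration, and NormMemoryEstimate is not a Lucene class.

```java
// Back-of-the-envelope estimate of norm storage; all inputs are hypothetical.
class NormMemoryEstimate {
    // Bytes needed if every document stores one norm per field with norms.
    static long normBytes(long maxDoc, int fieldsWithNorms, int bytesPerNorm) {
        return maxDoc * fieldsWithNorms * bytesPerNorm;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;   // hypothetical index size
        int fields = 3;              // hypothetical number of fields with norms

        long asBytes = normBytes(maxDoc, fields, 1);   // single-byte norms
        long asFloats = normBytes(maxDoc, fields, 4);  // full 32-bit floats

        System.out.println("byte norms:  " + asBytes + " bytes");
        System.out.println("float norms: " + asFloats + " bytes");
        System.out.println("ratio:       " + (asFloats / asBytes));
    }
}
```

For these made-up numbers that is 30,000,000 bytes against 120,000,000 bytes.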

On Wed, Mar 4, 2015 at 3:04 PM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi Adrien,
>
> I read somewhere that norms are stored using docValues.
> In my understanding, docValues can store lossless float values.
> So the question is, why do several decode/encode methods still exist in similarity implementations?
> Intuitively, switching to docValues for norms should prevent the precision loss.
>
> Ahmet



-- 
Adrien


Re: understanding the norm encode and decode

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Adrien,

I read somewhere that norms are stored using docValues.
In my understanding, docValues can store lossless float values.
So the question is, why do several decode/encode methods still exist in similarity implementations?
Intuitively, switching to docValues for norms should prevent the precision loss.

Ahmet

On Wednesday, March 4, 2015 3:22 PM, Adrien Grand <jp...@gmail.com> wrote:
Hi,

Floats require 32 bits but norms are encoded on a single byte. So
there is a precision loss when encoding float values into a single
byte. In your example, 0.75 and 0.89 are sufficiently close to each
other so that they are encoded to the same byte.


Re: understanding the norm encode and decode

Posted by Adrien Grand <jp...@gmail.com>.

Hi,

Floats require 32 bits but norms are encoded on a single byte. So
there is a precision loss when encoding float values into a single
byte. In your example, 0.75 and 0.89 are sufficiently close to each
other so that they are encoded to the same byte.
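
To make the precision loss concrete, here is a sketch that pushes a few
nearby floats through a single-byte encoding. It assumes the encoder behaves
like the SmallFloat method referenced elsewhere in this thread for Lucene
3.6.1; NormBuckets and encode are illustrative names, and every input except
0.89f is invented.

```java
// Shows several distinct floats collapsing onto one byte code.
class NormBuckets {
    // Exponent offset assumed from the SmallFloat source cited in the thread.
    static final int FZERO = (63 - 15) << 3;

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);                 // only the top bits survive
        if (small <= FZERO) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative, or underflow
        }
        if (small >= FZERO + 0x100) {
            return -1;                                // overflow
        }
        return (byte) (small - FZERO);
    }

    public static void main(String[] args) {
        float[] inputs = {0.88f, 0.89f, 0.9f, 0.99f};
        for (float f : inputs) {
            System.out.println(f + " -> byte " + encode(f));
        }
    }
}
```

All four inputs land on byte 123 in this sketch, so after decoding they are
indistinguishable from one another.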

On Wed, Mar 4, 2015 at 4:48 AM, wangdong <hr...@gmail.com> wrote:
> I read the following in the scoring section of the Lucene documentation:
>
> Encoding and decoding of the resulted float norm in a single byte are done
> by the static methods of the class Similarity: encodeNorm()
> <http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#encodeNorm%28float%29> and decodeNorm()
> <http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html#decodeNorm%28byte%29>.
> Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
> e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is
> brought into the score of document as *norm(t, d)*, as shown by the formula
> in Similarity
> <http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html>.
>
> I cannot understand the example decode(encode(0.89)) = 0.75.
> How do I get 0.75 from the left-hand side?
>
> Can anyone help me?
> Thanks in advance!
>
> andrew



-- 
Adrien
