Posted to java-user@lucene.apache.org by Dwaipayan Roy <dw...@gmail.com> on 2016/07/21 15:06:33 UTC

Doc length normalization in Lucene LM

Hello,

In *SimilarityBase.java*, I can see that the length of the document is
getting normalized via the function *decodeNormValue()*, but I can't
understand how the normalization is done. Can you please help? Also, is
there any way to avoid this doc-length normalization and use the raw
doc-length (as in the LM-JM model of Zhai et al., SIGIR 2001)?

Thanks..

P.S. I am using Lucene 4.10.4

Re: Doc length normalization in Lucene LM

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

Yes, as you discovered, there is some precision loss during the encode/decode process.

Ahmet
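To make the precision loss concrete, here is a small self-contained sketch of the "byte315" scheme that Lucene's SmallFloat uses for norms (3 significant mantissa bits, exponent zero-point 15). The method names mirror SmallFloat.floatToByte315/byte315ToFloat, but this is an illustration, not the Lucene source:

```java
/** Illustrative sketch of Lucene's SmallFloat "byte315" norm encoding. */
public class SmallFloatSketch {

    /** Compress a float into 8 bits, keeping only 3 significant mantissa bits. */
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);            // drop the low mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative or underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                // overflow: clamp to max byte
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    /** Expand the 8-bit value back to a float (the lossy inverse). */
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        int docLength = 2355;                          // the length that was indexed
        float norm = (float) (1.0 / Math.sqrt(docLength));
        float decoded = byte315ToFloat(floatToByte315(norm));
        float decodedLength = 1.0f / (decoded * decoded);
        System.out.println(decodedLength);             // prints 2621.44
    }
}
```

With only 3 significant bits, every norm between two representable values truncates down, so whole bands of document lengths decode to the same value; 2355 falls in the band that decodes to 2621.44, which is exactly the figure seen in score().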


On Friday, July 22, 2016 1:59 PM, Dwaipayan Roy <dw...@gmail.com> wrote:
Thanks for your reply, but I still have some doubts.

From your answer, I understand that the document length is simply stored in
byte format to reduce memory consumption. But while debugging, I found that
the doc length passed to score() is 2621.44, whereas the actual doc length
is 2355.

I am confused. Please help.

On Fri, Jul 22, 2016 at 1:46 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> Hi Roy,
>
> It is about storing the document length in a single byte (to use less memory).
> Please edit the source code to bypass this encode/decode step:
>
> /**
>  * Encodes the document length in a lossless way
>  */
> @Override
> public long computeNorm(FieldInvertState state) {
>     return state.getLength() - state.getNumOverlap();
> }
>
> @Override
> public float score(int doc, float freq) {
>     // We have to supply something in case norms are omitted
>     return ModelBase.this.score(stats, freq,
>             norms == null ? 1L : norms.get(doc));
> }
>
> @Override
> public Explanation explain(int doc, Explanation freq) {
>     return ModelBase.this.explain(stats, doc, freq,
>             norms == null ? 1L : norms.get(doc));
> }



-- 
Dwaipayan Roy.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Doc length normalization in Lucene LM

Posted by Dwaipayan Roy <dw...@gmail.com>.
Thanks for your reply, but I still have some doubts.

From your answer, I understand that the document length is simply stored in
byte format to reduce memory consumption. But while debugging, I found that
the doc length passed to score() is 2621.44, whereas the actual doc length
is 2355.

I am confused. Please help.
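For what it's worth, the 2621.44 figure can be reproduced by hand, assuming the stock SimilarityBase encoding (norm = 1/sqrt(length), truncated to an 8-bit SmallFloat with 3 significant mantissa bits):

```latex
\frac{1}{\sqrt{2355}} \approx 0.020607 = 1.3188\ldots \times 2^{-6}
\;\xrightarrow{\text{truncate mantissa to 3 bits}}\;
1.25 \times 2^{-6} = \frac{5}{256} = 0.01953125,
\qquad
\text{decoded length} = \frac{1}{(5/256)^{2}} = \frac{65536}{25} = 2621.44 .
```

So 2621.44 is not an error but the lossy roundtrip of the true length 2355.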

On Fri, Jul 22, 2016 at 1:46 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> Hi Roy,
>
> It is about storing the document length in a single byte (to use less memory).
> Please edit the source code to bypass this encode/decode step:
>
> /**
>  * Encodes the document length in a lossless way
>  */
> @Override
> public long computeNorm(FieldInvertState state) {
>     return state.getLength() - state.getNumOverlap();
> }
>
> @Override
> public float score(int doc, float freq) {
>     // We have to supply something in case norms are omitted
>     return ModelBase.this.score(stats, freq,
>             norms == null ? 1L : norms.get(doc));
> }
>
> @Override
> public Explanation explain(int doc, Explanation freq) {
>     return ModelBase.this.explain(stats, doc, freq,
>             norms == null ? 1L : norms.get(doc));
> }



-- 
Dwaipayan Roy.

Re: Doc length normalization in Lucene LM

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Roy,

It is about storing the document length in a single byte (to use less memory).
Please edit the source code to bypass this encode/decode step:

/**
 * Encodes the document length in a lossless way
 */
@Override
public long computeNorm(FieldInvertState state) {
    return state.getLength() - state.getNumOverlap();
}

@Override
public float score(int doc, float freq) {
    // We have to supply something in case norms are omitted
    return ModelBase.this.score(stats, freq,
            norms == null ? 1L : norms.get(doc));
}

@Override
public Explanation explain(int doc, Explanation freq) {
    return ModelBase.this.explain(stats, doc, freq,
            norms == null ? 1L : norms.get(doc));
}



