Posted to java-user@lucene.apache.org by Ahmet Arslan <io...@yahoo.com.INVALID> on 2015/02/06 01:24:16 UTC

getting number of terms in a document/field

Hello Lucene Users,

I am traversing all documents that contain a given term with the following code:

Term term = new Term(field, word);
Bits bits = MultiFields.getLiveDocs(reader);
DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());

if (docsEnum != null) { // null when no document contains the term
    while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {

        array[docsEnum.freq()]++;

        // how to retrieve term count for this document?
        xxxxx(docsEnum.docID(), field);
    }
}

How can I get the per-document term count of a field for these documents using Lucene 4.10.3?

Is the above code OK for traversing the posting list of a term?

Thanks,
Ahmet

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: getting number of terms in a document/field

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi,

Sorry for my ignorance, but how do I obtain an AtomicReader from an IndexReader?

I figured out the code below, but it iterates over a list of atomic readers.

for (AtomicReaderContext context : reader.leaves()) {

    // norms are per-segment, so map the top-level docID into this leaf
    int leafDocID = docID - context.docBase;

    if (leafDocID >= 0 && leafDocID < context.reader().maxDoc()) {
        NumericDocValues docValues = context.reader().getNormValues(field);

        if (docValues != null)
            normValue = docValues.get(leafDocID);
    }
}

I implemented the custom similarity you advised by merging TFIDFSimilarity and DefaultSimilarity.
The computeNorm(FieldInvertState state) method is final in TFIDFSimilarity, so I couldn't simply extend it.
I was able to retrieve those long values from a single-segment index, but I didn't like this solution,
because I am experimenting with different similarity implementations.

It looks like there is no easy way to access 
FieldInvertState.length() and index this value into an independent NumericDocValues field (say, numTerms) other than through norms.


I think I will compute the field lengths myself.

Thanks,
Ahmet


On Friday, February 6, 2015 5:31 PM, Michael McCandless <lu...@mikemccandless.com> wrote:
On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi Michael,
>
> Thanks for the explanation. I am working with a TREC dataset,
> since it is static, I set size of that array experimentally.
>
> I followed the DefaultSimilarity#lengthNorm method a bit.
>
> If default similarity and no index time boost is used,
> I assume that norm equals to  1.0 / Math.sqrt(numTerms).
>
> First option is somehow obtain pre-computed norm value and apply reverse operation to obtain numTerms.
> numTerms = (1/norm)^2  This will be an approximation because norms are stored in a byte.
> How do I access that norm value for a given docid and a field?

See the AtomicReader.getNormValues method.

> Second option, I store numTerms as a separate field, like any other organic fields.
> Do I need to calculate it by myself? Or can I access above already computed numTerms value during indexing?
>
> I think I will follow second option.
> Is there a pointer to an example demonstrating reading/writing a DocValues-based field?

You could just make your own Similarity impl, that encodes the norm
directly as a length?  It's a long so you don't have to compress if
you don't want to.

That custom Similarity is passed FieldInvertState which contains the
number of tokens in the current field, so you can just use that
instead of computing it yourself.


Mike McCandless

http://blog.mikemccandless.com


Re: getting number of terms in a document/field

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hi Michael,
>
> Thanks for the explanation. I am working with a TREC dataset,
> since it is static, I set size of that array experimentally.
>
> I followed the DefaultSimilarity#lengthNorm method a bit.
>
> If default similarity and no index time boost is used,
> I assume that norm equals to  1.0 / Math.sqrt(numTerms).
>
> First option is somehow obtain pre-computed norm value and apply reverse operation to obtain numTerms.
> numTerms = (1/norm)^2  This will be an approximation because norms are stored in a byte.
> How do I access that norm value for a given docid and a field?

See the AtomicReader.getNormValues method.

> Second option, I store numTerms as a separate field, like any other organic fields.
> Do I need to calculate it by myself? Or can I access above already computed numTerms value during indexing?
>
> I think I will follow second option.
> Is there a pointer to an example demonstrating reading/writing a DocValues-based field?

You could just make your own Similarity impl that encodes the norm
directly as the length. It's a long, so you don't have to compress it if
you don't want to.

That custom Similarity is passed FieldInvertState which contains the
number of tokens in the current field, so you can just use that
instead of computing it yourself.
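
A rough, untested sketch of this idea against the Lucene 4.x API (LengthNormSimilarity is a hypothetical name; note that because DefaultSimilarity's scorer expects byte-encoded norms, delegating scoring like this only makes sense when you read the raw-length norms yourself rather than use them for ranking):

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.Similarity;

// Hypothetical sketch: store the raw token count as the norm.
public class LengthNormSimilarity extends Similarity {
    private final Similarity delegate = new DefaultSimilarity();

    @Override
    public long computeNorm(FieldInvertState state) {
        // FieldInvertState already knows the number of tokens in the
        // field, so there is no need to count them yourself.
        return state.getLength();
    }

    @Override
    public SimWeight computeWeight(float queryBoost,
            CollectionStatistics collectionStats, TermStatistics... termStats) {
        return delegate.computeWeight(queryBoost, collectionStats, termStats);
    }

    @Override
    public SimScorer simScorer(SimWeight weight, AtomicReaderContext context)
            throws IOException {
        return delegate.simScorer(weight, context);
    }
}
```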

Mike McCandless

http://blog.mikemccandless.com



Re: getting number of terms in a document/field

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Michael,

Thanks for the explanation. I am working with a TREC dataset;
since it is static, I set the size of that array experimentally.

I followed the DefaultSimilarity#lengthNorm method a bit.

If the default similarity is used and there is no index-time boost,
I assume that the norm equals 1.0 / Math.sqrt(numTerms).

The first option is to somehow obtain the pre-computed norm value and apply the reverse operation to obtain numTerms:
numTerms = (1/norm)^2. This will be an approximation because norms are stored in a single byte.
How do I access that norm value for a given docID and field?
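
The inversion above can be written as a tiny helper (a sketch; approxNumTerms is a hypothetical name, and it assumes the stored norm has already been decoded back to a float):

```java
// Sketch, assuming default similarity with no index-time boost,
// so norm = 1/sqrt(numTerms).
public class NormInversion {

    // Recover an approximate term count from a decoded norm value.
    // Approximate because Lucene stores norms in a single byte.
    public static long approxNumTerms(float norm) {
        double inverse = 1.0 / norm;          // 1/norm = sqrt(numTerms)
        return Math.round(inverse * inverse); // (1/norm)^2 = numTerms
    }

    public static void main(String[] args) {
        // a 100-term field has norm 1/sqrt(100) = 0.1
        System.out.println(approxNumTerms(0.1f)); // prints 100
    }
}
```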

The second option is to store numTerms as a separate field, like any other organic field.
Do I need to calculate it myself, or can I access the already-computed numTerms value during indexing?

I think I will follow the second option.
Is there a pointer to an example demonstrating reading/writing a DocValues-based field?

Thanks,
Ahmet


On Friday, February 6, 2015 11:08 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
How will you know how large to allocate that array?  The within-doc
term freq can in general be arbitrarily large...

Lucene does not directly store the total number of terms in a
document, but it does store it approximately in the doc's norm value.
Maybe you can use that?  Alternatively, you can store this statistic
yourself, e.g. as a doc value.

Mike McCandless

http://blog.mikemccandless.com



On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hello Lucene Users,
>
> I am traversing all documents that contains a given term with following code :
>
> Term term = new Term(field, word);
> Bits bits = MultiFields.getLiveDocs(reader);
> DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());
>
> while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {
>
> array[docsEnum.freq()]++;
>
> // how to retrieve term count for this document?
>    xxxxx(docsEnum.docID(), field);
>
>
> }
>
> How can I get field term count values for these documents using Lucene 4.10.3?
>
> Is above code OK for traversing posting list of term?
>
> Thanks,
> Ahmet


Re: getting number of terms in a document/field

Posted by Michael McCandless <lu...@mikemccandless.com>.
How will you know how large to allocate that array?  The within-doc
term freq can in general be arbitrarily large...
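
One way around the fixed-size array (a sketch, not from the thread; FreqHistogram is a hypothetical name) is to grow the backing array on demand, so no upper bound on the within-doc term frequency has to be guessed in advance:

```java
import java.util.Arrays;

// A frequency histogram whose backing array grows as needed.
public class FreqHistogram {
    private int[] counts = new int[16];

    public void add(int freq) {
        if (freq >= counts.length) {
            // grow to at least freq + 1, doubling to amortize copies
            counts = Arrays.copyOf(counts,
                    Math.max(freq + 1, counts.length * 2));
        }
        counts[freq]++;
    }

    public int get(int freq) {
        return freq < counts.length ? counts[freq] : 0;
    }

    public static void main(String[] args) {
        FreqHistogram h = new FreqHistogram();
        h.add(3); h.add(3); h.add(1000);
        System.out.println(h.get(3)); // prints 2
    }
}
```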

Lucene does not directly store the total number of terms in a
document, but it does store it approximately in the doc's norm value.
Maybe you can use that?  Alternatively, you can store this statistic
yourself, e.g. as a doc value.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan <io...@yahoo.com.invalid> wrote:
> Hello Lucene Users,
>
> I am traversing all documents that contains a given term with following code :
>
> Term term = new Term(field, word);
> Bits bits = MultiFields.getLiveDocs(reader);
> DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());
>
> while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {
>
> array[docsEnum.freq()]++;
>
> // how to retrieve term count for this document?
>    xxxxx(docsEnum.docID(), field);
>
>
> }
>
> How can I get field term count values for these documents using Lucene 4.10.3?
>
> Is above code OK for traversing posting list of term?
>
> Thanks,
> Ahmet