Posted to user@lucenenet.apache.org by William Young <wy...@streetdiligence.com> on 2017/04/04 21:52:40 UTC

Getting Per Document Frequencies in Apache Lucenenet 4.8.0.0

I'm using this version of Lucene.NET: https://github.com/apache/lucenenet

I'm trying to get the number of phrase matches per document using a
PhraseQuery and an ExactPhraseScorer like so:

// Some phraseQuery defined here

using (IndexReader indexReader = DirectoryReader.Open(IndexerJob.LuceneDirectory))
{
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    TopDocs topDocs = indexSearcher.Search(masterQuery, _MAXSEARCHRESULTS);
    var weight = phraseQuery.CreateWeight(indexSearcher);

    var scorers = indexReader.Leaves
        .Select(o => weight.Scorer(o, o.AtomicReader.LiveDocs))
        .Where(o => o != null);
    foreach (var scorer in scorers)
    {
        while (scorer.NextDoc() != DocIdSetIterator.NO_MORE_DOCS)
        {
            int doc = scorer.DocID();
            int freq = scorer.Freq();
            Console.WriteLine("Document {0} contains {1} matches", doc, freq);
        }
    }
}

But when I call scorer.NextDoc(), it always returns
DocIdSetIterator.NO_MORE_DOCS, so the body of the while loop is never
executed. I tried this with a TermQuery instead of a PhraseQuery, and it
works fine, which suggests the problem lies in the implementation of
PhraseQuery and the ExactPhraseScorer.

I looked at the source code, and there seems to be a function in
ExactPhraseScorer:

private int PhraseFreq() { ... }

This function appears to be responsible for calculating the per-document
counts. The int[] fields Counts and Gens are also involved, but I don't
understand what they do well enough to diagnose the problem.
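
For reference, here is a minimal sketch of what an exact phrase frequency
means conceptually: the number of start positions at which every term of the
phrase occurs consecutively. This is not the Lucene.NET implementation, which
works over chunked Counts and Gens arrays; the class and method names below
are hypothetical, and the sketch assumes the phrase terms sit at consecutive
positions with no gaps.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class PhraseFreqSketch
{
    // positions[i] holds the token positions of the i-th phrase term
    // within a single document.
    public static int CountExactPhrase(IReadOnlyList<int[]> positions)
    {
        if (positions.Count == 0)
        {
            return 0;
        }

        // Build one lookup set per term after the first, so that
        // "does term i occur at position p?" is a constant-time check.
        var lookups = new List<HashSet<int>>();
        for (int i = 1; i < positions.Count; i++)
        {
            lookups.Add(new HashSet<int>(positions[i]));
        }

        int freq = 0;
        foreach (int start in positions[0])
        {
            // The phrase matches at 'start' if term i occurs at start + i
            // for every remaining term.
            bool match = true;
            for (int i = 1; i < positions.Count && match; i++)
            {
                match = lookups[i - 1].Contains(start + i);
            }
            if (match)
            {
                freq++;
            }
        }
        return freq;
    }
}
```

For a document with the tokens "new york is not new york city", the term
"new" occurs at positions 0 and 4 and "york" at 1 and 5, so
CountExactPhrase(new List<int[]> { new[] { 0, 4 }, new[] { 1, 5 } })
returns 2. That is the value I would expect scorer.Freq() to report for
that document.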

Any ideas?

William

Re: Getting Per Document Frequencies in Apache Lucenenet 4.8.0.0

Posted by Itamar Syn-Hershko <it...@code972.com>.
You don't need any of the code after TopDocs topDocs. Just read
topDocs.TotalHits and make sure your query is a PhraseQuery; that is how
you will know all hits are for the phrase you were searching for.

--

Itamar Syn-Hershko
Freelance Developer & Consultant
Elasticsearch Partner
Microsoft MVP | Lucene.NET PMC
http://code972.com | @synhershko <https://twitter.com/synhershko>
http://BigDataBoutique.co.il/
