You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2012/12/17 17:24:12 UTC

DocsEnum.freq()

Hi

While migrating code to Lucene 4.0, I noticed that I have an assert on a
field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
thinking ... why?

If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
"lie" to the consumer? Rather, we could just return 0 or -1?

I personally don't mind if we continue to return 1, if there's a real
reason to. I don't think that anyone should call freq() if he asked for
DocsEnum with FLAG_NONE. But if we do keep the current behavior, can we at
least document it?

E.g., something like this patch:

Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
1422804)
+++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
copy)
@@ -47,10 +47,16 @@
   protected DocsEnum() {
   }

-  /** Returns term frequency in the current document.  Do
-   *  not call this before {@link #nextDoc} is first called,
-   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
-   **/
+  /**
+   * Returns term frequency in the current document, or 1 if the
+   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
this
+   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
returns
+   * {@link DocIdSetIterator#NO_MORE_DOCS}.
+   *
+   * <p>
+   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
#FLAG_NONE},
+   * this method returns 1.
+   */
   public abstract int freq() throws IOException;

   /** Returns the related attributes. */

Shai

Re: DocsEnum.freq()

Posted by Shai Erera <se...@gmail.com>.

>
> Also, this will only work if you run it from command-line (an ant task)
>

That's fine. The purpose (for me) is to ensure the robustness of a test /
code change.
Once a seed is found which fails the test, we should be able to debug the
test in eclipse
using -Dtests.seed, at least for single-threaded tests.

Apologies for the inconvenience it causes.
>

No apologies needed !
Your work is a blessing. This runner is really great.

Shai


On Wed, Dec 19, 2012 at 9:40 PM, Dawid Weiss
<da...@cs.put.poznan.pl>wrote:

> > BTW, I don't know if it should be like that or not, but I ran this test
> with
> > -Dtests.iters=1000 and printed at the end of each iteration
> > Codec.getDefault().
> > For all iterations it printed the same Codec, but if I ran the test many
> > times (no iteration), it printed different Codecs in each run.
> > I thought that tests.iters should simulate the behavior of running the
> test
> > many times? And therefore pick a new seed + Codec (among other things)
> for
> > each iteration?
>
> Sorry, I missed this. What you're observing is correct -- the Codec is
> picked at the class level (in TestRuleSetupAndRestoreClassEnv.java).
> -Dtests.iters duplicates each test much like if you put two or more
> test methods in a suite class. Any static rules and static hooks
> execute *once* for a class, regardless of the number of individual
> tests, so the codec is picked once, depending on the master seed (not
> the per-method derivative seed).
>
> This would be of course solved if tests.dups could run each class with
> a different seed. I wish I had more time to look into this; I kind of
> know how to solve this I think but it'll require a significant effort
> put into rewriting core parts of the runner. Also, this will only work
> if you run it from command-line (an ant task); there is no way to
> duplicate an entire test suite in Eclipse or other IDEs.
>
> I'll try to look into this, be patient. Apologies for the
> inconvenience it causes.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: DocsEnum.freq()

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> BTW, I don't know if it should be like that or not, but I ran this test with
> -Dtests.iters=1000 and printed at the end of each iteration
> Codec.getDefault().
> For all iterations it printed the same Codec, but if I ran the test many
> times (no iteration), it printed different Codecs in each run.
> I thought that tests.iters should simulate the behavior of running the test
> many times? And therefore pick a new seed + Codec (among other things) for
> each iteration?

Sorry, I missed this. What you're observing is correct -- the Codec is
picked at the class level (in TestRuleSetupAndRestoreClassEnv.java).
-Dtests.iters duplicates each test much like if you put two or more
test methods in a suite class. Any static rules and static hooks
execute *once* for a class, regardless of the number of individual
tests, so the codec is picked once, depending on the master seed (not
the per-method derivative seed).

This would be of course solved if tests.dups could run each class with
a different seed. I wish I had more time to look into this; I kind of
know how to solve this I think but it'll require a significant effort
put into rewriting core parts of the runner. Also, this will only work
if you run it from command-line (an ant task); there is no way to
duplicate an entire test suite in Eclipse or other IDEs.

I'll try to look into this, be patient. Apologies for the
inconvenience it causes.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Robert Muir <rc...@gmail.com>.

+1

On Wed, Dec 19, 2012 at 7:42 AM, Shai Erera <se...@gmail.com> wrote:
> Thanks for catching that Mike. Will change and commit.
>
> Shai
>
>
> On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> +1, but can you change "... if the Document was indexed with
>> DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?
>>
>> Thanks Shai!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <se...@gmail.com> wrote:
>> > Here's the patch with test:
>> >
>> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
>> > ===================================================================
>> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
>> > (revision
>> > 1423774)
>> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
>> > copy)
>> > @@ -19,6 +19,7 @@
>> >
>> >  import java.io.IOException;
>> >
>> > +import org.apache.lucene.index.FieldInfo.IndexOptions;
>> >  import org.apache.lucene.search.DocIdSetIterator;
>> >  import org.apache.lucene.util.AttributeSource;
>> >  import org.apache.lucene.util.Bits; // javadocs
>> > @@ -47,10 +48,16 @@
>> >
>> >    protected DocsEnum() {
>> >    }
>> >
>> > -  /** Returns term frequency in the current document.  Do
>> > -   *  not call this before {@link #nextDoc} is first called,
>> > -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
>> > -   **/
>> > +  /**
>> > +   * Returns term frequency in the current document, or 1 if the
>> > document
>> > was
>> > +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
>> > before
>> > +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
>> > returns
>> >
>> > +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
>> > +   *
>> > +   * <p>
>> > +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
>> > #FLAG_NONE},
>> > +   * the result of this method is undefined.
>> >
>> > +   */
>> >    public abstract int freq() throws IOException;
>> >
>> >    /** Returns the related attributes. */
>> > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > ===================================================================
>> > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > (revision 1423774)
>> > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
>> > (working copy)
>> > @@ -21,6 +21,7 @@
>> >  import java.util.Arrays;
>> >  import java.util.HashSet;
>> >  import java.util.Iterator;
>> > +import java.util.Random;
>> >
>> >  import org.apache.lucene.analysis.MockAnalyzer;
>> >  import org.apache.lucene.codecs.Codec;
>> > @@ -31,7 +32,9 @@
>> >  import org.apache.lucene.codecs.TermsConsumer;
>> >  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
>> >  import org.apache.lucene.document.Document;
>> > +import org.apache.lucene.document.Field.Store;
>> >  import org.apache.lucene.document.FieldType;
>> > +import org.apache.lucene.document.StringField;
>> >  import org.apache.lucene.document.TextField;
>> >  import org.apache.lucene.index.FieldInfo.IndexOptions;
>> >  import org.apache.lucene.search.DocIdSetIterator;
>> > @@ -630,4 +633,33 @@
>> >      }
>> >      consumer.close();
>> >    }
>> > +
>> > +  public void testDocsOnlyFreq() throws Exception {
>> > +    // tests that when fields are indexed with DOCS_ONLY, the Codec
>> > +    // returns 1 in docsEnum.freq()
>> > +    Directory dir = newDirectory();
>> > +    Random random = random();
>> > +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
>> > +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
>> > +    // we don't need many documents to assert this, but don't use one
>> > document either
>> > +    int numDocs = atLeast(random, 50);
>> > +    for (int i = 0; i < numDocs; i++) {
>> > +      Document doc = new Document();
>> > +      doc.add(new StringField("f", "doc", Store.NO));
>> > +      writer.addDocument(doc);
>> > +    }
>> > +    writer.close();
>> > +
>> > +    Term term = new Term("f", new BytesRef("doc"));
>> > +    DirectoryReader reader = DirectoryReader.open(dir);
>> > +    for (AtomicReaderContext ctx : reader.leaves()) {
>> > +      DocsEnum de = ctx.reader().termDocsEnum(term);
>> > +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
>> > +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
>> > +      }
>> > +    }
>> > +    reader.close();
>> > +
>> > +    dir.close();
>> > +  }
>> >  }
>> >
>> > BTW, I don't know if it should be like that or not, but I ran this test
>> > with
>> > -Dtests.iters=1000 and printed at the end of each iteration
>> > Codec.getDefault().
>> > For all iterations it printed the same Codec, but if I ran the test many
>> > times (no iteration), it printed different Codecs in each run.
>> > I thought that tests.iters should simulate the behavior of running the
>> > test
>> > many times? And therefore pick a new seed + Codec (among other things)
>> > for
>> > each iteration?
>> >
>> > Shai
>> >
>> >
>> > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
>> > <lu...@mikemccandless.com> wrote:
>> >>
>> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <se...@gmail.com> wrote:
>> >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
>> >> > Do
>> >> > we
>> >> > have a test that can trip bad Codecs?
>> >>
>> >> I'm not sure!  We should make a test & fix any failing ones ...
>> >>
>> >> > It may be more than just changing the documentation...
>> >>
>> >> Right.
>> >>
>> >> > Why would e.g. TermQuery need to write specialized code for these
>> >> > cases?
>> >> > I
>> >> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>> >>
>> >> I meant if we did not adopt this spec ("freq() will lie and return 1
>> >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
>> >> need specialized code.
>> >>
>> >> > I think that Similarity may be affected? Which brings the question -
>> >> > how
>> >> > do
>> >> > Similarity impls know what flags the DE was opened with, and
>> >> > shouldn't
>> >> > they
>> >> > be specialized?
>> >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to
>> >> > score()
>> >> > as
>> >> > an index to an array, so clearly it assumes it is >= 0 and also <
>> >> > scoreCache.length.
>> >> > So I wonder what will happen to it when someone's Codec will return a
>> >> > negative value or MAX_INT in case frequencies aren't needed?
>> >>
>> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your
>> >> responsibility to never call freq() ... ie, don't call freq() and pass
>> >> that to the sim.
>> >>
>> >> > I do realize that you shouldn't call Similarity with missing
>> >> > information,
>> >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard
>> >> > it
>> >> > is
>> >> > safe.
>> >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know
>> >> > what
>> >> > you're doing and don't pass a random freq() to Similarity.
>> >>
>> >> Right.
>> >>
>> >> > I lean towards documenting the spec from above, and ensuring that all
>> >> > Codecs
>> >> > return 1 for DOCS_ONLY.
>> >>
>> >> +1
>> >>
>> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
>> >> and say freq=1 (need a test verifying this) if the field was indexed
>> >> as DOCS_ONLY.
>> >>
>> >> > If in the future we'll need to handle the case where someone receives
>> >> > a
>> >> > DocsEnum which it needs to consume, and doesn't know which flags were
>> >> > used
>> >> > to open it, we can always add a getFlags to DE.
>> >>
>> >> Yeah ...
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Shai Erera <se...@gmail.com>.

Thanks for catching that Mike. Will change and commit.

Shai


On Wed, Dec 19, 2012 at 2:38 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> +1, but can you change "... if the Document was indexed with
> DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?
>
> Thanks Shai!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <se...@gmail.com> wrote:
> > Here's the patch with test:
> >
> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> > ===================================================================
> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> > 1423774)
> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> > copy)
> > @@ -19,6 +19,7 @@
> >
> >  import java.io.IOException;
> >
> > +import org.apache.lucene.index.FieldInfo.IndexOptions;
> >  import org.apache.lucene.search.DocIdSetIterator;
> >  import org.apache.lucene.util.AttributeSource;
> >  import org.apache.lucene.util.Bits; // javadocs
> > @@ -47,10 +48,16 @@
> >
> >    protected DocsEnum() {
> >    }
> >
> > -  /** Returns term frequency in the current document.  Do
> > -   *  not call this before {@link #nextDoc} is first called,
> > -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> > -   **/
> > +  /**
> > +   * Returns term frequency in the current document, or 1 if the
> document
> > was
> > +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this
> before
> > +   * {@link #nextDoc} is first called, nor after {@link #nextDoc}
> returns
> >
> > +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> > +   *
> > +   * <p>
> > +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> > #FLAG_NONE},
> > +   * the result of this method is undefined.
> >
> > +   */
> >    public abstract int freq() throws IOException;
> >
> >    /** Returns the related attributes. */
> > Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > ===================================================================
> > --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > (revision 1423774)
> > +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> > (working copy)
> > @@ -21,6 +21,7 @@
> >  import java.util.Arrays;
> >  import java.util.HashSet;
> >  import java.util.Iterator;
> > +import java.util.Random;
> >
> >  import org.apache.lucene.analysis.MockAnalyzer;
> >  import org.apache.lucene.codecs.Codec;
> > @@ -31,7 +32,9 @@
> >  import org.apache.lucene.codecs.TermsConsumer;
> >  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
> >  import org.apache.lucene.document.Document;
> > +import org.apache.lucene.document.Field.Store;
> >  import org.apache.lucene.document.FieldType;
> > +import org.apache.lucene.document.StringField;
> >  import org.apache.lucene.document.TextField;
> >  import org.apache.lucene.index.FieldInfo.IndexOptions;
> >  import org.apache.lucene.search.DocIdSetIterator;
> > @@ -630,4 +633,33 @@
> >      }
> >      consumer.close();
> >    }
> > +
> > +  public void testDocsOnlyFreq() throws Exception {
> > +    // tests that when fields are indexed with DOCS_ONLY, the Codec
> > +    // returns 1 in docsEnum.freq()
> > +    Directory dir = newDirectory();
> > +    Random random = random();
> > +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
> > +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
> > +    // we don't need many documents to assert this, but don't use one
> > document either
> > +    int numDocs = atLeast(random, 50);
> > +    for (int i = 0; i < numDocs; i++) {
> > +      Document doc = new Document();
> > +      doc.add(new StringField("f", "doc", Store.NO));
> > +      writer.addDocument(doc);
> > +    }
> > +    writer.close();
> > +
> > +    Term term = new Term("f", new BytesRef("doc"));
> > +    DirectoryReader reader = DirectoryReader.open(dir);
> > +    for (AtomicReaderContext ctx : reader.leaves()) {
> > +      DocsEnum de = ctx.reader().termDocsEnum(term);
> > +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> > +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
> > +      }
> > +    }
> > +    reader.close();
> > +
> > +    dir.close();
> > +  }
> >  }
> >
> > BTW, I don't know if it should be like that or not, but I ran this test
> with
> > -Dtests.iters=1000 and printed at the end of each iteration
> > Codec.getDefault().
> > For all iterations it printed the same Codec, but if I ran the test many
> > times (no iteration), it printed different Codecs in each run.
> > I thought that tests.iters should simulate the behavior of running the
> test
> > many times? And therefore pick a new seed + Codec (among other things)
> for
> > each iteration?
> >
> > Shai
> >
> >
> > On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <se...@gmail.com> wrote:
> >> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY?
> Do
> >> > we
> >> > have a test that can trip bad Codecs?
> >>
> >> I'm not sure!  We should make a test & fix any failing ones ...
> >>
> >> > It may be more than just changing the documentation...
> >>
> >> Right.
> >>
> >> > Why would e.g. TermQuery need to write specialized code for these
> cases?
> >> > I
> >> > looked at TermScorer, and its freq() just returns docsEnum.freq().
> >>
> >> I meant if we did not adopt this spec ("freq() will lie and return 1
> >> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
> >> need specialized code.
> >>
> >> > I think that Similarity may be affected? Which brings the question -
> how
> >> > do
> >> > Similarity impls know what flags the DE was opened with, and shouldn't
> >> > they
> >> > be specialized?
> >> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to
> score()
> >> > as
> >> > an index to an array, so clearly it assumes it is >= 0 and also <
> >> > scoreCache.length.
> >> > So I wonder what will happen to it when someone's Codec will return a
> >> > negative value or MAX_INT in case frequencies aren't needed?
> >>
> >> Well, if you passed FLAGS_NONE when you opened the DE then it's your
> >> responsibility to never call freq() ... ie, don't call freq() and pass
> >> that to the sim.
> >>
> >> > I do realize that you shouldn't call Similarity with missing
> >> > information,
> >> > and TermWeight obtains a DocsEnum with frequencies, so in that regard
> it
> >> > is
> >> > safe.
> >> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
> >> > you're doing and don't pass a random freq() to Similarity.
> >>
> >> Right.
> >>
> >> > I lean towards documenting the spec from above, and ensuring that all
> >> > Codecs
> >> > return 1 for DOCS_ONLY.
> >>
> >> +1
> >>
> >> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
> >> and say freq=1 (need a test verifying this) if the field was indexed
> >> as DOCS_ONLY.
> >>
> >> > If in the future we'll need to handle the case where someone receives
> a
> >> > DocsEnum which it needs to consume, and doesn't know which flags were
> >> > used
> >> > to open it, we can always add a getFlags to DE.
> >>
> >> Yeah ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: DocsEnum.freq()

Posted by Michael McCandless <lu...@mikemccandless.com>.

+1, but can you change "... if the Document was indexed with
DOCS_ONLY" to "... if the Field was indexed with DOCS_ONLY"?

Thanks Shai!

Mike McCandless

http://blog.mikemccandless.com

On Wed, Dec 19, 2012 at 3:14 AM, Shai Erera <se...@gmail.com> wrote:
> Here's the patch with test:
>
> Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> ===================================================================
> --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> 1423774)
> +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> copy)
> @@ -19,6 +19,7 @@
>
>  import java.io.IOException;
>
> +import org.apache.lucene.index.FieldInfo.IndexOptions;
>  import org.apache.lucene.search.DocIdSetIterator;
>  import org.apache.lucene.util.AttributeSource;
>  import org.apache.lucene.util.Bits; // javadocs
> @@ -47,10 +48,16 @@
>
>    protected DocsEnum() {
>    }
>
> -  /** Returns term frequency in the current document.  Do
> -   *  not call this before {@link #nextDoc} is first called,
> -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> -   **/
> +  /**
> +   * Returns term frequency in the current document, or 1 if the document
> was
> +   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
> +   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns
>
> +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> +   *
> +   * <p>
> +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> #FLAG_NONE},
> +   * the result of this method is undefined.
>
> +   */
>    public abstract int freq() throws IOException;
>
>    /** Returns the related attributes. */
> Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> ===================================================================
> --- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> (revision 1423774)
> +++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
> (working copy)
> @@ -21,6 +21,7 @@
>  import java.util.Arrays;
>  import java.util.HashSet;
>  import java.util.Iterator;
> +import java.util.Random;
>
>  import org.apache.lucene.analysis.MockAnalyzer;
>  import org.apache.lucene.codecs.Codec;
> @@ -31,7 +32,9 @@
>  import org.apache.lucene.codecs.TermsConsumer;
>  import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
>  import org.apache.lucene.document.Document;
> +import org.apache.lucene.document.Field.Store;
>  import org.apache.lucene.document.FieldType;
> +import org.apache.lucene.document.StringField;
>  import org.apache.lucene.document.TextField;
>  import org.apache.lucene.index.FieldInfo.IndexOptions;
>  import org.apache.lucene.search.DocIdSetIterator;
> @@ -630,4 +633,33 @@
>      }
>      consumer.close();
>    }
> +
> +  public void testDocsOnlyFreq() throws Exception {
> +    // tests that when fields are indexed with DOCS_ONLY, the Codec
> +    // returns 1 in docsEnum.freq()
> +    Directory dir = newDirectory();
> +    Random random = random();
> +    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
> +        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
> +    // we don't need many documents to assert this, but don't use one
> document either
> +    int numDocs = atLeast(random, 50);
> +    for (int i = 0; i < numDocs; i++) {
> +      Document doc = new Document();
> +      doc.add(new StringField("f", "doc", Store.NO));
> +      writer.addDocument(doc);
> +    }
> +    writer.close();
> +
> +    Term term = new Term("f", new BytesRef("doc"));
> +    DirectoryReader reader = DirectoryReader.open(dir);
> +    for (AtomicReaderContext ctx : reader.leaves()) {
> +      DocsEnum de = ctx.reader().termDocsEnum(term);
> +      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
> +        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
> +      }
> +    }
> +    reader.close();
> +
> +    dir.close();
> +  }
>  }
>
> BTW, I don't know if it should be like that or not, but I ran this test with
> -Dtests.iters=1000 and printed at the end of each iteration
> Codec.getDefault().
> For all iterations it printed the same Codec, but if I ran the test many
> times (no iteration), it printed different Codecs in each run.
> I thought that tests.iters should simulate the behavior of running the test
> many times? And therefore pick a new seed + Codec (among other things) for
> each iteration?
>
> Shai
>
>
> On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <se...@gmail.com> wrote:
>> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do
>> > we
>> > have a test that can trip bad Codecs?
>>
>> I'm not sure!  We should make a test & fix any failing ones ...
>>
>> > It may be more than just changing the documentation...
>>
>> Right.
>>
>> > Why would e.g. TermQuery need to write specialized code for these cases?
>> > I
>> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>>
>> I meant if we did not adopt this spec ("freq() will lie and return 1
>> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
>> need specialized code.
>>
>> > I think that Similarity may be affected? Which brings the question - how
>> > do
>> > Similarity impls know what flags the DE was opened with, and shouldn't
>> > they
>> > be specialized?
>> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score()
>> > as
>> > an index to an array, so clearly it assumes it is >= 0 and also <
>> > scoreCache.length.
>> > So I wonder what will happen to it when someone's Codec will return a
>> > negative value or MAX_INT in case frequencies aren't needed?
>>
>> Well, if you passed FLAGS_NONE when you opened the DE then it's your
>> responsibility to never call freq() ... ie, don't call freq() and pass
>> that to the sim.
>>
>> > I do realize that you shouldn't call Similarity with missing
>> > information,
>> > and TermWeight obtains a DocsEnum with frequencies, so in that regard it
>> > is
>> > safe.
>> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
>> > you're doing and don't pass a random freq() to Similarity.
>>
>> Right.
>>
>> > I lean towards documenting the spec from above, and ensuring that all
>> > Codecs
>> > return 1 for DOCS_ONLY.
>>
>> +1
>>
>> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
>> and say freq=1 (need a test verifying this) if the field was indexed
>> as DOCS_ONLY.
>>
>> > If in the future we'll need to handle the case where someone receives a
>> > DocsEnum which it needs to consume, and doesn't know which flags were
>> > used
>> > to open it, we can always add a getFlags to DE.
>>
>> Yeah ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Shai Erera <se...@gmail.com>.

Here's the patch with test:

Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
===================================================================
--- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
1423774)
+++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
copy)
@@ -19,6 +19,7 @@

 import java.io.IOException;

+import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
 import org.apache.lucene.util.AttributeSource;
 import org.apache.lucene.util.Bits; // javadocs
@@ -47,10 +48,16 @@
   protected DocsEnum() {
   }

-  /** Returns term frequency in the current document.  Do
-   *  not call this before {@link #nextDoc} is first called,
-   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
-   **/
+  /**
+   * Returns term frequency in the current document, or 1 if the document
was
+   * indexed with {@link IndexOptions#DOCS_ONLY}. Do not call this before
+   * {@link #nextDoc} is first called, nor after {@link #nextDoc} returns
+   * {@link DocIdSetIterator#NO_MORE_DOCS}.
+   *
+   * <p>
+   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
#FLAG_NONE},
+   * the result of this method is undefined.
+   */
   public abstract int freq() throws IOException;

   /** Returns the related attributes. */
Index: lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
===================================================================
--- lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
(revision 1423774)
+++ lucene/core/src/test/org/apache/lucene/index/TestCodecs.java
(working copy)
@@ -21,6 +21,7 @@
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Iterator;
+import java.util.Random;

 import org.apache.lucene.analysis.MockAnalyzer;
 import org.apache.lucene.codecs.Codec;
@@ -31,7 +32,9 @@
 import org.apache.lucene.codecs.TermsConsumer;
 import org.apache.lucene.codecs.mocksep.MockSepPostingsFormat;
 import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field.Store;
 import org.apache.lucene.document.FieldType;
+import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 import org.apache.lucene.index.FieldInfo.IndexOptions;
 import org.apache.lucene.search.DocIdSetIterator;
@@ -630,4 +633,33 @@
     }
     consumer.close();
   }
+
+  public void testDocsOnlyFreq() throws Exception {
+    // tests that when fields are indexed with DOCS_ONLY, the Codec
+    // returns 1 in docsEnum.freq()
+    Directory dir = newDirectory();
+    Random random = random();
+    IndexWriter writer = new IndexWriter(dir, newIndexWriterConfig(
+        TEST_VERSION_CURRENT, new MockAnalyzer(random)));
+    // we don't need many documents to assert this, but don't use one
document either
+    int numDocs = atLeast(random, 50);
+    for (int i = 0; i < numDocs; i++) {
+      Document doc = new Document();
+      doc.add(new StringField("f", "doc", Store.NO));
+      writer.addDocument(doc);
+    }
+    writer.close();
+
+    Term term = new Term("f", new BytesRef("doc"));
+    DirectoryReader reader = DirectoryReader.open(dir);
+    for (AtomicReaderContext ctx : reader.leaves()) {
+      DocsEnum de = ctx.reader().termDocsEnum(term);
+      while (de.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+        assertEquals("wrong freq for doc " + de.docID(), 1, de.freq());
+      }
+    }
+    reader.close();
+
+    dir.close();
+  }
 }

BTW, I don't know if it should be like that or not, but I ran this test
with -Dtests.iters=1000 and printed at the end of each iteration
Codec.getDefault().
For all iterations it printed the same Codec, but if I ran the test many
times (no iteration), it printed different Codecs in each run.
I thought that tests.iters should simulate the behavior of running the test
many times? And therefore pick a new seed + Codec (among other things) for
each iteration?

Shai


On Tue, Dec 18, 2012 at 1:09 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <se...@gmail.com> wrote:
> > Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do
> we
> > have a test that can trip bad Codecs?
>
> I'm not sure!  We should make a test & fix any failing ones ...
>
> > It may be more than just changing the documentation...
>
> Right.
>
> > Why would e.g. TermQuery need to write specialized code for these cases?
> I
> > looked at TermScorer, and its freq() just returns docsEnum.freq().
>
> I meant if we did not adopt this spec ("freq() will lie and return 1
> when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
> need specialized code.
>
> > I think that Similarity may be affected? Which brings the question - how
> do
> > Similarity impls know what flags the DE was opened with, and shouldn't
> they
> > be specialized?
> > E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score()
> as
> > an index to an array, so clearly it assumes it is >= 0 and also <
> > scoreCache.length.
> > So I wonder what will happen to it when someone's Codec will return a
> > negative value or MAX_INT in case frequencies aren't needed?
>
> Well, if you passed FLAGS_NONE when you opened the DE then it's your
> responsibility to never call freq() ... ie, don't call freq() and pass
> that to the sim.
>
> > I do realize that you shouldn't call Similarity with missing information,
> > and TermWeight obtains a DocsEnum with frequencies, so in that regard it
> is
> > safe.
> > And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
> > you're doing and don't pass a random freq() to Similarity.
>
> Right.
>
> > I lean towards documenting the spec from above, and ensuring that all
> Codecs
> > return 1 for DOCS_ONLY.
>
> +1
>
> So freq() is undefined if you had passed FLAGS_NONE, and we will lie
> and say freq=1 (need a test verifying this) if the field was indexed
> as DOCS_ONLY.
>
> > If in the future we'll need to handle the case where someone receives a
> > DocsEnum which it needs to consume, and doesn't know which flags were
> used
> > to open it, we can always add a getFlags to DE.
>
> Yeah ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: DocsEnum.freq()

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Dec 18, 2012 at 4:46 AM, Shai Erera <se...@gmail.com> wrote:
> Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we
> have a test that can trip bad Codecs?

I'm not sure!  We should make a test & fix any failing ones ...

> It may be more than just changing the documentation...

Right.

> Why would e.g. TermQuery need to write specialized code for these cases? I
> looked at TermScorer, and its freq() just returns docsEnum.freq().

I meant if we did not adopt this spec ("freq() will lie and return 1
when the field was indexed as DOCS_ONLY"), then e.g. TermQuery would
need specialized code.

> I think that Similarity may be affected? Which brings the question - how do
> Similarity impls know what flags the DE was opened with, and shouldn't they
> be specialized?
> E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as
> an index to an array, so clearly it assumes it is >= 0 and also <
> scoreCache.length.
> So I wonder what will happen to it when someone's Codec will return a
> negative value or MAX_INT in case frequencies aren't needed?

Well, if you passed FLAGS_NONE when you opened the DE then it's your
responsibility to never call freq() ... ie, don't call freq() and pass
that to the sim.

> I do realize that you shouldn't call Similarity with missing information,
> and TermWeight obtains a DocsEnum with frequencies, so in that regard it is
> safe.
> And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
> you're doing and don't pass a random freq() to Similarity.

Right.

> I lean towards documenting the spec from above, and ensuring that all Codecs
> return 1 for DOCS_ONLY.

+1

So freq() is undefined if you had passed FLAGS_NONE, and we will lie
and say freq=1 (need a test verifying this) if the field was indexed
as DOCS_ONLY.

> If in the future we'll need to handle the case where someone receives a
> DocsEnum which it needs to consume, and doesn't know which flags were used
> to open it, we can always add a getFlags to DE.

Yeah ...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Shai Erera <se...@gmail.com>.

Are you sure that all Codecs return 1 if you indexed with DOCS_ONLY? Do we
have a test that can trip bad Codecs?
It may be more than just changing the documentation...

Why would e.g. TermQuery need to write specialized code for these cases? I
looked at TermScorer, and its freq() just returns docsEnum.freq().

I think that Similarity may be affected? Which brings the question - how do
Similarity impls know what flags the DE was opened with, and shouldn't they
be specialized?
E.g. TFIDFSimilarity.ExactTFIDFDocScorer uses the freq passed to score() as
an index to an array, so clearly it assumes it is >= 0 and also <
scoreCache.length.
So I wonder what will happen to it when someone's Codec will return a
negative value or MAX_INT in case frequencies aren't needed?

I do realize that you shouldn't call Similarity with missing information,
and TermWeight obtains a DocsEnum with frequencies, so in that regard it is
safe.
And if you do obtain a DocsEnum with FLAG_NONE, you'd better know what
you're doing and don't pass a random freq() to Similarity.

I lean towards documenting the spec from above, and ensuring that all
Codecs return 1 for DOCS_ONLY.

If in the future we'll need to handle the case where someone receives a
DocsEnum which it needs to consume, and doesn't know which flags were used
to open it, we can always add a getFlags to DE.

Shai

On Mon, Dec 17, 2012 at 11:30 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera <se...@gmail.com> wrote:
> > How do these two go together?
> >
> >> I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
> >> don't know): lots of places would otherwise have to be special cased
> >> for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.
> >
> >
> > and
> >
> >> I'm also not sure that
> >> all codecs return 1 today if the fields was indexed with DOCS_ONLY ...
> >
> >
> >  That just makes it even worse right? I.e., we have code today that
> relies
> > no that behavior, but we're not sure it works w/ all Codecs?
>
> Sorry, for my last sentence above I think I meant "I'm also not sure
> that all codecs return 1 today if you ask for FLAG_NONE".
>
> > Remember that DocIdSetIterator.nextDoc() was loosely specified? It was
> very
> > hard to write a decent DISI consumer. Sometimes calling nextDoc()
> returned
> > MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
> > actually made consumers' life easier, I think?
>
> Right, locking down the API makes total sense in general.
>
> > It's ok if we say that for DOCS_ONLY you have to return 1. That's even
> 99.9%
> > of the time the correct value to return (unless someone adds e.g. the
> same
> > StringField twice to the document).
>
> Right.
>
> > And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
> > unspecified. I think it would be wrong to "lie" here .. not sure if the
> > consumer always knows how DocsEnum was requested. Not sure if this
> happens
> > in real life though (consuming a DocsEnum that you didn't obtain
> yourself),
> > so I'm willing to ignore that case.
>
> +1: I think FLAG_NONE should remain undefined.  I think we have codecs
> today that will return 1, 0, the actual doc freq (when the field was
> indexed as >= DOCS_AND_FREQS).
>
> > These two together sound like a reasonable "spec" to me?
>
> +1
>
> So I think just change your javadocs patch to say that FLAG_NONE means
> freq is not defined, and if field was indexed as DOCS_ONLY and you
> asked for FLAG_FREQS then we promise to lie and say freq=1?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: DocsEnum.freq()

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, Dec 17, 2012 at 4:02 PM, Shai Erera <se...@gmail.com> wrote:
> How do these two go together?
>
>> I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
>> don't know): lots of places would otherwise have to be special cased
>> for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.
>
>
> and
>
>> I'm also not sure that
>> all codecs return 1 today if the fields was indexed with DOCS_ONLY ...
>
>
>  That just makes it even worse right? I.e., we have code today that relies
> no that behavior, but we're not sure it works w/ all Codecs?

Sorry, for my last sentence above I think I meant "I'm also not sure
that all codecs return 1 today if you ask for FLAG_NONE".

> Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very
> hard to write a decent DISI consumer. Sometimes calling nextDoc() returned
> MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
> actually made consumers' life easier, I think?

Right, locking down the API makes total sense in general.

> It's ok if we say that for DOCS_ONLY you have to return 1. That's even 99.9%
> of the time the correct value to return (unless someone adds e.g. the same
> StringField twice to the document).

Right.

> And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
> unspecified. I think it would be wrong to "lie" here .. not sure if the
> consumer always knows how DocsEnum was requested. Not sure if this happens
> in real life though (consuming a DocsEnum that you didn't obtain yourself),
> so I'm willing to ignore that case.

+1: I think FLAG_NONE should remain undefined.  I think we have codecs
today that will return 1, 0, the actual doc freq (when the field was
indexed as >= DOCS_AND_FREQS).

> These two together sound like a reasonable "spec" to me?

+1

So I think just change your javadocs patch to say that FLAG_NONE means
freq is not defined, and if field was indexed as DOCS_ONLY and you
asked for FLAG_FREQS then we promise to lie and say freq=1?

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Shai Erera <se...@gmail.com>.

How do these two go together?

I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
> don't know): lots of places would otherwise have to be special cased
> for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.


and

I'm also not sure that
> all codecs return 1 today if the fields was indexed with DOCS_ONLY ...


 That just makes it even worse right? I.e., we have code today that relies
no that behavior, but we're not sure it works w/ all Codecs?

Remember that DocIdSetIterator.nextDoc() was loosely specified? It was very
hard to write a decent DISI consumer. Sometimes calling nextDoc() returned
MAX_VAL, sometimes -1, sometimes who knows. When we hardened the spec, it
actually made consumers' life easier, I think?

It's ok if we say that for DOCS_ONLY you have to return 1. That's even
99.9% of the time the correct value to return (unless someone adds e.g. the
same StringField twice to the document).

And it's also ok to say that if you passed FLAG_NONE, freq()'s value is
unspecified. I think it would be wrong to "lie" here .. not sure if the
consumer always knows how DocsEnum was requested. Not sure if this happens
in real life though (consuming a DocsEnum that you didn't obtain yourself),
so I'm willing to ignore that case.

These two together sound like a reasonable "spec" to me?

Shai


On Mon, Dec 17, 2012 at 7:16 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I think the FLAG_NONE ("I don't need/want freqs when reading the
> index") and the DOCS_ONLY ("Do not index freqs") are two different
> cases?
>
> I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
> don't know): lots of places would otherwise have to be special cased
> for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.
>
> But, for FLAG_NONE, when the caller passes this it means they have no
> intention of using/calling freq() right?  Eg
> MultiTermQueryWrapperFilter would pass this.  For that case I'm not
> sure we should promise / require that codecs return 1 always?  EG what
> if the index does has freqs?  I think in that case the codec shouldn't
> be required to go out of its way and return 1?  I'm also not sure that
> all codecs return 1 today if the fields was indexed with DOCS_ONLY ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera <se...@gmail.com> wrote:
> > Hi
> >
> > While migrating code to Lucene 4.0, I noticed that I have an assert on a
> > field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got
> me
> > thinking ... why?
> >
> > If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
> > "lie" to the consumer? Rather, we could just return 0 or -1?
> >
> > I personally don't mind if we continue to return 1, if there's a real
> reason
> > to. I don't think that anyone should call freq() if he asked for DocsEnum
> > with FLAG_NONE. But if we do keep the current behavior, can we at least
> > document it?
> >
> > E.g., something like this patch:
> >
> > Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> > ===================================================================
> > --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> > 1422804)
> > +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> > copy)
> > @@ -47,10 +47,16 @@
> >    protected DocsEnum() {
> >    }
> >
> > -  /** Returns term frequency in the current document.  Do
> > -   *  not call this before {@link #nextDoc} is first called,
> > -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> > -   **/
> > +  /**
> > +   * Returns term frequency in the current document, or 1 if the
> > +   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
> > this
> > +   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
> > returns
> > +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> > +   *
> > +   * <p>
> > +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> > #FLAG_NONE},
> > +   * this method returns 1.
> > +   */
> >    public abstract int freq() throws IOException;
> >
> >    /** Returns the related attributes. */
> >
> > Shai
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: DocsEnum.freq()

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think the FLAG_NONE ("I don't need/want freqs when reading the
index") and the DOCS_ONLY ("Do not index freqs") are two different
cases?

I think for DOCS_ONLY it makes sense that we lie (say freq=1 when we
don't know): lots of places would otherwise have to be special cased
for when they consume DOCS_ONLY vs DOCS_AND_POSITIONS.

But, for FLAG_NONE, when the caller passes this it means they have no
intention of using/calling freq() right?  Eg
MultiTermQueryWrapperFilter would pass this.  For that case I'm not
sure we should promise / require that codecs return 1 always?  EG what
if the index does has freqs?  I think in that case the codec shouldn't
be required to go out of its way and return 1?  I'm also not sure that
all codecs return 1 today if the fields was indexed with DOCS_ONLY ...

Mike McCandless

http://blog.mikemccandless.com

On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> While migrating code to Lucene 4.0, I noticed that I have an assert on a
> field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
> thinking ... why?
>
> If you index w/ DOCS_ONLY, or ask for DocsEnum with FLAG_NONE, why do we
> "lie" to the consumer? Rather, we could just return 0 or -1?
>
> I personally don't mind if we continue to return 1, if there's a real reason
> to. I don't think that anyone should call freq() if he asked for DocsEnum
> with FLAG_NONE. But if we do keep the current behavior, can we at least
> document it?
>
> E.g., something like this patch:
>
> Index: lucene/core/src/java/org/apache/lucene/index/DocsEnum.java
> ===================================================================
> --- lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (revision
> 1422804)
> +++ lucene/core/src/java/org/apache/lucene/index/DocsEnum.java  (working
> copy)
> @@ -47,10 +47,16 @@
>    protected DocsEnum() {
>    }
>
> -  /** Returns term frequency in the current document.  Do
> -   *  not call this before {@link #nextDoc} is first called,
> -   *  nor after {@link #nextDoc} returns NO_MORE_DOCS.
> -   **/
> +  /**
> +   * Returns term frequency in the current document, or 1 if the
> +   * {@link DocsEnum} was obtained with {@link #FLAG_NONE}. Do not call
> this
> +   * before {@link #nextDoc} is first called, nor after {@link #nextDoc}
> returns
> +   * {@link DocIdSetIterator#NO_MORE_DOCS}.
> +   *
> +   * <p>
> +   * <b>NOTE:</b> if the {@link DocsEnum} was obtain with {@link
> #FLAG_NONE},
> +   * this method returns 1.
> +   */
>    public abstract int freq() throws IOException;
>
>    /** Returns the related attributes. */
>
> Shai

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: DocsEnum.freq()

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Dec 17, 2012 at 11:24 AM, Shai Erera <se...@gmail.com> wrote:
> Hi
>
> While migrating code to Lucene 4.0, I noticed that I have an assert on a
> field that is indexed with DOCS_ONLY that DocsEnum.freq() == 1. This got me
> thinking ... why?

1 is pretty intuitive i think. i think of omitting freqs for things
like super-short fields where the frequency is expected to be 1
anyway.

I see your point about "lying", but if we change this, we have to
change all code using it like TermQuery to not call freqs if they are
unavailable: i think this makes the apis pretty difficult to use.

>
> But if we do keep the current behavior, can we at least
> document it?
>
> E.g., something like this patch:
>

+1 to the patch.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org