You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Smiley, David W." <ds...@mitre.org> on 2011/01/04 23:38:11 UTC

Use of MultiFields.getFields() bad practice?

I'm looking through the trunk code on various implementations of Filter.getDocIdSet(IndexReader).  It is often needed to get an instance of Terms and then do other work from there.  Looking at MultiTermQueryWrapperFilter, the first set of lines to do this is:

  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    final Fields fields = MultiFields.getFields(reader);
    if (fields == null) {
      // reader has no fields
      return DocIdSet.EMPTY_DOCIDSET;
    }

    final Terms terms = fields.terms(query.field);
    if (terms == null) {
      // field does not exist
      return DocIdSet.EMPTY_DOCIDSET;
    }
....

When I look at the javadoc for MultiFields.getFields(reader), I see some Javadoc (apparently written by Michael McCandless, CC'ed), with the following javadoc snippet :  
   *  <p><b>NOTE</b>: this is a slow way to access postings.
   *  It's better to get the sub-readers (using {@link
   *  Gather}) and iterate through them
   *  yourself.

If this is the case, then why is MultiFields.getFields(reader) used 43 times across Lucene/Solr whereas ReaderUtil.Gather is only used 5 times?  If it's a TODO then perhaps a JIRA issue needs to be created.  I don't find helpful examples of how to use ReaderUtil.Gather... the existing 5 uses are all within MultiFields & ReaderUtil.

FWIW, in a Lucene Filter I wrote, I've been using this code snippet successfully:

	Terms terms = reader.fields().terms(fieldName);

On a related topic, I think that if Filter.getDocIdSet() is documented that it may return null, then it's better code design to consequently return null in appropriate circumstances instead of DocIdSet.EMPTY_DOCIDSET.  That said, FWIW, I prefer API design that favors non-null when you can get away with it, like this case.  So I'm in favor of making getDocIdSet() be documented to not return null (and follow through throughout the codebase).  Admittedly some callers might have short-circuit logic.

~ David Smiley
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Use of MultiFields.getFields() bad practice?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi David,

As all Filters and Scorers now work solely on segments, MultiFields is no
longer needed for those cases. Usage of MultiFields is not a problem here,
as all index readers passed into these parts of Lucene are all  atomic, so
MultiFields is a no-op.  There is as far as I know already an issue open to
remove that (see also https://issues.apache.org/jira/browse/LUCENE-2771 for
norms). But this is not important at the moment. So you can at all those
places simply replace that by the direct call to IndexReader. This was just
not yet done.

We currently have MultiFields in that for backwards compatibility (to
support executing a filter on a composite reader). But as trunk no longer
needs backwards compatibility, this is no longer needed but does not hurt,
it's currently just inconsistent.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Smiley, David W. [mailto:dsmiley@mitre.org]
> Sent: Tuesday, January 04, 2011 11:38 PM
> To: dev@lucene.apache.org Dev
> Cc: Michael McCandless
> Subject: Use of MultiFields.getFields() bad practice?
> 
> I'm looking through the trunk code on various implementations of
> Filter.getDocIdSet(IndexReader).  It is often needed to get an instance of
> Terms and then do other work from there.  Looking at
> MultiTermQueryWrapperFilter, the first set of lines to do this is:
> 
>   public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
>     final Fields fields = MultiFields.getFields(reader);
>     if (fields == null) {
>       // reader has no fields
>       return DocIdSet.EMPTY_DOCIDSET;
>     }
> 
>     final Terms terms = fields.terms(query.field);
>     if (terms == null) {
>       // field does not exist
>       return DocIdSet.EMPTY_DOCIDSET;
>     }
> ....
> 
> When I look at the javadoc for MultiFields.getFields(reader), I see some
> Javadoc (apparently written by Michael McCandless, CC'ed), with the
> following javadoc snippet :
>    *  <p><b>NOTE</b>: this is a slow way to access postings.
>    *  It's better to get the sub-readers (using {@link
>    *  Gather}) and iterate through them
>    *  yourself.
> 
> If this is the case, then why is MultiFields.getFields(reader) used 43
times
> across Lucene/Solr whereas ReaderUtil.Gather is only used 5 times?  If
it's a
> TODO then perhaps a JIRA issue needs to be created.  I don't find helpful
> examples of how to use ReaderUtil.Gather... the existing 5 uses are all
within
> MultiFields & ReaderUtil.
> 
> FWIW, in a Lucene Filter I wrote, I've been using this code snippet
> successfully:
> 
> 	Terms terms = reader.fields().terms(fieldName);
> 
> On a related topic, I think that if Filter.getDocIdSet() is documented
that it
> may return null, then it's better code design to consequently return null
in
> appropriate circumstances instead of DocIdSet.EMPTY_DOCIDSET.  That said,
> FWIW, I prefer API design that favors non-null when you can get away with
> it, like this case.  So I'm in favor of making getDocIdSet() be documented
to
> not return null (and follow through throughout the codebase).  Admittedly
> some callers might have short-circuit logic.
> 
> ~ David Smiley
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jan 4, 2011 at 5:38 PM, Smiley, David W. <ds...@mitre.org> wrote:
> If this is the case, then why is MultiFields.getFields(reader) used 43 times across Lucene/Solr whereas ReaderUtil.Gather is only used 5 times?  If it's a TODO then perhaps a JIRA issue needs to be created.  I don't find helpful examples of how to use ReaderUtil.Gather... the existing 5 uses are all within MultiFields & ReaderUtil.

because ReaderUtil.Gather isn't the only way to get per-segment
Fields... for example look at TermCollectingRewrite.

there was already a discussion to clean up the Multi* usage: see Uwe's
comment on https://issues.apache.org/jira/browse/LUCENE-2771

In general this has been happening though, for example only recently
did MultiTermQuery start working completely per-segment:
https://issues.apache.org/jira/browse/LUCENE-2690

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jan 5, 2011 at 2:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> Some other ideas:
> Maybe add an isEmpty() "hint" method to DocIdSet(Iterator). Empty DocIdSet
> would always return true. The problem, this method is costly for OpenBitSet.
> Maybe its just a "hint". Returning false is also OK when its empty. So if
> you have an docIdSet that’s empty and you can easily detect it, simply
> return true. The default impl returns false.

Hmm... feels like that's overly complicated?  Shouldn't we encourage
impls to just use the empty sentinel?

> Something else: EMPTY_DOCIDSETITER could be same instance EMPTY_SCORER ==
> EMPTY_DOCIDSETITER (and implemented as Scorer). You only have to add few
> methods to this empty instance.

I like this one!

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Use of MultiFields.getFields() bad practice?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Some other ideas:
Maybe add an isEmpty() "hint" method to DocIdSet(Iterator). Empty DocIdSet
would always return true. The problem, this method is costly for OpenBitSet.
Maybe its just a "hint". Returning false is also OK when its empty. So if
you have an docIdSet that’s empty and you can easily detect it, simply
return true. The default impl returns false.

Something else: EMPTY_DOCIDSETITER could be same instance EMPTY_SCORER ==
EMPTY_DOCIDSETITER (and implemented as Scorer). You only have to add few
methods to this empty instance.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Wednesday, January 05, 2011 8:12 PM
> To: dev@lucene.apache.org
> Subject: Re: Use of MultiFields.getFields() bad practice?
> 
> On Wed, Jan 5, 2011 at 1:35 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> 
> > Scorer is an iterator :-)
> 
> Aha!  You are right... I keep forgetting this about Scorer :)
> 
> >> In some cases the null return can make a difference in performance,
> >> eg if BQ is OR'ing two terms, but one of them yields a null scorer
> >> (matches no docs) then the scorer can [almost -- coord] rewrite to a
> >> TermScorer with just the other term vs BooleanScorer.  We don't do this
> today, but we should/could.
> >
> > In that case, BQ could simply do a check on scorer==EMPTY_SCORER to
> > achieve the same. As Scorer subclasses DocIdSetIterator, and is an
> > iterator, it should return something empty (see below).
> 
> I think that's a good idea?  Ie the contract would then be "if
Weight.scorer
> can determine up front that no docs can match, you should return
> EMPTY_SCORER since caller may optimize for that case".
> Ie, impls shouldn't return their own custom empty impl; they should return
> that specific instance.
> 
> Hmm are we ever gonna run into classloader hell, where we have more than
> one instance of this EMPTY_SCORER somehow...?  null is safer in that
regard
> since null is always null...
> 
> >> So my feeling is we have to take it case by case, and I think these
> >> two cases (Weight.scorer and Filter.getDocIdSet) should keep their
> >> current contract (null may be returned if no docs will match).
> >
> > But we should not force them to return either null or empty. So the
> > docs say
> > "*may* return null". They don’t need to. They can return an empty
> > scorer if they like.
> 
> Right -- I think "may return null" is the right contract here.
> 
> > BUT:
> >
> > I am just upset about such code:
> >
> >      final DocIdSet dis = filter.getDocIdSet(reader);
> >      if (dis == null)
> >       return null;
> >      final DocIdSetIterator disi = dis.iterator();
> >      if (disi == null)
> >        return null;
> >      return new ConstantScorer(similarity, disi, this);
> >
> > (this is what I have seen during my work for ConstantScoreQuery)
> 
> Alas the DIS.iterator now states null is a valid return... sigh.  I think
it should
> not... but dangerous to change that now.
> 
> But, even if we switch to the EMPTY_X sentinel, you'd still need these
ifs?  IE,
> we want CSQ to detect an "empty" filter and forward this on as an "empty
> scorer".
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jan 5, 2011 at 1:35 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Scorer is an iterator :-)

Aha!  You are right... I keep forgetting this about Scorer :)

>> In some cases the null return can make a difference in performance, eg if BQ
>> is OR'ing two terms, but one of them yields a null scorer (matches no docs)
>> then the scorer can [almost -- coord] rewrite to a TermScorer with just the
>> other term vs BooleanScorer.  We don't do this today, but we should/could.
>
> In that case, BQ could simply do a check on scorer==EMPTY_SCORER to achieve
> the same. As Scorer subclasses DocIdSetIterator, and is an iterator, it
> should return something empty (see below).

I think that's a good idea?  Ie the contract would then be "if
Weight.scorer can determine up front that no docs can match, you
should return EMPTY_SCORER since caller may optimize for that case".
Ie, impls shouldn't return their own custom empty impl; they should
return that specific instance.

Hmm are we ever gonna run into classloader hell, where we have more
than one instance of this EMPTY_SCORER somehow...?  null is safer in
that regard since null is always null...

>> So my feeling is we have to take it case by case, and I think these two cases
>> (Weight.scorer and Filter.getDocIdSet) should keep their current contract
>> (null may be returned if no docs will match).
>
> But we should not force them to return either null or empty. So the docs say
> "*may* return null". They don’t need to. They can return an empty scorer if
> they like.

Right -- I think "may return null" is the right contract here.

> BUT:
>
> I am just upset about such code:
>
>      final DocIdSet dis = filter.getDocIdSet(reader);
>      if (dis == null)
>       return null;
>      final DocIdSetIterator disi = dis.iterator();
>      if (disi == null)
>        return null;
>      return new ConstantScorer(similarity, disi, this);
>
> (this is what I have seen during my work for ConstantScoreQuery)

Alas the DIS.iterator now states null is a valid return... sigh.  I
think it should not... but dangerous to change that now.

But, even if we switch to the EMPTY_X sentinel, you'd still need these
ifs?  IE, we want CSQ to detect an "empty" filter and forward this on
as an "empty scorer".

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jan 5, 2011 at 3:50 PM, Smiley, David W. <ds...@mitre.org> wrote:
> On Jan 5, 2011, at 1:35 PM, Uwe Schindler wrote:
>
>> BUT:
>>
>> I am just upset about such code:
>>
>>      final DocIdSet dis = filter.getDocIdSet(reader);
>>      if (dis == null)
>>        return null;
>>      final DocIdSetIterator disi = dis.iterator();
>>      if (disi == null)
>>        return null;
>>      return new ConstantScorer(similarity, disi, this);
>>
>> (this is what I have seen during my work for ConstantScoreQuery)
>
> Exactly.  I can't stand such code either.  Null has its place but it is often avoidable.  Given that we're talking about trunk for a major version, I think it's definitely not too late.

But, I think even w/ the sentinels we're going to have to have ifs here...

> It would be awesome if we had @NotNull, @Nullable, (and various threadsafe ones!), and used FindBugs to validate various constraints.  There isn't yet a standard set in the JDK so some projects like Apache Http components have their own in their own package.  FindBugs ignores the package name (I know, I've checked).  We could do the same?  If this would be acceptable then I could create a patch.

This sounds awesome!  So it could catch us if sometimes we return null
from a @NotNull method?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by "Smiley, David W." <ds...@mitre.org>.
On Jan 5, 2011, at 1:35 PM, Uwe Schindler wrote:

> BUT:
> 
> I am just upset about such code:
> 
>      final DocIdSet dis = filter.getDocIdSet(reader);
>      if (dis == null)
>        return null;
>      final DocIdSetIterator disi = dis.iterator();
>      if (disi == null)
>        return null;
>      return new ConstantScorer(similarity, disi, this);
> 
> (this is what I have seen during my work for ConstantScoreQuery)

Exactly.  I can't stand such code either.  Null has its place but it is often avoidable.  Given that we're talking about trunk for a major version, I think it's definitely not too late.

It would be awesome if we had @NotNull, @Nullable, (and various threadsafe ones!), and used FindBugs to validate various constraints.  There isn't yet a standard set in the JDK so some projects like Apache Http components have their own in their own package.  FindBugs ignores the package name (I know, I've checked).  We could do the same?  If this would be acceptable then I could create a patch.

~ David Smiley
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Use of MultiFields.getFields() bad practice?

Posted by Uwe Schindler <uw...@thetaphi.de>.
> >> I do agree we should be returning null not EMPTY_DOCIDSET since
> >> Filter.getDocIDSet's jdoc clearly states a null return means no docs
> >> are accepted.
> >
> > I disagree. NULL is a stupid indicator, if something is empty it
> > should return something empty. I also don't like somebody returning
> > NULL instead of
> > Collections.emptySet() or like that. We should document Filter and
> > also Weight.scorer to always return non-null and also supply a
> > Scorer.EMPTY_SCORER. If you want to optimize something, you can always
> > change you loops to first do next() and then exit early.
> 
> I don't think it's so clear cut.  I think it depends on how the API is
used, and
> how advanced an API it is.
> 
> For example, Fields.terms("foobar") returns null if foobar is a
non-existent
> field.  I think that's appropriate, because to return a "fake instance"
loses
> information.  EG caller can no longer tell if a given field exists or not.

OK, I agree with that. If the field does not exist, it makes no sense to
return anything.

> In some cases the null return can make a difference in performance, eg if
BQ
> is OR'ing two terms, but one of them yields a null scorer (matches no
docs)
> then the scorer can [almost -- coord] rewrite to a TermScorer with just
the
> other term vs BooleanScorer.  We don't do this today, but we should/could.

In that case, BQ could simply do a check on scorer==EMPTY_SCORER to achieve
the same. As Scorer subclasses DocIdSetIterator, and is an iterator, it
should return something empty (see below).

> So my feeling is we have to take it case by case, and I think these two
cases
> (Weight.scorer and Filter.getDocIdSet) should keep their current contract
> (null may be returned if no docs will match).

But we should not force them to return either null or empty. So the docs say
"*may* return null". They don’t need to. They can return an empty scorer if
they like.

BUT:

I am just upset about such code:

      final DocIdSet dis = filter.getDocIdSet(reader);
      if (dis == null)
        return null;
      final DocIdSetIterator disi = dis.iterator();
      if (disi == null)
        return null;
      return new ConstantScorer(similarity, disi, this);

(this is what I have seen during my work for ConstantScoreQuery)

> Regardless, I think each API should clearly state the contract
unambiguously.
> Ie "this method never returns null", or, "this method will return null if
XYZ",
> or, "this method may return null if XYZ".
> 
> >> And I think I'm on the opposite side of the fence on null returns, at
> > least for
> >> advanced APIs -- I'd prefer not to hide information (by returning an
> >> empty
> >> instance) in this case.  But other cases I think should never return
> >> null
> > -- eg
> >> once you have a non-null DocIdSet, then its .iterator() should never
> > return
> >> null.
> >
> > I agree. DocIdSet.iterator() should return EMPTY_DOCIDSETITERATOR.
> 
> Right, for an iterator from any class I think it should never return null.

Scorer is an iterator :-)

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Wed, Jan 5, 2011 at 12:35 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
>> I do agree we should be returning null not EMPTY_DOCIDSET since
>> Filter.getDocIDSet's jdoc clearly states a null return means no docs are
>> accepted.
>
> I disagree. NULL is a stupid indicator, if something is empty it should
> return something empty. I also don't like somebody returning NULL instead of
> Collections.emptySet() or like that. We should document Filter and also
> Weight.scorer to always return non-null and also supply a
> Scorer.EMPTY_SCORER. If you want to optimize something, you can always
> change you loops to first do next() and then exit early.

I don't think it's so clear cut.  I think it depends on how the API is
used, and how advanced an API it is.

For example, Fields.terms("foobar") returns null if foobar is a
non-existent field.  I think that's appropriate, because to return a
"fake instance" loses information.  EG caller can no longer tell if a
given field exists or not.

In some cases the null return can make a difference in performance, eg
if BQ is OR'ing two terms, but one of them yields a null scorer
(matches no docs) then the scorer can [almost -- coord] rewrite to a
TermScorer with just the other term vs BooleanScorer.  We don't do
this today, but we should/could.

So my feeling is we have to take it case by case, and I think these
two cases (Weight.scorer and Filter.getDocIdSet) should keep their
current contract (null may be returned if no docs will match).

Regardless, I think each API should clearly state the contract
unambiguously.  Ie "this method never returns null", or, "this method
will return null if XYZ", or, "this method may return null if XYZ".

>> And I think I'm on the opposite side of the fence on null returns, at
> least for
>> advanced APIs -- I'd prefer not to hide information (by returning an empty
>> instance) in this case.  But other cases I think should never return null
> -- eg
>> once you have a non-null DocIdSet, then its .iterator() should never
> return
>> null.
>
> I agree. DocIdSet.iterator() should return EMPTY_DOCIDSETITERATOR.

Right, for an iterator from any class I think it should never return
null.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


RE: Use of MultiFields.getFields() bad practice?

Posted by Uwe Schindler <uw...@thetaphi.de>.
> I do agree we should be returning null not EMPTY_DOCIDSET since
> Filter.getDocIDSet's jdoc clearly states a null return means no docs are
> accepted.

I disagree. NULL is a stupid indicator, if something is empty it should
return something empty. I also don't like somebody returning NULL instead of
Collections.emptySet() or like that. We should document Filter and also
Weight.scorer to always return non-null and also supply a
Scorer.EMPTY_SCORER. If you want to optimize something, you can always
change you loops to first do next() and then exit early.

> And I think I'm on the opposite side of the fence on null returns, at
least for
> advanced APIs -- I'd prefer not to hide information (by returning an empty
> instance) in this case.  But other cases I think should never return null
-- eg
> once you have a non-null DocIdSet, then its .iterator() should never
return
> null.

I agree. DocIdSet.iterator() should return EMPTY_DOCIDSETITERATOR.

> Mike
> 
> On Tue, Jan 4, 2011 at 5:38 PM, Smiley, David W. <ds...@mitre.org>
> wrote:
> > I'm looking through the trunk code on various implementations of
> Filter.getDocIdSet(IndexReader).  It is often needed to get an instance of
> Terms and then do other work from there.  Looking at
> MultiTermQueryWrapperFilter, the first set of lines to do this is:
> >
> >  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
> >    final Fields fields = MultiFields.getFields(reader);
> >    if (fields == null) {
> >      // reader has no fields
> >      return DocIdSet.EMPTY_DOCIDSET;
> >    }
> >
> >    final Terms terms = fields.terms(query.field);
> >    if (terms == null) {
> >      // field does not exist
> >      return DocIdSet.EMPTY_DOCIDSET;
> >    }
> > ....
> >
> > When I look at the javadoc for MultiFields.getFields(reader), I see some
> Javadoc (apparently written by Michael McCandless, CC'ed), with the
> following javadoc snippet :
> >   *  <p><b>NOTE</b>: this is a slow way to access postings.
> >   *  It's better to get the sub-readers (using {@link
> >   *  Gather}) and iterate through them
> >   *  yourself.
> >
> > If this is the case, then why is MultiFields.getFields(reader) used 43
times
> across Lucene/Solr whereas ReaderUtil.Gather is only used 5 times?  If
it's a
> TODO then perhaps a JIRA issue needs to be created.  I don't find helpful
> examples of how to use ReaderUtil.Gather... the existing 5 uses are all
within
> MultiFields & ReaderUtil.
> >
> > FWIW, in a Lucene Filter I wrote, I've been using this code snippet
> successfully:
> >
> >        Terms terms = reader.fields().terms(fieldName);
> >
> > On a related topic, I think that if Filter.getDocIdSet() is documented
that it
> may return null, then it's better code design to consequently return null
in
> appropriate circumstances instead of DocIdSet.EMPTY_DOCIDSET.  That said,
> FWIW, I prefer API design that favors non-null when you can get away with
> it, like this case.  So I'm in favor of making getDocIdSet() be documented
to
> not return null (and follow through throughout the codebase).  Admittedly
> some callers might have short-circuit logic.
> >
> > ~ David Smiley
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional
> commands, e-mail: dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Use of MultiFields.getFields() bad practice?

Posted by Michael McCandless <lu...@mikemccandless.com>.
I just cleaned up a few more unnecessary ones.

Really, we need to break out atomic vs composite IndexReaders.
Effectively, we already have, it's just that it's dynamically typed
(you hit exc's at runtime) not statically typed.  I'd like to make it
statically typed so it's clear which APIs take what readers.  EG
Filter.getDocIDSet always takes an atomic reader.

I do agree we should be returning null not EMPTY_DOCIDSET since
Filter.getDocIDSet's jdoc clearly states a null return means no docs
are accepted.

And I think I'm on the opposite side of the fence on null returns, at
least for advanced APIs -- I'd prefer not to hide information (by
returning an empty instance) in this case.  But other cases I think
should never return null -- eg once you have a non-null DocIdSet, then
its .iterator() should never return null.

Mike

On Tue, Jan 4, 2011 at 5:38 PM, Smiley, David W. <ds...@mitre.org> wrote:
> I'm looking through the trunk code on various implementations of Filter.getDocIdSet(IndexReader).  It is often needed to get an instance of Terms and then do other work from there.  Looking at MultiTermQueryWrapperFilter, the first set of lines to do this is:
>
>  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
>    final Fields fields = MultiFields.getFields(reader);
>    if (fields == null) {
>      // reader has no fields
>      return DocIdSet.EMPTY_DOCIDSET;
>    }
>
>    final Terms terms = fields.terms(query.field);
>    if (terms == null) {
>      // field does not exist
>      return DocIdSet.EMPTY_DOCIDSET;
>    }
> ....
>
> When I look at the javadoc for MultiFields.getFields(reader), I see some Javadoc (apparently written by Michael McCandless, CC'ed), with the following javadoc snippet :
>   *  <p><b>NOTE</b>: this is a slow way to access postings.
>   *  It's better to get the sub-readers (using {@link
>   *  Gather}) and iterate through them
>   *  yourself.
>
> If this is the case, then why is MultiFields.getFields(reader) used 43 times across Lucene/Solr whereas ReaderUtil.Gather is only used 5 times?  If it's a TODO then perhaps a JIRA issue needs to be created.  I don't find helpful examples of how to use ReaderUtil.Gather... the existing 5 uses are all within MultiFields & ReaderUtil.
>
> FWIW, in a Lucene Filter I wrote, I've been using this code snippet successfully:
>
>        Terms terms = reader.fields().terms(fieldName);
>
> On a related topic, I think that if Filter.getDocIdSet() is documented that it may return null, then it's better code design to consequently return null in appropriate circumstances instead of DocIdSet.EMPTY_DOCIDSET.  That said, FWIW, I prefer API design that favors non-null when you can get away with it, like this case.  So I'm in favor of making getDocIdSet() be documented to not return null (and follow through throughout the codebase).  Admittedly some callers might have short-circuit logic.
>
> ~ David Smiley

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org