Posted to dev@lucene.apache.org by Bruce Ritchie <br...@jivesoftware.com> on 2003/09/09 01:31:46 UTC

unexpected behavior with reader.terms(term) not following contract

All,

I've been investigating a possible improvement to the DateFilter and have run into an issue I 
believe is a bug with Lucene 1.3 RC1.

Synopsis:

I'm trying to add a clause to the bits(IndexReader reader) method of the DateFilter class that
eliminates a compareTo() test and improves performance. This should be safe whenever the DateFilter
has a start date but no end date, since the contract of reader.terms(term) says each term in the
enumeration is greater than all that precede it. Thus, if the end date is equal to
DateField.MAX_DATE_STRING(), we should be able to skip the "while (enum.term().compareTo(stop) <= 0)"
test and improve the performance of the filter on a large document set.

/** Returns an enumeration of all terms after a given term.
     The enumeration is ordered by Term.compareTo().  Each term
     is greater than all that precede it in the enumeration.
    */
public abstract TermEnum terms(Term t) throws IOException;


Problem:

The above contract does not seem to hold in my testing. The modified DateFilter.bits(..) method
attached seems to show that enum.next() will indeed return a term that is less than all terms
preceding it in the enumeration.

With my current index I create a DateFilter via filter = DateFilter.After("creationDate",
afterDate); where afterDate is set to Sept 07 00:00:00 EDT 2003.

The output from my debugging statement is as follows:

setting bit enabled for doc 466305, date Sun Sep 07 00:00:02 EDT 2003, term text was 0dkaji5zk
setting bit enabled for doc 466306, date Sun Sep 07 00:00:05 EDT 2003, term text was 0dkaji8aw
setting bit enabled for doc 466620, date Sun Sep 07 00:00:13 EDT 2003, term text was 0dkajieh4
setting bit enabled for doc 472854, date Sun Sep 07 00:00:15 EDT 2003, term text was 0dkajig0o
setting bit enabled for doc 472855, date Sun Sep 07 00:00:27 EDT 2003, term text was 0dkajipa0
setting bit enabled for doc 467844, date Sun Sep 07 00:00:58 EDT 2003, term text was 0dkajjd74
<snipped for brevity>
setting bit enabled for doc 474111, date Sun Sep 07 17:37:52 EDT 2003, term text was 0dkblajr4
setting bit enabled for doc 474112, date Sun Sep 07 17:38:01 EDT 2003, term text was 0dkblaqp4
setting bit enabled for doc 474044, date Sun Sep 07 17:38:09 EDT 2003, term text was 0dkblawvc
setting bit enabled for doc 474091, date Sun Sep 07 18:00:57 EDT 2003, term text was 0dkbm48fr
setting bit enabled for doc 84, date Wed Dec 31 19:00:00 EST 1969, term text was 10
setting bit enabled for doc 85, date Wed Dec 31 19:00:00 EST 1969, term text was 10
setting bit enabled for doc 86, date Wed Dec 31 19:00:00 EST 1969, term text was 10
<and so on and so forth>


From the above debug logging you can see that enum.next() has advanced to a term with a text of
'10'. While this is greater than the preceding text according to String.compareTo(), I'm uncertain
where the '10' text is coming from. As an example, document #86 returns the following in another
search:

setting bit enabled for doc 86, date Fri Jul 11 17:08:43 EDT 2003, term text was 0di0opnjs
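
The term texts in the log can be decoded by hand. The following is a minimal sketch, assuming DateField stores a date as its millisecond timestamp written in base 36 (an assumption about the encoding, though it is consistent with the log lines above). DateTermDecoding and decodeMillis are hypothetical names for illustration, not Lucene classes.

```java
import java.util.Date;

// Hypothetical helper, not part of Lucene.
final class DateTermDecoding {
    // Assumed encoding: milliseconds since the epoch, written in base 36
    // (Character.MAX_RADIX). Leading zeros are padding and do not change the value.
    static long decodeMillis(String termText) {
        return Long.parseLong(termText, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        // One genuine date term from the log, and the suspicious '10' term.
        for (String text : new String[] {"0dkaji5zk", "10"}) {
            long millis = decodeMillis(text);
            // The printed Date depends on the JVM's default time zone.
            System.out.println(text + " -> " + millis + " ms -> " + new Date(millis));
        }
    }
}
```

Under that assumption, 0dkaji5zk decodes to 1062907202000 ms, which is 04:00:02 UTC on 7 September 2003 and matches the 00:00:02 EDT in the log. The text 10 decodes to 36 ms, i.e. essentially the epoch, which is why it prints as Dec 31 19:00:00 EST 1969. That suggests '10' is not a date value at all but was merely parsed as one.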

If someone could point me in the right direction or isolate the bug, it would be
appreciated.



Regards,

Bruce Ritchie

Re: unexpected behavior with reader.terms(term) not following contract

Posted by Doug Cutting <cu...@lucene.com>.
Bruce Ritchie wrote:
> TermEnum enum = reader.terms(new Term(field, start));
> 
> It is indeed possible that the '10' value could be text from another field.
> 
> I must say that, as a user of the API, I would (obviously) be surprised
> if the above API call returned any results outside the specified field.
> If that is indeed the case, it is almost certainly the cause of the
> issue. I'll add some more logging to confirm.

Sorry for the surprise!  A TermEnum enumerates all of the terms in an 
index, ordered first by field, then, within that, by text.  It can be 
repositioned randomly, but that repositioning does not constrain the 
enumeration to the field of the term last repositioned to.  So, if you 
wish to only enumerate the terms of a particular field, you need to 
check that in your loop.  Note that field names are interned strings, so 
you can use '==' instead of 'String.equals(String)', making this test 
quite fast.

Doug
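
A minimal sketch of the check described above, written against a small in-memory model rather than Lucene itself. ModelTerm, FieldBoundedEnumeration, and the "forumID" field are hypothetical names introduced for illustration; the only thing taken from the thread is the ordering rule (first by field, then by text) and the idea of comparing interned field names with '=='.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a term: ordered by field first, then by text.
final class ModelTerm implements Comparable<ModelTerm> {
    final String field;
    final String text;

    ModelTerm(String field, String text) {
        this.field = field.intern(); // interned, so identity comparison is valid
        this.text = text;
    }

    @Override
    public int compareTo(ModelTerm other) {
        int byField = field.compareTo(other.field);
        return byField != 0 ? byField : text.compareTo(other.text);
    }
}

final class FieldBoundedEnumeration {
    // Collects the texts of all terms >= (field, startText) that still belong to `field`.
    static List<String> textsFrom(TreeSet<ModelTerm> index, String field, String startText) {
        String wanted = field.intern();
        List<String> texts = new ArrayList<>();
        for (ModelTerm term : index.tailSet(new ModelTerm(wanted, startText), true)) {
            if (term.field != wanted) { // '==' style check, safe because both are interned
                break;                  // the enumeration has moved on to another field
            }
            texts.add(term.text);
        }
        return texts;
    }

    // Three date terms from the thread plus one term in a made-up second field.
    static TreeSet<ModelTerm> sampleIndex() {
        TreeSet<ModelTerm> index = new TreeSet<>();
        index.add(new ModelTerm("creationDate", "0di0opnjs"));
        index.add(new ModelTerm("creationDate", "0dkaji5zk"));
        index.add(new ModelTerm("creationDate", "0dkbm48fr"));
        index.add(new ModelTerm("forumID", "10")); // hypothetical field holding the '10' text
        return index;
    }

    public static void main(String[] args) {
        // prints [0dkaji5zk, 0dkbm48fr]
        System.out.println(textsFrom(sampleIndex(), "creationDate", "0dkaji5zk"));
    }
}
```

Without the field check, the loop would run on into the "forumID" terms and pick up '10', which mirrors the debug output earlier in the thread. The open-ended case can therefore drop the compareTo() test against a stop term, but it still needs the field test to know when to stop.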


Re: unexpected behavior with reader.terms(term) not following contract

Posted by Bruce Ritchie <br...@jivesoftware.com>.
Doug Cutting wrote:
>> From the above debug logging you can see that enum.next() has
>> returned a TermEnum with a text of '10'. While this is logically
>> greater than or equal to the preceding text according to
>> String.compareTo(), I'm uncertain as to where the '10' text is coming
>> from. As an example, document #86 returns in another search the
>> following:
> 
> 
> Could this be the text from a term in a different field?  From a quick 
> glance, it doesn't look like you're checking that your enumeration stays 
> within the field.

TermEnum enum = reader.terms(new Term(field, start));

It is indeed possible that the '10' value could be text from another field.

I must say that, as a user of the API, I would (obviously) be surprised if the above API call
returned any results outside the specified field. If that is indeed the case, it is almost certainly
the cause of the issue. I'll add some more logging to confirm.


Regards,

Bruce Ritchie

Re: unexpected behavior with reader.terms(term) not following contract

Posted by Doug Cutting <cu...@lucene.com>.
Bruce Ritchie wrote:
> From the above debug logging you can see that enum.next() has returned
> a TermEnum with a text of '10'. While this is logically greater than or
> equal to the preceding text according to String.compareTo(), I'm
> uncertain as to where the '10' text is coming from. As an example,
> document #86 returns in another search the following:

Could this be the text from a term in a different field?  From a quick 
glance, it doesn't look like you're checking that your enumeration stays 
within the field.

Doug


Re: unexpected behavior with reader.terms(term) not following contract

Posted by Bruce Ritchie <br...@jivesoftware.com>.
Otis Gospodnetic wrote:
> I can't remember exactly where the 'gibberish' text provided by
> enum.term().text() comes from anymore, and have no source code
> handy to check it for you.
> 
> However, the DateFilter modification that you are describing sounds
> acceptable.  Would it be possible for you to submit it as a patch (diff
> -uN)?  It would also be nice to provide a unit test for DateFilter, if
> we haven't got one already (I can't check now), in order to make sure
> that your modification doesn't break the existing behaviour.  Could you
> do that, please?

Once the issue causing the unexpected behavior has been resolved, I'll be glad to.

> Finally, before we apply your changes, we should see whether this
> modification really has any significant impact on the performance.  I
> assume you will test this with some large data sets yourself, so please
> share your results.

Of course. My initial testing was spurred by the fact that a search on my sample dataset (~450,000
usenet messages) took > 6 seconds if a date range was specified but < 250 ms if it wasn't. I've
worked around most of the issue by caching the bitset results of filters for reuse; however, the
initial query for each date range (yesterday, last week, last 90 days, etc.) still requires the 6+
seconds. Since the consumer view of our application only allows 'after' queries, the date filter
changes I was testing should improve performance. Once the issue/bug has been resolved I'll provide
sample timings for the before/after cases.
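
The caching workaround mentioned above could look roughly like the following. This is only a sketch under stated assumptions: FilterBitsCache, bitsFor, and the string keys are hypothetical, and the expensive term enumeration is replaced by a stand-in supplier so the example runs without an index.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical cache of filter results, keyed by a caller-chosen description of the filter.
final class FilterBitsCache {
    private final Map<String, BitSet> cache = new ConcurrentHashMap<>();

    // Returns the bits for `key`, running `compute` only if the key is not cached yet.
    BitSet bitsFor(String key, Supplier<BitSet> compute) {
        BitSet cached = cache.computeIfAbsent(key, k -> compute.get());
        return (BitSet) cached.clone(); // hand out a copy so callers cannot alter the cached bits
    }

    // Must be called whenever the underlying index changes, or stale bits will be served.
    void clear() {
        cache.clear();
    }

    public static void main(String[] args) {
        FilterBitsCache cache = new FilterBitsCache();
        int[] computations = {0};
        Supplier<BitSet> slowFilter = () -> {
            computations[0]++;           // stands in for the expensive term enumeration
            BitSet bits = new BitSet(8);
            bits.set(3);
            bits.set(5);
            return bits;
        };
        cache.bitsFor("creationDate:after:last-week", slowFilter);
        cache.bitsFor("creationDate:after:last-week", slowFilter);
        System.out.println("computed " + computations[0] + " time(s)"); // prints "computed 1 time(s)"
    }
}
```

Two caveats apply to any cache of this kind: the first lookup for each key still pays the full cost, as noted above, and a key such as "last week" describes a moving range, so entries would need to be expired or keyed by the resolved start date.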


Regards,

Bruce Ritchie

>>    public BitSet bits(IndexReader reader) throws IOException {
>>        BitSet bits = new BitSet(reader.maxDoc());
>>        TermEnum enum = reader.terms(new Term(field, start));
>>        TermDocs termDocs = reader.termDocs();
>>        if (enum.term() == null) {
>>            return bits;
>>        }
>>
>>        try {
>>            // we don't need to compare every term in this case
>>            // doing so is a waste of cycles
>>            if (end.equals(DateField.MAX_DATE_STRING())) {
>>                do {
>>                    termDocs.seek(enum.term());
>>                    try {
>>                        while (termDocs.next()) {
>>                            System.err.println("setting bit enabled for doc " + termDocs.doc()
>>                                + ", date " + DateField.stringToDate(enum.term().text())
>>                                + ", term text was " + enum.term().text());
>>                            bits.set(termDocs.doc());
>>                        }
>>                    }
>>                    finally {
>>                        termDocs.close();
>>                    }
>>                }
>>                while (enum.next());
>>            }
>>            else {
>>                Term stop = new Term(field, end);
>>                while (enum.term().compareTo(stop) <= 0) {
>>                    termDocs.seek(enum.term());
>>                    try {
>>                        while (termDocs.next()) {
>>                            bits.set(termDocs.doc());
>>                        }
>>                    }
>>                    finally {
>>                        termDocs.close();
>>                    }
>>
>>                    if (!enum.next()) {
>>                        break;
>>                    }
>>                }
>>            }
>>        }
>>        finally {
>>            enum.close();
>>        }
>>
>>        return bits;
>>    }


Re: unexpected behavior with reader.terms(term) not following contract

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Bruce,

I can't remember exactly where the 'gibberish' text provided by
enum.term().text() comes from anymore, and have no source code
handy to check it for you.

However, the DateFilter modification that you are describing sounds
acceptable.  Would it be possible for you to submit it as a patch (diff
-uN)?  It would also be nice to provide a unit test for DateFilter, if
we haven't got one already (I can't check now), in order to make sure
that your modification doesn't break the existing behaviour.  Could you
do that, please?

Finally, before we apply your changes, we should see whether this
modification really has any significant impact on the performance.  I
assume you will test this with some large data sets yourself, so please
share your results.

Thank you,
Otis




