Posted to dev@lucene.apache.org by Ryan Heinen <ry...@elasticpath.com> on 2006/11/15 03:05:02 UTC

Bug in LuceneDictionary?

Hello,

I believe that I may have discovered a bug in the spellchecker contrib, 
specifically the LuceneDictionary (or SpellChecker, depending on how you 
look at it) class.

I noticed while doing some testing in my own code that when I was 
running the indexDictionary method of the SpellChecker class it was 
always missing the first term (alphabetically) of the field that I 
specified.

I did some investigating, and believe that I have determined the cause 
of the issue.

When its getWordsIterator() method is invoked, LuceneDictionary 
instantiates a TermEnum by calling terms(new Term(field, "")) on the 
IndexReader that it is provided. (field = the name of the field supplied 
to the LuceneDictionary)

The LuceneDictionary.hasNext() method calls termEnum.next() to determine 
whether or not there are more terms left in the TermEnum.

Unfortunately, because terms(Term) returns a TermEnum that is already 
positioned on the first term greater than or equal to the supplied term, 
the first term of the field is already the current term of the TermEnum. 
Thus, because LuceneDictionary.hasNext() calls TermEnum.next() regardless 
of whether or not the first term has been read, loops that use the 
following structure, as the SpellChecker does, do not have the expected 
results:

while (iterator.hasNext()) {
	// obtain and do something with iterator.next();
}

With data "abc", "def", "ghi", "jkl" in the specified index & field, the 
loop will only execute 3 times, with "def", "ghi", "jkl" being the only 
values retrieved. One would expect that the loop should execute 4 times, 
with all four values ("abc", "def", "ghi", "jkl") showing up in the loop.
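
To make the skipped term easier to see without a real index, here is a 
minimal sketch that needs no Lucene at all. FakeTermEnum is a hypothetical 
stand-in of my own, not a Lucene class; the only thing it assumes is the 
behaviour described above, namely that the enumeration starts out already 
positioned on its first term. SkippingIterator mirrors the hasNext()/next() 
logic quoted at the end of this message, with the field check left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a TermEnum (not a Lucene class). Assumption:
// like the enum described above, it starts positioned on its first term.
class FakeTermEnum {
    private final List<String> terms;
    private int pos = 0; // already on the first term

    FakeTermEnum(List<String> terms) {
        this.terms = terms;
    }

    // Advances to the next term; returns false once the terms run out.
    boolean next() {
        pos++;
        return pos < terms.size();
    }

    // Returns the current term, or null once the terms run out.
    String term() {
        return pos < terms.size() ? terms.get(pos) : null;
    }
}

// Mirrors the quoted hasNext()/next() logic, minus the field check.
class SkippingIterator implements Iterator<String> {
    private final FakeTermEnum termEnum;
    private String actualTerm;
    private boolean hasNextCalled;

    SkippingIterator(FakeTermEnum termEnum) {
        this.termEnum = termEnum;
    }

    public boolean hasNext() {
        hasNextCalled = true;
        // Advances before the current (first) term has ever been read.
        if (!termEnum.next()) {
            actualTerm = null;
            return false;
        }
        actualTerm = termEnum.term();
        return true;
    }

    public String next() {
        if (!hasNextCalled) {
            hasNext();
        }
        hasNextCalled = false;
        return actualTerm;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

class SkipFirstDemo {
    // Runs the same while (hasNext()) loop shape that SpellChecker uses.
    static List<String> collect(List<String> terms) {
        List<String> seen = new ArrayList<String>();
        Iterator<String> it = new SkippingIterator(new FakeTermEnum(terms));
        while (it.hasNext()) {
            seen.add(it.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints [def, ghi, jkl]
        System.out.println(collect(Arrays.asList("abc", "def", "ghi", "jkl")));
    }
}
```

The first call to hasNext() moves the stand-in from "abc" to "def" before 
anything has been returned, which is the same pattern I am seeing against 
the real index.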

Has anyone encountered this problem before? Am I missing something, or 
should I report this as a bug?

As far as I see it, the LuceneIterator should not be calling the next() 
method of its underlying TermEnum unless the next() method of the 
LuceneIterator class is called.
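
A rough sketch of what I have in mind is below. It is not a patch against 
LuceneDictionary. PositionedTermEnum is a hypothetical stand-in that I am 
assuming starts on its first term and reports null from term() once it is 
exhausted, and the field check is left out to keep it short. The idea is 
simply that hasNext() only looks at the current term, while next() returns 
it and only then advances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a TermEnum (not a Lucene class). Assumptions:
// it starts on its first term, and term() is null once it is exhausted.
class PositionedTermEnum {
    private final List<String> terms;
    private int pos = 0;

    PositionedTermEnum(List<String> terms) {
        this.terms = terms;
    }

    boolean next() {
        pos++;
        return pos < terms.size();
    }

    String term() {
        return pos < terms.size() ? terms.get(pos) : null;
    }
}

// hasNext() has no side effects; only next() advances the enumeration.
class NonSkippingIterator implements Iterator<String> {
    private final PositionedTermEnum termEnum;

    NonSkippingIterator(PositionedTermEnum termEnum) {
        this.termEnum = termEnum;
    }

    public boolean hasNext() {
        // Only inspects the current term, so repeated calls are harmless.
        return termEnum.term() != null;
    }

    public String next() {
        String current = termEnum.term();
        if (current == null) {
            throw new NoSuchElementException();
        }
        termEnum.next(); // advance only after the current term is consumed
        return current;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

class NonSkippingDemo {
    static List<String> collect(List<String> terms) {
        List<String> seen = new ArrayList<String>();
        Iterator<String> it = new NonSkippingIterator(new PositionedTermEnum(terms));
        while (it.hasNext()) {
            seen.add(it.next());
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints [abc, def, ghi, jkl]
        System.out.println(collect(Arrays.asList("abc", "def", "ghi", "jkl")));
    }
}
```

A real fix would also have to keep the check that the current term still 
belongs to the requested field.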

Any advice would be appreciated. I've appended some code below.

Thanks,

Ryan

--------

Here are a few lines from SpellChecker.java showing how it uses 
LuceneDictionary's iterator:

Iterator iter=dict.getWordsIterator();
while (iter.hasNext()) {
       String word=(String) iter.next();
       ...
}

Below are the next() and hasNext() methods from LuceneDictionary.java

public Object next() {
  if (!has_next_called) {
    hasNext();
  }
  has_next_called = false;
  return (actualTerm != null) ? actualTerm.text() : null;
}

public boolean hasNext() {
  has_next_called = true;
  try {
    // if there is still words
    if (!termEnum.next()) {
      actualTerm = null;
      return false;
    }
    //  if the next word are in the field
    actualTerm = termEnum.term();
    String fieldt = actualTerm.field();
    if (fieldt != field) {
      actualTerm = null;
      return false;
    }
    return true;
  } catch (IOException ex) {
    ex.printStackTrace();
    return false;
  }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Bug in LuceneDictionary?

Posted by Ryan Heinen <ry...@elasticpath.com>.
Yonik Seeley wrote:
> Thanks for investigating this Ryan!
> Could you open a JIRA bug and maybe provide a patch? (and a testcase
> reproducing the problem would be great too).

Will do. I've been busy the last few days, but hopefully will get around 
to it soon.

Ryan



-- 
Ryan Heinen - Software Engineer
Elastic Path Software, Inc.

Phone   604 408 8078 ext 243
Fax     604 408 8079
E-mail  ryan.heinen@elasticpath.com
Web     http://www.elasticpath.com




Re: Bug in LuceneDictionary?

Posted by Yonik Seeley <yo...@apache.org>.
Thanks for investigating this Ryan!
Could you open a JIRA bug and maybe provide a patch? (and a testcase
reproducing the problem would be great too).

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

