Posted to dev@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2009/08/06 16:14:15 UTC

Issue with Solr TokenFilter and the new TokenStream API

I think there is an issue here, but I didn't follow the TokenStream 
improvements very closely.

In Solr, CapitalizationFilterFactory has a CharArraySet that it loads 
up with keep words - it then checks (with the old TokenStream API) each 
token (char array) to see if it should keep it. I think that because of 
the cloning going on in next(), this breaks and you can't match anything 
in the keep set. Does that make sense?
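
A minimal sketch of the keep-word lookup being described, assuming 
Lucene's CharArraySet (the Solr factory keeps an equivalent set) and a 
made-up 16-char buffer; the set compares exactly "len" chars starting at 
"off", so an exact keep word still misses if the wrong length is passed:

import org.apache.lucene.analysis.CharArraySet;

public class KeepWordCheckSketch {
  public static void main(String[] args) {
    CharArraySet keep = new CharArraySet(10, false); // (initial size, ignoreCase)
    keep.add("the");

    char[] termBuffer = new char[16];     // token buffer with spare capacity
    "the".getChars(0, 3, termBuffer, 0);  // the token occupies the first 3 chars
    int termLength = 3;

    System.out.println(keep.contains(termBuffer, 0, termLength));        // true
    System.out.println(keep.contains(termBuffer, 0, termBuffer.length)); // false
  }
}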

-- 
- Mark

http://www.lucidimagination.com






Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Mark Miller <ma...@gmail.com>.
Thanks a lot, guys.

Uwe: that's why I was asking ;) I had no proof it was the TokenStream 
API; that just seemed a likely candidate. I'm not familiar with that 
filter, but it worked with a version of Lucene from right before the 
TokenStream improvements patch and then started failing after it.

When I traced what was going on, I saw:

In Set:false String:the
Set:[BIG, the, and, it]

Which is pretty weird - and why I asked here what it could be.

Thanks for looking into this, both of you!

Robert Muir wrote:
> that makes perfect sense
>
> On Thu, Aug 6, 2009 at 11:31 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
>   
>>> I have seen ur mail, but this bug should not be related to the new Token
>>> API, it should occur with old API, too.
>>>       
>> Maybe the problem is an unrelated change:
>> https://issues.apache.org/jira/browse/LUCENE-1762
>>
>> This issue changed the default length of the termBuffer in
>> Token/TermAttributeImpl. Because of the wrong calculation in the filter, the
>> larger default size could break this filter. When using termBufferLength
>> instead of termBuffer.length it should be fixed.
>>
>>
>>
>>
>>
>>     
>
>
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
That makes perfect sense.

On Thu, Aug 6, 2009 at 11:31 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
>> I have seen ur mail, but this bug should not be related to the new Token
>> API, it should occur with old API, too.
>
> Maybe the problem is an unrelated change:
> https://issues.apache.org/jira/browse/LUCENE-1762
>
> This issue changed the default length of the termBuffer in
> Token/TermAttributeImpl. Because of the wrong calculation in the filter, the
> larger default size could break this filter. When using termBufferLength
> instead of termBuffer.length it should be fixed.
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com



RE: Issue with Solr TokenFilter and the new TokenStream API

Posted by Uwe Schindler <uw...@thetaphi.de>.
> I have seen ur mail, but this bug should not be related to the new Token
> API, it should occur with old API, too.

Maybe the problem is an unrelated change:
https://issues.apache.org/jira/browse/LUCENE-1762

That issue changed the default length of the termBuffer in
Token/TermAttributeImpl. Because of the incorrect length calculation in
the filter, the larger default size could break it. Using
termBufferLength instead of termBuffer.length should fix it.
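
To illustrate (a sketch; the exact default capacities before and after 
the change don't matter here, only that the buffer's capacity and the 
term's length are different things):

import org.apache.lucene.analysis.Token;

public class BufferVsTermLength {
  public static void main(String[] args) {
    Token t = new Token();
    t.setTermBuffer("the");  // term length is 3

    // termLength() is the token's length; termBuffer().length is just the
    // backing array's capacity, which LUCENE-1762 made larger by default.
    System.out.println(t.termLength());         // 3
    System.out.println(t.termBuffer().length);  // >= 3, capacity only
  }
}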





Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
The bug does occur with the old API (some of the evaluations have an
incorrect length, but those tokens are not keep words).

It just doesn't happen to make any tests fail with the old jar file (I
guess termBufferLength happens to equal termBuffer.length for all the
tested keep words)...

On Thu, Aug 6, 2009 at 11:27 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
> I have seen ur mail, but this bug should not be related to the new Token
> API, it should occur with old API, too.
>
> I did not look very close into the implementations, I only checked who
> changes what in which way. And I see that there is only one Token instance
> with a termBuffer that is changed. No problem at all for the new API. It
> would even work with forcefully cloning Tokens inside CachingTokenFilter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Thursday, August 06, 2009 5:22 PM
>> To: java-dev@lucene.apache.org
>> Subject: Re: Issue with Solr TokenFilter and the new TokenStream API
>>
>> uwe look at the patch i pasted in haste (i have a delivery guy here,
>> sorry).
>>
>> the filter had a bug all along (it was using termBuffer.length for
>> some length calculations).
>>
>> On Thu, Aug 6, 2009 at 11:17 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
>> > I looked into the code of this Filter. It is very simple and should work
>> out
>> > of the box. There is no cloning done. When the indexer calls
>> incrementToken,
>> > the delegation to next(Token) does not clone at all. It just uses the
>> > encapsulated Token instance (inside the AttributeImpl TokenWrapper) as
>> > reusableToken and calls next(reusable) and then replaces the
>> encapsulated
>> > instance by the return value of next() -- so no cloning. As you do not
>> > change the token instance at all and return the reusable token it is all
>> > done on one Token/Attribute instance.
>> >
>> > In my opinion, this is the simpliest TokenFilter that could occur, it
>> just
>> > changes the contents of the buffer. By the way, this one could be easily
>> > rewritten to use incrementToken() without cloning, just use
>> > termAtt.setTermBuffer() and so on.
>> >
>> > Where do you see a problem, does it simply not work or do you think
>> there
>> > could be an issue?
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: uwe@thetaphi.de
>> >
>> >> -----Original Message-----
>> >> From: Mark Miller [mailto:markrmiller@gmail.com]
>> >> Sent: Thursday, August 06, 2009 4:14 PM
>> >> To: java-dev@lucene.apache.org
>> >> Subject: Issue with Solr TokenFilter and the new TokenStream API
>> >>
>> >> I think there is an issue here, but I didn't follow the TokenStream
>> >> improvements very closely.
>> >>
>> >> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
>> >> up with keep words - it then checks (with the old TokenStream API) each
>> >> token (char array) to see if it should keep it. I think because of the
>> >> cloning going on in next, this breaks and you can't match anything in
>> >> the keep set. Does that make sense?
>> >>
>> >> --
>> >> - Mark
>> >>
>> >> http://www.lucidimagination.com
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com



RE: Issue with Solr TokenFilter and the new TokenStream API

Posted by Uwe Schindler <uw...@thetaphi.de>.
I have seen your mail, but this bug should not be related to the new
Token API; it should occur with the old API, too.

I did not look very closely into the implementations; I only checked who
changes what in which way. And I see that there is only one Token
instance with a termBuffer that is changed. No problem at all for the
new API. It would even work with forceful cloning of Tokens inside
CachingTokenFilter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Robert Muir [mailto:rcmuir@gmail.com]
> Sent: Thursday, August 06, 2009 5:22 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Issue with Solr TokenFilter and the new TokenStream API
> 
> uwe look at the patch i pasted in haste (i have a delivery guy here,
> sorry).
> 
> the filter had a bug all along (it was using termBuffer.length for
> some length calculations).
> 
> On Thu, Aug 6, 2009 at 11:17 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
> > I looked into the code of this Filter. It is very simple and should work
> out
> > of the box. There is no cloning done. When the indexer calls
> incrementToken,
> > the delegation to next(Token) does not clone at all. It just uses the
> > encapsulated Token instance (inside the AttributeImpl TokenWrapper) as
> > reusableToken and calls next(reusable) and then replaces the
> encapsulated
> > instance by the return value of next() -- so no cloning. As you do not
> > change the token instance at all and return the reusable token it is all
> > done on one Token/Attribute instance.
> >
> > In my opinion, this is the simpliest TokenFilter that could occur, it
> just
> > changes the contents of the buffer. By the way, this one could be easily
> > rewritten to use incrementToken() without cloning, just use
> > termAtt.setTermBuffer() and so on.
> >
> > Where do you see a problem, does it simply not work or do you think
> there
> > could be an issue?
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >> -----Original Message-----
> >> From: Mark Miller [mailto:markrmiller@gmail.com]
> >> Sent: Thursday, August 06, 2009 4:14 PM
> >> To: java-dev@lucene.apache.org
> >> Subject: Issue with Solr TokenFilter and the new TokenStream API
> >>
> >> I think there is an issue here, but I didn't follow the TokenStream
> >> improvements very closely.
> >>
> >> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
> >> up with keep words - it then checks (with the old TokenStream API) each
> >> token (char array) to see if it should keep it. I think because of the
> >> cloning going on in next, this breaks and you can't match anything in
> >> the keep set. Does that make sense?
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> 
> 
> 
> --
> Robert Muir
> rcmuir@gmail.com
> 





Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
Uwe, look at the patch I pasted in haste (I have a delivery guy here,
sorry).

The filter had a bug all along: it was using termBuffer.length for some
length calculations.

On Thu, Aug 6, 2009 at 11:17 AM, Uwe Schindler<uw...@thetaphi.de> wrote:
> I looked into the code of this Filter. It is very simple and should work out
> of the box. There is no cloning done. When the indexer calls incrementToken,
> the delegation to next(Token) does not clone at all. It just uses the
> encapsulated Token instance (inside the AttributeImpl TokenWrapper) as
> reusableToken and calls next(reusable) and then replaces the encapsulated
> instance by the return value of next() -- so no cloning. As you do not
> change the token instance at all and return the reusable token it is all
> done on one Token/Attribute instance.
>
> In my opinion, this is the simpliest TokenFilter that could occur, it just
> changes the contents of the buffer. By the way, this one could be easily
> rewritten to use incrementToken() without cloning, just use
> termAtt.setTermBuffer() and so on.
>
> Where do you see a problem, does it simply not work or do you think there
> could be an issue?
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>> -----Original Message-----
>> From: Mark Miller [mailto:markrmiller@gmail.com]
>> Sent: Thursday, August 06, 2009 4:14 PM
>> To: java-dev@lucene.apache.org
>> Subject: Issue with Solr TokenFilter and the new TokenStream API
>>
>> I think there is an issue here, but I didn't follow the TokenStream
>> improvements very closely.
>>
>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
>> up with keep words - it then checks (with the old TokenStream API) each
>> token (char array) to see if it should keep it. I think because of the
>> cloning going on in next, this breaks and you can't match anything in
>> the keep set. Does that make sense?
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com



RE: Issue with Solr TokenFilter and the new TokenStream API

Posted by Uwe Schindler <uw...@thetaphi.de>.
I looked into the code of this filter. It is very simple and should work
out of the box. There is no cloning done. When the indexer calls
incrementToken, the delegation to next(Token) does not clone at all. It
just uses the encapsulated Token instance (inside the AttributeImpl
TokenWrapper) as the reusable token, calls next(reusableToken), and then
replaces the encapsulated instance with the return value of next() -- so
no cloning. As you do not change the token instance at all and return
the reusable token, it is all done on one Token/Attribute instance.

In my opinion, this is the simplest TokenFilter there could be; it just
changes the contents of the buffer. By the way, this one could easily be
rewritten to use incrementToken() without cloning - just use
termAtt.setTermBuffer() and so on.
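
A minimal sketch (not the committed code) of what such an 
incrementToken()-based rewrite might look like, modifying the term 
buffer in place through a TermAttribute; processWord(...) is the factory 
method from this thread and is assumed to be accessible here:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

final class CapitalizationFilterSketch extends TokenFilter {
  private final TermAttribute termAtt;
  private final CapitalizationFilterFactory factory;

  CapitalizationFilterSketch(TokenStream in, CapitalizationFilterFactory factory) {
    super(in);
    this.factory = factory;
    this.termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Work directly on the attribute's buffer: no Token cloning anywhere.
    // Simplified to treat the whole term as one word; the real filter
    // also splits on word boundaries and honors maxWordCount.
    factory.processWord(termAtt.termBuffer(), 0, termAtt.termLength(), 0);
    return true;
  }
}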

Where do you see a problem? Does it simply not work, or do you think
there could be an issue?

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Mark Miller [mailto:markrmiller@gmail.com]
> Sent: Thursday, August 06, 2009 4:14 PM
> To: java-dev@lucene.apache.org
> Subject: Issue with Solr TokenFilter and the new TokenStream API
> 
> I think there is an issue here, but I didn't follow the TokenStream
> improvements very closely.
> 
> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
> up with keep words - it then checks (with the old TokenStream API) each
> token (char array) to see if it should keep it. I think because of the
> cloning going on in next, this breaks and you can't match anything in
> the keep set. Does that make sense?
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 





RE: Issue with Solr TokenFilter and the new TokenStream API

Posted by Uwe Schindler <uw...@thetaphi.de>.
Thanks, we are always here to help :-)
 
> Test passes with this patch - thanks a lot Robert ! I was going to ask
> you to create a solr issue, but I see you already have, thanks!
> 
> No need to create a test I think - put in the new Lucene jars and it
> fails, so likely thats good enough. Though it is spooky that the test
> passed without the new jars

See LUCENE-1762; I think this problem comes from there. I would strongly
suggest creating a test case with better lists of terms of different
lengths and so on (see the sketch below).

> so perhaps a more targeted test is
> warranted after all.

More tests are always better :-) When I create some tests locally to
check something (even strange ones), I often simply add them to Lucene's
test cases.
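
A sketch of the kind of targeted test this suggests; the "keep" and 
"forceFirstLetter" parameter names and the expected tokens are 
assumptions that would need checking against the real factory:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import junit.framework.TestCase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.solr.analysis.CapitalizationFilterFactory;

public class CapitalizationKeepWordsTest extends TestCase {
  public void testKeepWordsOfDifferentLengths() throws Exception {
    Map<String, String> args = new HashMap<String, String>();
    args.put("keep", "and the it BIG");     // keep words of several lengths
    args.put("forceFirstLetter", "false");  // leave keep words fully alone
    CapitalizationFilterFactory factory = new CapitalizationFilterFactory();
    factory.init(args);

    TokenStream ts = factory.create(
        new WhitespaceTokenizer(new StringReader("the dog and it BIG")));
    TermAttribute termAtt = (TermAttribute) ts.addAttribute(TermAttribute.class);

    // Keep words should pass through unchanged; everything else gets
    // its first letter capitalized.
    String[] expected = {"the", "Dog", "and", "it", "BIG"};
    for (int i = 0; ts.incrementToken(); i++) {
      assertEquals(expected[i], termAtt.term());
    }
  }
}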

> - Mark
> 
> Robert Muir wrote:
> > Index:
> src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> > ===================================================================
> > --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> 	(revision
> > 778975)
> > +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> 	(working
> > copy)
> > @@ -209,7 +209,7 @@
> >          //make a backup in case we exceed the word count
> >          System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
> >        }
> > -      if (termBuffer.length < factory.maxTokenLength) {
> > +      if (termBufferLength < factory.maxTokenLength) {
> >          int wordCount = 0;
> >
> >          int lastWordStart = 0;
> > @@ -226,8 +226,8 @@
> >          }
> >
> >          // process the last word
> > -        if (lastWordStart < termBuffer.length) {
> > -          factory.processWord(termBuffer, lastWordStart,
> > termBuffer.length - lastWordStart, wordCount++);
> > +        if (lastWordStart < termBufferLength) {
> > +          factory.processWord(termBuffer, lastWordStart,
> > termBufferLength - lastWordStart, wordCount++);
> >          }
> >
> >          if (wordCount > factory.maxWordCount) {
> >
> >
> > On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rc...@gmail.com> wrote:
> >
> >> Mark, I looked at this and think it might be unrelated to tokenstreams.
> >>
> >> I think the length argument being provided to processWord(char[]
> >> buffer, int offset, int length, int wordCount) in that filter might be
> >> incorrectly calculated.
> >> This is the method that checks the keep list.
> >>
> >> (There is trailing trash on the end of tokens, even with the previous
> >> version of lucene in Solr).
> >> It just so happens the tokens with trailing trash were ones that were
> >> keep words in the previous version, so the test didnt fail.
> >>
> >> different tokens have trailing trash in the current version
> >> (specifically some of the "the" tokens), so its failing now.
> >>
> >>
> >> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<ma...@gmail.com>
> wrote:
> >>
> >>> I think there is an issue here, but I didn't follow the TokenStream
> >>> improvements very closely.
> >>>
> >>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
> up
> >>> with keep words - it then checks (with the old TokenStream API) each
> token
> >>> (char array) to see if it should keep it. I think because of the
> cloning
> >>> going on in next, this breaks and you can't match anything in the keep
> set.
> >>> Does that make sense?
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >> --
> >> Robert Muir
> >> rcmuir@gmail.com
> >>
> >>
> >
> >
> >
> >
> 
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 





Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
Mark, I agree it could use some more tests in the future, like many things :)

On Thu, Aug 6, 2009 at 11:52 AM, Mark Miller<ma...@gmail.com> wrote:
> Test passes with this patch - thanks a lot Robert ! I was going to ask you
> to create a solr issue, but I see you already have, thanks!
>
> No need to create a test I think - put in the new Lucene jars and it fails,
> so likely thats good enough. Though it is spooky that the test passed
> without the new jars, so perhaps a more targeted test is warranted after
> all.
>
> - Mark
>
> Robert Muir wrote:
>>
>> Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
>> ===================================================================
>> --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
>>  (revision
>> 778975)
>> +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
>>  (working
>> copy)
>> @@ -209,7 +209,7 @@
>>         //make a backup in case we exceed the word count
>>         System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
>>       }
>> -      if (termBuffer.length < factory.maxTokenLength) {
>> +      if (termBufferLength < factory.maxTokenLength) {
>>         int wordCount = 0;
>>
>>         int lastWordStart = 0;
>> @@ -226,8 +226,8 @@
>>         }
>>
>>         // process the last word
>> -        if (lastWordStart < termBuffer.length) {
>> -          factory.processWord(termBuffer, lastWordStart,
>> termBuffer.length - lastWordStart, wordCount++);
>> +        if (lastWordStart < termBufferLength) {
>> +          factory.processWord(termBuffer, lastWordStart,
>> termBufferLength - lastWordStart, wordCount++);
>>         }
>>
>>         if (wordCount > factory.maxWordCount) {
>>
>>
>> On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rc...@gmail.com> wrote:
>>
>>>
>>> Mark, I looked at this and think it might be unrelated to tokenstreams.
>>>
>>> I think the length argument being provided to processWord(char[]
>>> buffer, int offset, int length, int wordCount) in that filter might be
>>> incorrectly calculated.
>>> This is the method that checks the keep list.
>>>
>>> (There is trailing trash on the end of tokens, even with the previous
>>> version of lucene in Solr).
>>> It just so happens the tokens with trailing trash were ones that were
>>> keep words in the previous version, so the test didnt fail.
>>>
>>> different tokens have trailing trash in the current version
>>> (specifically some of the "the" tokens), so its failing now.
>>>
>>>
>>> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<ma...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> I think there is an issue here, but I didn't follow the TokenStream
>>>> improvements very closely.
>>>>
>>>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads
>>>> up
>>>> with keep words - it then checks (with the old TokenStream API) each
>>>> token
>>>> (char array) to see if it should keep it. I think because of the cloning
>>>> going on in next, this breaks and you can't match anything in the keep
>>>> set.
>>>> Does that make sense?
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://www.lucidimagination.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Robert Muir
>>> rcmuir@gmail.com
>>>
>>>
>>
>>
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com



Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Mark Miller <ma...@gmail.com>.
The test passes with this patch - thanks a lot, Robert! I was going to 
ask you to create a Solr issue, but I see you already have - thanks!

No need to create a test, I think - put in the new Lucene jars and it 
fails, so that's likely good enough. Though it is spooky that the test 
passed without the new jars, so perhaps a more targeted test is 
warranted after all.

- Mark

Robert Muir wrote:
> Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
> ===================================================================
> --- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(revision
> 778975)
> +++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(working
> copy)
> @@ -209,7 +209,7 @@
>          //make a backup in case we exceed the word count
>          System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
>        }
> -      if (termBuffer.length < factory.maxTokenLength) {
> +      if (termBufferLength < factory.maxTokenLength) {
>          int wordCount = 0;
>
>          int lastWordStart = 0;
> @@ -226,8 +226,8 @@
>          }
>
>          // process the last word
> -        if (lastWordStart < termBuffer.length) {
> -          factory.processWord(termBuffer, lastWordStart,
> termBuffer.length - lastWordStart, wordCount++);
> +        if (lastWordStart < termBufferLength) {
> +          factory.processWord(termBuffer, lastWordStart,
> termBufferLength - lastWordStart, wordCount++);
>          }
>
>          if (wordCount > factory.maxWordCount) {
>
>
> On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rc...@gmail.com> wrote:
>   
>> Mark, I looked at this and think it might be unrelated to tokenstreams.
>>
>> I think the length argument being provided to processWord(char[]
>> buffer, int offset, int length, int wordCount) in that filter might be
>> incorrectly calculated.
>> This is the method that checks the keep list.
>>
>> (There is trailing trash on the end of tokens, even with the previous
>> version of lucene in Solr).
>> It just so happens the tokens with trailing trash were ones that were
>> keep words in the previous version, so the test didnt fail.
>>
>> different tokens have trailing trash in the current version
>> (specifically some of the "the" tokens), so its failing now.
>>
>>
>> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<ma...@gmail.com> wrote:
>>     
>>> I think there is an issue here, but I didn't follow the TokenStream
>>> improvements very closely.
>>>
>>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
>>> with keep words - it then checks (with the old TokenStream API) each token
>>> (char array) to see if it should keep it. I think because of the cloning
>>> going on in next, this breaks and you can't match anything in the keep set.
>>> Does that make sense?
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>>
>>>
>>>       
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>     
>
>
>
>   


-- 
- Mark

http://www.lucidimagination.com






Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
===================================================================
--- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(revision 778975)
+++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java	(working copy)
@@ -209,7 +209,7 @@
         //make a backup in case we exceed the word count
         System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
       }
-      if (termBuffer.length < factory.maxTokenLength) {
+      if (termBufferLength < factory.maxTokenLength) {
         int wordCount = 0;

         int lastWordStart = 0;
@@ -226,8 +226,8 @@
         }

         // process the last word
-        if (lastWordStart < termBuffer.length) {
-          factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
+        if (lastWordStart < termBufferLength) {
+          factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
         }

         if (wordCount > factory.maxWordCount) {


On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rc...@gmail.com> wrote:
> Mark, I looked at this and think it might be unrelated to tokenstreams.
>
> I think the length argument being provided to processWord(char[]
> buffer, int offset, int length, int wordCount) in that filter might be
> incorrectly calculated.
> This is the method that checks the keep list.
>
> (There is trailing trash on the end of tokens, even with the previous
> version of lucene in Solr).
> It just so happens the tokens with trailing trash were ones that were
> keep words in the previous version, so the test didnt fail.
>
> different tokens have trailing trash in the current version
> (specifically some of the "the" tokens), so its failing now.
>
>
> On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<ma...@gmail.com> wrote:
>> I think there is an issue here, but I didn't follow the TokenStream
>> improvements very closely.
>>
>> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
>> with keep words - it then checks (with the old TokenStream API) each token
>> (char array) to see if it should keep it. I think because of the cloning
>> going on in next, this breaks and you can't match anything in the keep set.
>> Does that make sense?
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>



-- 
Robert Muir
rcmuir@gmail.com



Re: Issue with Solr TokenFilter and the new TokenStream API

Posted by Robert Muir <rc...@gmail.com>.
Mark, I looked at this and think it might be unrelated to tokenstreams.

I think the length argument being provided to processWord(char[]
buffer, int offset, int length, int wordCount) in that filter might be
incorrectly calculated. This is the method that checks the keep list.

(There is trailing trash on the end of tokens, even with the previous
version of Lucene in Solr.) It just so happens the tokens with trailing
trash were ones that were keep words in the previous version, so the
test didn't fail.

Different tokens have trailing trash in the current version
(specifically some of the "the" tokens), so it's failing now.
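
The trailing trash is easy to picture (a sketch with an assumed 16-char 
buffer; everything past the term length is leftover garbage, not part of 
the token):

public class TrailingTrashSketch {
  public static void main(String[] args) {
    char[] termBuffer = new char[16];     // reused buffer with spare capacity
    "the".getChars(0, 3, termBuffer, 0);  // the real token is 3 chars
    int termBufferLength = 3;

    // Correct: slice by the token's length.
    System.out.println(new String(termBuffer, 0, termBufferLength));  // "the"
    // Buggy: slicing by the capacity drags in 13 junk chars after "the".
    System.out.println(new String(termBuffer, 0, termBuffer.length));
  }
}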


On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<ma...@gmail.com> wrote:
> I think there is an issue here, but I didn't follow the TokenStream
> improvements very closely.
>
> In Solr, CapitalizationFilterFactory has a CharArray set that it loads up
> with keep words - it then checks (with the old TokenStream API) each token
> (char array) to see if it should keep it. I think because of the cloning
> going on in next, this breaks and you can't match anything in the keep set.
> Does that make sense?
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>



-- 
Robert Muir
rcmuir@gmail.com
