Posted to solr-user@lucene.apache.org by Benson Margulies <be...@basistech.com> on 2014/01/03 19:56:50 UTC

Tracking down the input that hits an analysis chain bug

Using Solr Cloud with 4.3.1.

We've got a problem with a tokenizer that manifests as calling
OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
figure out what input provokes our code into this pickle.

The problem happens on SolrCloud nodes.

The problem manifests as this sort of thing:

Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalArgumentException: startOffset must be
non-negative, and endOffset must be >= startOffset,
startOffset=-1811581632,endOffset=-1811581632

How could we get a document ID so that we can tell which document was being
processed?

Re: Tracking down the input that hits an analysis chain bug

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I think you do (or can) get a log message for each document insert?  If 
that's all you need, I think logging configuration will get you there.  
I use log4j and turn Solr's pretty verbose logging down using:

log4j.logger.org.apache.solr = WARN

Assuming the rest of log4j is set up correctly, I think you get the insert 
messages at INFO level?
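
A minimal log4j.properties sketch of that setup, assuming the stock Solr
log4j configuration is otherwise in place (LogUpdateProcessorFactory is
the standard Solr class that reports each add, with its document ids, at
INFO):

```properties
# Quiet Solr's general INFO chatter.
log4j.logger.org.apache.solr=WARN
# But keep the per-update log lines, which include each added document's id.
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessorFactory=INFO
```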

-Mike

On 1/4/2014 9:24 PM, Benson Margulies wrote:
> I rather assumed that there was some log4j-ish config to be set that
> would do this for me. Lacking one, I guess I'll end up there.
>
> On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
> <ms...@safaribooksonline.com> wrote:
>> Have you considered using a custom UpdateProcessor to catch the exception
>> and provide more context in the logs?
>>
>> -Mike
>>
>>
>> On 01/03/2014 03:33 PM, Benson Margulies wrote:
>>> Robert,
>>>
>>> Yes, if the problem was not data-dependent, indeed I wouldn't need to
>>> index anything. However, I've run a small mountain of data through our
>>> tokenizer on my machine, and never seen the error, but my customer
>>> gets these errors in the middle of a giant spew of data. As it
>>> happens, I _was_ missing that call to clearAttributes() (and the
>>> usual implementation of end()), but I found and fixed that problem
>>> precisely by creating a random data test case using checkRandomData().
>>> Unfortunately, fixing that didn't make the customer's errors go away.
>>>
>>> So I'm left needing to help them identify the data that provokes this,
>>> because I've so far failed to come up with any.
>>>
>>> --benson
>>>
>>>
>>> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>>>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>>>> index anything to reproduce it).
>>>>
>>>> Maybe you have a missing clearAttributes() call (your tokenizer
>>>> 'returns true' without calling that first)? This could explain it, if
>>>> something like a StopFilter is also present in the chain: basically
>>>> the offsets overflow.
>>>>
>>>> The test stuff in BaseTokenStreamTestCase should be able to detect
>>>> this as well...
>>>>
>>>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com>
>>>> wrote:
>>>>> Using Solr Cloud with 4.3.1.
>>>>>
>>>>> We've got a problem with a tokenizer that manifests as calling
>>>>> OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
>>>>> figure out what input provokes our code into this pickle.
>>>>>
>>>>> The problem happens on SolrCloud nodes.
>>>>>
>>>>> The problem manifests as this sort of thing:
>>>>>
>>>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>>>> non-negative, and endOffset must be >= startOffset,
>>>>> startOffset=-1811581632,endOffset=-1811581632
>>>>>
>>>>> How could we get a document ID so that we can tell which document was
>>>>> being
>>>>> processed?
>>


Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
I rather assumed that there was some log4j-ish config to be set that
would do this for me. Lacking one, I guess I'll end up there.

On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> Have you considered using a custom UpdateProcessor to catch the exception
> and provide more context in the logs?
>
> -Mike
>
>
> On 01/03/2014 03:33 PM, Benson Margulies wrote:
>>
>> Robert,
>>
>> Yes, if the problem was not data-dependent, indeed I wouldn't need to
>> index anything. However, I've run a small mountain of data through our
>> tokenizer on my machine, and never seen the error, but my customer
>> gets these errors in the middle of a giant spew of data. As it
>> happens, I _was_ missing that call to clearAttributes() (and the
>> usual implementation of end()), but I found and fixed that problem
>> precisely by creating a random data test case using checkRandomData().
>> Unfortunately, fixing that didn't make the customer's errors go away.
>>
>> So I'm left needing to help them identify the data that provokes this,
>> because I've so far failed to come up with any.
>>
>> --benson
>>
>>
>> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>>>
>>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>>> index anything to reproduce it).
>>>
>>> Maybe you have a missing clearAttributes() call (your tokenizer
>>> 'returns true' without calling that first)? This could explain it, if
>>> something like a StopFilter is also present in the chain: basically
>>> the offsets overflow.
>>>
>>> The test stuff in BaseTokenStreamTestCase should be able to detect
>>> this as well...
>>>
>>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com>
>>> wrote:
>>>>
>>>> Using Solr Cloud with 4.3.1.
>>>>
>>>> We've got a problem with a tokenizer that manifests as calling
>>>> OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
>>>> figure out what input provokes our code into this pickle.
>>>>
>>>> The problem happens on SolrCloud nodes.
>>>>
>>>> The problem manifests as this sort of thing:
>>>>
>>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>>> non-negative, and endOffset must be >= startOffset,
>>>> startOffset=-1811581632,endOffset=-1811581632
>>>>
>>>> How could we get a document ID so that we can tell which document was
>>>> being
>>>> processed?
>
>

Re: Tracking down the input that hits an analysis chain bug

Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Have you considered using a custom UpdateProcessor to catch the 
exception and provide more context in the logs?
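
A sketch of that idea, with Solr's types replaced by minimal hypothetical
stand-ins so the pattern is self-contained. A real plugin would extend
org.apache.solr.update.processor.UpdateRequestProcessor and override
processAdd(AddUpdateCommand), re-throwing with the document's uniqueKey:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical minimal stand-in for SolrInputDocument.
class Doc {
    final Map<String, Object> fields = new HashMap<>();
    Object get(String name) { return fields.get(name); }
}

// Wraps the rest of the update chain and re-throws analysis failures
// with the document id attached, so the log identifies the document.
class ContextLoggingProcessor {
    private final Consumer<Doc> next;  // stands in for super.processAdd(cmd)

    ContextLoggingProcessor(Consumer<Doc> next) { this.next = next; }

    void processAdd(Doc doc) {
        try {
            next.accept(doc);
        } catch (RuntimeException e) {
            throw new RuntimeException(
                "analysis failed for id=" + doc.get("id"), e);
        }
    }
}
```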

-Mike

On 01/03/2014 03:33 PM, Benson Margulies wrote:
> Robert,
>
> Yes, if the problem was not data-dependent, indeed I wouldn't need to
> index anything. However, I've run a small mountain of data through our
> tokenizer on my machine, and never seen the error, but my customer
> gets these errors in the middle of a giant spew of data. As it
> happens, I _was_ missing that call to clearAttributes() (and the
> usual implementation of end()), but I found and fixed that problem
> precisely by creating a random data test case using checkRandomData().
> Unfortunately, fixing that didn't make the customer's errors go away.
>
> So I'm left needing to help them identify the data that provokes this,
> because I've so far failed to come up with any.
>
> --benson
>
>
> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>> index anything to reproduce it).
>>
>> Maybe you have a missing clearAttributes() call (your tokenizer
>> 'returns true' without calling that first)? This could explain it, if
>> something like a StopFilter is also present in the chain: basically
>> the offsets overflow.
>>
>> The test stuff in BaseTokenStreamTestCase should be able to detect
>> this as well...
>>
>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
>>> Using Solr Cloud with 4.3.1.
>>>
>>> We've got a problem with a tokenizer that manifests as calling
>>> OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
>>> figure out what input provokes our code into this pickle.
>>>
>>> The problem happens on SolrCloud nodes.
>>>
>>> The problem manifests as this sort of thing:
>>>
>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>> non-negative, and endOffset must be >= startOffset,
>>> startOffset=-1811581632,endOffset=-1811581632
>>>
>>> How could we get a document ID so that we can tell which document was being
>>> processed?


Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
Robert,

Yes, if the problem was not data-dependent, indeed I wouldn't need to
index anything. However, I've run a small mountain of data through our
tokenizer on my machine, and never seen the error, but my customer
gets these errors in the middle of a giant spew of data. As it
happens, I _was_ missing that call to clearAttributes() (and the
usual implementation of end()), but I found and fixed that problem
precisely by creating a random data test case using checkRandomData().
Unfortunately, fixing that didn't make the customer's errors go away.

So I'm left needing to help them identify the data that provokes this,
because I've so far failed to come up with any.

--benson


On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
> This exception comes from OffsetAttributeImpl (i.e. you don't need to
> index anything to reproduce it).
>
> Maybe you have a missing clearAttributes() call (your tokenizer
> 'returns true' without calling that first)? This could explain it, if
> something like a StopFilter is also present in the chain: basically
> the offsets overflow.
>
> The test stuff in BaseTokenStreamTestCase should be able to detect
> this as well...
>
> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
>> Using Solr Cloud with 4.3.1.
>>
>> We've got a problem with a tokenizer that manifests as calling
>> OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
>> figure out what input provokes our code into this pickle.
>>
>> The problem happens on SolrCloud nodes.
>>
>> The problem manifests as this sort of thing:
>>
>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>> non-negative, and endOffset must be >= startOffset,
>> startOffset=-1811581632,endOffset=-1811581632
>>
>> How could we get a document ID so that we can tell which document was being
>> processed?

Re: Tracking down the input that hits an analysis chain bug

Posted by Robert Muir <rc...@gmail.com>.
This exception comes from OffsetAttributeImpl (i.e. you don't need to
index anything to reproduce it).

Maybe you have a missing clearAttributes() call (your tokenizer
'returns true' without calling that first)? This could explain it, if
something like a StopFilter is also present in the chain: basically
the offsets overflow.

The test stuff in BaseTokenStreamTestCase should be able to detect
this as well...
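
The wraparound Robert describes is plain int overflow. A toy sketch (the
numbers are arbitrary and this is not Lucene code) shows how an offset
that is never reset can end up hugely negative, much like the
startOffset=-1811581632 in the log:

```java
public class OffsetOverflowDemo {
    // Simulates a buggy tokenizer whose offset attribute is never reset:
    // the running offset keeps accumulating until int arithmetic wraps.
    static int runawayOffset(int tokens, int advancePerToken) {
        int offset = 0;
        for (int i = 0; i < tokens; i++) {
            offset += advancePerToken;  // wraps silently past Integer.MAX_VALUE
        }
        return offset;
    }

    public static void main(String[] args) {
        // 30_000 tokens * 100_000 chars per token = 3e9, past int range.
        System.out.println(runawayOffset(30_000, 100_000));
    }
}
```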

On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
> Using Solr Cloud with 4.3.1.
>
> We've got a problem with a tokenizer that manifests as calling
> OffsetAttribute.setOffset() with invalid inputs. OK, so we want to
> figure out what input provokes our code into this pickle.
>
> The problem happens on SolrCloud nodes.
>
> The problem manifests as this sort of thing:
>
> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.IllegalArgumentException: startOffset must be
> non-negative, and endOffset must be >= startOffset,
> startOffset=-1811581632,endOffset=-1811581632
>
> How could we get a document ID so that we can tell which document was being
> processed?

Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
I think that https://issues.apache.org/jira/browse/SOLR-5623 should be
ready to go. Would someone please commit from the PR? If there's a
preference, I can attach a patch as well.

On Fri, Jan 10, 2014 at 1:37 PM, Benson Margulies <bi...@gmail.com> wrote:
> Thanks, that's the recipe that I need.
>
> On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : Is there a neighborhood of existing tests I should be visiting here?
>>
>> You'll need a custom schema that refers to your new
>> MockFailOnCertainTokensFilterFactory, so I would create a completely new
>> test class somewhere in ...solr.update (you're testing that an update
>> fails with a clean error).
>>
>>
>> -Hoss
>> http://www.lucidworks.com/




Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
Thanks, that's the recipe that I need.

On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Is there a neighborhood of existing tests I should be visiting here?
>
> You'll need a custom schema that refers to your new
> MockFailOnCertainTokensFilterFactory, so I would create a completely new
> test class somewhere in ...solr.update (you're testing that an update
> fails with a clean error).
>
>
> -Hoss
> http://www.lucidworks.com/

Re: Tracking down the input that hits an analysis chain bug

Posted by Chris Hostetter <ho...@fucit.org>.
: Is there a neighborhood of existing tests I should be visiting here?

You'll need a custom schema that refers to your new
MockFailOnCertainTokensFilterFactory, so I would create a completely new
test class somewhere in ...solr.update (you're testing that an update
fails with a clean error).
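
For what such a schema entry might look like, here is a hypothetical
fieldType sketch; the filter factory class and its trigger attribute are
Hoss's placeholder names, not a real Solr class:

```xml
<fieldType name="failing_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical mock filter that throws on a trigger token -->
    <filter class="solr.MockFailOnCertainTokensFilterFactory" trigger="fail_now"/>
  </analyzer>
</fieldType>
```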


-Hoss
http://www.lucidworks.com/

Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
Is there a neighborhood of existing tests I should be visiting here?


On Fri, Jan 10, 2014 at 11:27 AM, Benson Margulies
<bi...@gmail.com> wrote:
> OK, patch forthcoming.
>
> On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : The problem manifests as this sort of thing:
>> :
>> : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>> : SEVERE: java.lang.IllegalArgumentException: startOffset must be
>> : non-negative, and endOffset must be >= startOffset,
>> : startOffset=-1811581632,endOffset=-1811581632
>>
>> Is there a stack trace in the log to go along with that?  there should be.
>>
>> My suspicion is that since analysis errors like these are
>> RuntimeExceptions, they may not be getting caught & re-thrown with as much
>> context as they should -- so by the time they get logged (or returned to
>> the client) there isn't any info about the problematic field value, let
>> alone the uniqueKey.
>>
>> If we had a test case that reproduces (i.e., with a mock TokenFilter that
>> always throws a RuntimeException when a token matches "fail_now" or
>> something) we could have some tests that assert indexing a doc with that
>> token results in a useful error -- which should help ensure that useful
>> error also gets logged (although I don't think we really have any easy
>> way of asserting specific log messages at the moment)
>>
>>
>> -Hoss
>> http://www.lucidworks.com/

Re: Tracking down the input that hits an analysis chain bug

Posted by Benson Margulies <bi...@gmail.com>.
OK, patch forthcoming.

On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : The problem manifests as this sort of thing:
> :
> : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
> : SEVERE: java.lang.IllegalArgumentException: startOffset must be
> : non-negative, and endOffset must be >= startOffset,
> : startOffset=-1811581632,endOffset=-1811581632
>
> Is there a stack trace in the log to go along with that?  there should be.
>
> My suspicion is that since analysis errors like these are
> RuntimeExceptions, they may not be getting caught & re-thrown with as much
> context as they should -- so by the time they get logged (or returned to
> the client) there isn't any info about the problematic field value, let
> alone the uniqueKey.
>
> If we had a test case that reproduces (i.e., with a mock TokenFilter that
> always throws a RuntimeException when a token matches "fail_now" or
> something) we could have some tests that assert indexing a doc with that
> token results in a useful error -- which should help ensure that useful
> error also gets logged (although I don't think we really have any easy
> way of asserting specific log messages at the moment)
>
>
> -Hoss
> http://www.lucidworks.com/

Re: Tracking down the input that hits an analysis chain bug

Posted by Chris Hostetter <ho...@fucit.org>.
: The problem manifests as this sort of thing:
: 
: Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
: SEVERE: java.lang.IllegalArgumentException: startOffset must be
: non-negative, and endOffset must be >= startOffset,
: startOffset=-1811581632,endOffset=-1811581632

Is there a stack trace in the log to go along with that?  there should be.

My suspicion is that since analysis errors like these are 
RuntimeExceptions, they may not be getting caught & re-thrown with as much 
context as they should -- so by the time they get logged (or returned to 
the client) there isn't any info about the problematic field value, let 
alone the uniqueKey.

If we had a test case that reproduces (i.e., with a mock TokenFilter that 
always throws a RuntimeException when a token matches "fail_now" or 
something) we could have some tests that assert indexing a doc with that 
token results in a useful error -- which should help ensure that useful 
error also gets logged (although I don't think we really have any easy 
way of asserting specific log messages at the moment)
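
The mock filter Hoss describes can be sketched with plain Java (no Lucene
dependency; real test code would extend
org.apache.lucene.analysis.TokenFilter and inspect the term attribute).
The class name and trigger token here are illustrative:

```java
import java.util.Iterator;
import java.util.List;

// Stand-in for a mock TokenFilter: passes tokens through until it sees
// the trigger token, then throws the kind of RuntimeException a buggy
// analysis chain would.
class MockFailTokenFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final String trigger;

    MockFailTokenFilter(Iterator<String> input, String trigger) {
        this.input = input;
        this.trigger = trigger;
    }

    @Override
    public boolean hasNext() { return input.hasNext(); }

    @Override
    public String next() {
        String token = input.next();
        if (trigger.equals(token)) {
            throw new IllegalArgumentException(
                "mock analysis failure on token: " + token);
        }
        return token;
    }
}
```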


-Hoss
http://www.lucidworks.com/