Posted to solr-user@lucene.apache.org by Benson Margulies <be...@basistech.com> on 2014/01/03 19:56:50 UTC
Tracking down the input that hits an analysis chain bug
Using Solr Cloud with 4.3.1.
We've got a problem with a tokenizer that manifests as calling
OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
what input provokes our code into getting into this pickle.
The problem happens on SolrCloud nodes.
The problem manifests as this sort of thing:
Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalArgumentException: startOffset must be
non-negative, and endOffset must be >= startOffset,
startOffset=-1811581632,endOffset=-1811581632
How could we get a document ID so that we can tell which document was being
processed?
Re: Tracking down the input that hits an analysis chain bug
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I think you do (or can) get a log message for each document insert? If
that's all you need, I think logging configuration will get you there.
I use log4j and turn Solr's pretty verbose logging off using:
log4j.logger.org.apache.solr = WARN
assuming the rest of log4j is set up OK, I think you get the insert
messages at INFO level?
-Mike
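Mike's log4j approach might look like the following log4j.properties fragment. This is a sketch only: it assumes log4j 1.x and Solr's stock LogUpdateProcessor (which logs the ids of added documents at INFO), so the logger names may need adjusting for a given setup:

```properties
# Silence most of Solr's verbose logging
log4j.logger.org.apache.solr=WARN
# But keep per-update logging, which includes the ids of added documents
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessor=INFO
```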
On 1/4/2014 9:24 PM, Benson Margulies wrote:
> I rather assumed that there was some log4j-ish config to be set that
> would do this for me. Lacking one, I guess I'll end up there.
>
> On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
> <ms...@safaribooksonline.com> wrote:
>> Have you considered using a custom UpdateProcessor to catch the exception
>> and provide more context in the logs?
>>
>> -Mike
>>
>>
>> On 01/03/2014 03:33 PM, Benson Margulies wrote:
>>> Robert,
>>>
>>> Yes, if the problem was not data-dependent, indeed I wouldn't need to
>>> index anything. However, I've run a small mountain of data through our
>>> tokenizer on my machine, and never seen the error, but my customer
>>> gets these errors in the middle of a giant spew of data. As it
>>> happens, I _was_ missing that call to clearAttributes() (and the
>>> usual implementation of end()), but I found and fixed that problem
>>> precisely by creating a random data test case using checkRandomData().
>>> Unfortunately, fixing that didn't make the customer's errors go away.
>>>
>>> So I'm left needing to help them identify the data that provokes this,
>>> because I've so far failed to come up with any.
>>>
>>> --benson
>>>
>>>
>>> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>>>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>>>> index anything to reproduce it).
>>>>
>>>> Maybe you have a missing clearAttributes() call (your tokenizer
>>>> 'returns true' without calling that first)? This could explain it, if
>>>> something like a StopFilter is also present in the chain: basically
>>>> the offsets overflow.
>>>>
>>>> the test stuff in BaseTokenStreamTestCase should be able to detect
>>>> this as well...
>>>>
>>>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com>
>>>> wrote:
>>>>> Using Solr Cloud with 4.3.1.
>>>>>
>>>>> We've got a problem with a tokenizer that manifests as calling
>>>>> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure
>>>>> out
>>>>> what input provokes our code into getting into this pickle.
>>>>>
>>>>> The problem happens on SolrCloud nodes.
>>>>>
>>>>> The problem manifests as this sort of thing:
>>>>>
>>>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>>>> non-negative, and endOffset must be >= startOffset,
>>>>> startOffset=-1811581632,endOffset=-1811581632
>>>>>
>>>>> How could we get a document ID so that we can tell which document was
>>>>> being
>>>>> processed?
>>
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
I rather assumed that there was some log4j-ish config to be set that
would do this for me. Lacking one, I guess I'll end up there.
On Fri, Jan 3, 2014 at 8:23 PM, Michael Sokolov
<ms...@safaribooksonline.com> wrote:
> Have you considered using a custom UpdateProcessor to catch the exception
> and provide more context in the logs?
>
> -Mike
>
>
> On 01/03/2014 03:33 PM, Benson Margulies wrote:
>>
>> Robert,
>>
>> Yes, if the problem was not data-dependent, indeed I wouldn't need to
>> index anything. However, I've run a small mountain of data through our
>> tokenizer on my machine, and never seen the error, but my customer
>> gets these errors in the middle of a giant spew of data. As it
>> happens, I _was_ missing that call to clearAttributes() (and the
>> usual implementation of end()), but I found and fixed that problem
>> precisely by creating a random data test case using checkRandomData().
>> Unfortunately, fixing that didn't make the customer's errors go away.
>>
>> So I'm left needing to help them identify the data that provokes this,
>> because I've so far failed to come up with any.
>>
>> --benson
>>
>>
>> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>>>
>>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>>> index anything to reproduce it).
>>>
>>> Maybe you have a missing clearAttributes() call (your tokenizer
>>> 'returns true' without calling that first)? This could explain it, if
>>> something like a StopFilter is also present in the chain: basically
>>> the offsets overflow.
>>>
>>> the test stuff in BaseTokenStreamTestCase should be able to detect
>>> this as well...
>>>
>>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com>
>>> wrote:
>>>>
>>>> Using Solr Cloud with 4.3.1.
>>>>
>>>> We've got a problem with a tokenizer that manifests as calling
>>>> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure
>>>> out
>>>> what input provokes our code into getting into this pickle.
>>>>
>>>> The problem happens on SolrCloud nodes.
>>>>
>>>> The problem manifests as this sort of thing:
>>>>
>>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>>> non-negative, and endOffset must be >= startOffset,
>>>> startOffset=-1811581632,endOffset=-1811581632
>>>>
>>>> How could we get a document ID so that we can tell which document was
>>>> being
>>>> processed?
>
>
Re: Tracking down the input that hits an analysis chain bug
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
Have you considered using a custom UpdateProcessor to catch the
exception and provide more context in the logs?
-Mike
On 01/03/2014 03:33 PM, Benson Margulies wrote:
> Robert,
>
> Yes, if the problem was not data-dependent, indeed I wouldn't need to
> index anything. However, I've run a small mountain of data through our
> tokenizer on my machine, and never seen the error, but my customer
> gets these errors in the middle of a giant spew of data. As it
> happens, I _was_ missing that call to clearAttributes() (and the
> usual implementation of end()), but I found and fixed that problem
> precisely by creating a random data test case using checkRandomData().
> Unfortunately, fixing that didn't make the customer's errors go away.
>
> So I'm left needing to help them identify the data that provokes this,
> because I've so far failed to come up with any.
>
> --benson
>
>
> On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
>> This exception comes from OffsetAttributeImpl (i.e. you don't need to
>> index anything to reproduce it).
>>
>> Maybe you have a missing clearAttributes() call (your tokenizer
>> 'returns true' without calling that first)? This could explain it, if
>> something like a StopFilter is also present in the chain: basically
>> the offsets overflow.
>>
>> the test stuff in BaseTokenStreamTestCase should be able to detect
>> this as well...
>>
>> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
>>> Using Solr Cloud with 4.3.1.
>>>
>>> We've got a problem with a tokenizer that manifests as calling
>>> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
>>> what input provokes our code into getting into this pickle.
>>>
>>> The problem happens on SolrCloud nodes.
>>>
>>> The problem manifests as this sort of thing:
>>>
>>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>>> non-negative, and endOffset must be >= startOffset,
>>> startOffset=-1811581632,endOffset=-1811581632
>>>
>>> How could we get a document ID so that we can tell which document was being
>>> processed?
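Sokolov's custom UpdateProcessor idea is Solr's natural hook for attaching document context to an analysis failure. The sketch below shows the catch-and-rethrow-with-context pattern in plain Java; the types here are simplified stand-ins, not Solr's actual API. A real plugin would subclass org.apache.solr.update.processor.UpdateRequestProcessor and override processAdd(AddUpdateCommand), where the input document and its unique key are available.

```java
import java.util.Map;

// Stand-in for a Solr UpdateRequestProcessor: delegate each add to the next
// processor in the chain, and if analysis blows up with a RuntimeException,
// rethrow it with the document's unique key attached so the failing document
// can be identified in the log.
public class ContextLoggingProcessor {
    interface Next { void processAdd(Map<String, String> doc); }

    private final Next next;
    ContextLoggingProcessor(Next next) { this.next = next; }

    void processAdd(Map<String, String> doc) {
        try {
            next.processAdd(doc);
        } catch (RuntimeException e) {
            // Attach the unique key so the failing document can be found later.
            String id = doc.getOrDefault("id", "(no id)");
            throw new RuntimeException("analysis failed for document id=" + id, e);
        }
    }

    public static void main(String[] args) {
        // Simulate a downstream analysis chain that fails like the one in the thread.
        ContextLoggingProcessor p = new ContextLoggingProcessor(doc -> {
            throw new IllegalArgumentException("startOffset must be non-negative");
        });
        try {
            p.processAdd(Map.of("id", "doc-42", "text", "some body text"));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints: analysis failed for document id=doc-42
        }
    }
}
```

With the unique key in the exception message, the failing document shows up directly in the node's log instead of only the bare IllegalArgumentException.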
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
Robert,
Yes, if the problem was not data-dependent, indeed I wouldn't need to
index anything. However, I've run a small mountain of data through our
tokenizer on my machine, and never seen the error, but my customer
gets these errors in the middle of a giant spew of data. As it
happens, I _was_ missing that call to clearAttributes() (and the
usual implementation of end()), but I found and fixed that problem
precisely by creating a random data test case using checkRandomData().
Unfortunately, fixing that didn't make the customer's errors go away.
So I'm left needing to help them identify the data that provokes this,
because I've so far failed to come up with any.
--benson
On Fri, Jan 3, 2014 at 2:16 PM, Robert Muir <rc...@gmail.com> wrote:
> This exception comes from OffsetAttributeImpl (i.e. you don't need to
> index anything to reproduce it).
>
> Maybe you have a missing clearAttributes() call (your tokenizer
> 'returns true' without calling that first)? This could explain it, if
> something like a StopFilter is also present in the chain: basically
> the offsets overflow.
>
> the test stuff in BaseTokenStreamTestCase should be able to detect
> this as well...
>
> On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
>> Using Solr Cloud with 4.3.1.
>>
>> We've got a problem with a tokenizer that manifests as calling
>> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
>> what input provokes our code into getting into this pickle.
>>
>> The problem happens on SolrCloud nodes.
>>
>> The problem manifests as this sort of thing:
>>
>> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>> SEVERE: java.lang.IllegalArgumentException: startOffset must be
>> non-negative, and endOffset must be >= startOffset,
>> startOffset=-1811581632,endOffset=-1811581632
>>
>> How could we get a document ID so that we can tell which document was being
>> processed?
Re: Tracking down the input that hits an analysis chain bug
Posted by Robert Muir <rc...@gmail.com>.
This exception comes from OffsetAttributeImpl (i.e. you don't need to
index anything to reproduce it).
Maybe you have a missing clearAttributes() call (your tokenizer
'returns true' without calling that first)? This could explain it, if
something like a StopFilter is also present in the chain: basically
the offsets overflow.
the test stuff in BaseTokenStreamTestCase should be able to detect
this as well...
On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies <be...@basistech.com> wrote:
> Using Solr Cloud with 4.3.1.
>
> We've got a problem with a tokenizer that manifests as calling
> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
> what input provokes our code into getting into this pickle.
>
> The problem happens on SolrCloud nodes.
>
> The problem manifests as this sort of thing:
>
> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.IllegalArgumentException: startOffset must be
> non-negative, and endOffset must be >= startOffset,
> startOffset=-1811581632,endOffset=-1811581632
>
> How could we get a document ID so that we can tell which document was being
> processed?
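Robert's "the offsets overflow" remark is literal: token offsets are plain Java ints, and a tokenizer that keeps accumulating offset state without resetting it (the missing-clearAttributes() case) can walk the value past Integer.MAX_VALUE, at which point it wraps to a huge negative number like the startOffset=-1811581632 in the log. A toy, pure-Java illustration of the wrap (not Lucene code):

```java
public class OffsetOverflow {
    public static void main(String[] args) {
        int offset = 0;
        // Keep accumulating a positive per-token advance without ever resetting;
        // once the 32-bit int passes Integer.MAX_VALUE it wraps to a large
        // negative value, which is what a sane-looking-but-negative offset means.
        while (offset >= 0) {
            offset += 1_000_000;
        }
        System.out.println("offset wrapped to " + offset);
    }
}
```

In a real Tokenizer, the guard is to call clearAttributes() at the top of every incrementToken() call that returns true, and to implement end() (calling super.end() and setting the final offset) so no stale state leaks between documents.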
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
I think that https://issues.apache.org/jira/browse/SOLR-5623 should be
ready to go. Would someone please commit from the PR? If there's a
preference, I can attach a patch as well.
On Fri, Jan 10, 2014 at 1:37 PM, Benson Margulies <bi...@gmail.com> wrote:
> Thanks, that's the recipe that I need.
>
> On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : Is there a neighborhood of existing tests I should be visiting here?
>>
>> You'll need a custom schema that refers to your new
>> MockFailOnCertainTokensFilterFactory, so i would create a completely new
>> test class somewhere in ...solr.update (you're testing that an update
>> fails with a clean error)
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
Thanks, that's the recipe that I need.
On Fri, Jan 10, 2014 at 11:40 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Is there a neighborhood of existing tests I should be visiting here?
>
> You'll need a custom schema that refers to your new
> MockFailOnCertainTokensFilterFactory, so i would create a completely new
> test class somewhere in ...solr.update (you're testing that an update
> fails with a clean error)
>
>
> -Hoss
> http://www.lucidworks.com/
Re: Tracking down the input that hits an analysis chain bug
Posted by Chris Hostetter <ho...@fucit.org>.
: Is there a neighborhood of existing tests I should be visiting here?
You'll need a custom schema that refers to your new
MockFailOnCertainTokensFilterFactory, so i would create a completely new
test class somewhere in ...solr.update (you're testing that an update
fails with a clean error)
-Hoss
http://www.lucidworks.com/
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
Is there a neighborhood of existing tests I should be visiting here?
On Fri, Jan 10, 2014 at 11:27 AM, Benson Margulies
<bi...@gmail.com> wrote:
> OK, patch forthcoming.
>
> On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
> <ho...@fucit.org> wrote:
>>
>> : The problem manifests as this sort of thing:
>> :
>> : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
>> : SEVERE: java.lang.IllegalArgumentException: startOffset must be
>> : non-negative, and endOffset must be >= startOffset,
>> : startOffset=-1811581632,endOffset=-1811581632
>>
>> Is there a stack trace in the log to go along with that? there should be.
>>
>> My suspicion is that since analysis errors like these are
>> RuntimeExceptions, they may not be getting caught & re-thrown with as much
>> context as they should -- so by the time they get logged (or returned to
>> the client) there isn't any info about the problematic field value, let
>> alone the uniqueKey.
>>
>> If we had a test case that reproduces (ie: with a mock tokenfilter that
>> always throws a RuntimeException when a token matches "fail_now" or
>> something) we could have some tests that assert indexing a doc with that
>> token results in a useful error -- which should help ensure that useful
>> error also gets logged (although i don't think we really have any
>> easy way of asserting specific log messages at the moment)
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
Re: Tracking down the input that hits an analysis chain bug
Posted by Benson Margulies <bi...@gmail.com>.
OK, patch forthcoming.
On Fri, Jan 10, 2014 at 11:23 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : The problem manifests as this sort of thing:
> :
> : Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
> : SEVERE: java.lang.IllegalArgumentException: startOffset must be
> : non-negative, and endOffset must be >= startOffset,
> : startOffset=-1811581632,endOffset=-1811581632
>
> Is there a stack trace in the log to go along with that? there should be.
>
> My suspicion is that since analysis errors like these are
> RuntimeExceptions, they may not be getting caught & re-thrown with as much
> context as they should -- so by the time they get logged (or returned to
> the client) there isn't any info about the problematic field value, let
> alone the uniqueKey.
>
> If we had a test case that reproduces (ie: with a mock tokenfilter that
> always throws a RuntimeException when a token matches "fail_now" or
> something) we could have some tests that assert indexing a doc with that
> token results in a useful error -- which should help ensure that useful
> error also gets logged (although i don't think we really have any
> easy way of asserting specific log messages at the moment)
>
>
> -Hoss
> http://www.lucidworks.com/
Re: Tracking down the input that hits an analysis chain bug
Posted by Chris Hostetter <ho...@fucit.org>.
: The problem manifests as this sort of thing:
:
: Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
: SEVERE: java.lang.IllegalArgumentException: startOffset must be
: non-negative, and endOffset must be >= startOffset,
: startOffset=-1811581632,endOffset=-1811581632
Is there a stack trace in the log to go along with that? there should be.
My suspicion is that since analysis errors like these are
RuntimeExceptions, they may not be getting caught & re-thrown with as much
context as they should -- so by the time they get logged (or returned to
the client) there isn't any info about the problematic field value, let
alone the uniqueKey.
If we had a test case that reproduces (ie: with a mock tokenfilter that
always throws a RuntimeException when a token matches "fail_now" or
something) we could have some tests that assert indexing a doc with that
token results in a useful error -- which should help ensure that useful
error also gets logged (although i don't think we really have any
easy way of asserting specific log messages at the moment)
-Hoss
http://www.lucidworks.com/
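The mock filter Hoss describes is straightforward to build. The stand-in below shows the intended behavior in plain Java, with the token stream modeled as a String iterator rather than Lucene's API; a real MockFailOnCertainTokensFilterFactory would produce a TokenFilter subclass whose incrementToken() throws a RuntimeException when the CharTermAttribute matches the trigger token.

```java
import java.util.Iterator;
import java.util.List;

// Stand-in for the mock TokenFilter: pass tokens through unchanged, but throw
// a RuntimeException the moment a designated trigger token ("fail_now") appears.
public class FailOnTokenFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final String trigger;

    FailOnTokenFilter(Iterator<String> input, String trigger) {
        this.input = input;
        this.trigger = trigger;
    }

    public boolean hasNext() { return input.hasNext(); }

    public String next() {
        String tok = input.next();
        if (tok.equals(trigger)) {
            // Deterministic failure injection for the test.
            throw new RuntimeException("mock analysis failure on token: " + tok);
        }
        return tok;
    }

    public static void main(String[] args) {
        FailOnTokenFilter f = new FailOnTokenFilter(
                List.of("a", "fail_now", "b").iterator(), "fail_now");
        try {
            while (f.hasNext()) System.out.println(f.next());
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Indexing a document that contains the trigger token then exercises exactly the error path the test needs to assert on: that the resulting update failure carries a clean, context-bearing error.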