Posted to solr-user@lucene.apache.org by Rupert Fiasco <ru...@gmail.com> on 2009/08/25 00:30:32 UTC

Responses getting truncated

I am seeing our responses getting truncated if and only if I search on
our main text field.

E.g. I just do something basic like

title_t:arthritis

Then I get a valid document back. But if I add in our larger text field:

title_t:arthritis OR text_t:arthritis

then the resultant document is NOT valid XML (if using wt=xml) or Ruby
(if using wt=ruby). If I run these through curl on the command line, the
response is truncated, and if I run the search through the web-based
admin panel then I get an XML parse error.

This appears to have just started recently and the only thing we have
done is change our indexer from a PHP one to a Java one, but
functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
I know in my last message I said I was having issues with "extra
content" at the start of a response, resulting in an invalid document.
I still am having issues with documents getting truncated (yes, I have
problems galore).

I will elaborate on why it's so difficult to track down an actual
document that is causing the failure and producing an invalid / truncated
response (if I could find the document, I could post it to the group).

I will just document the steps:

1) I have a query which results in a bogus truncated document. This
query pulls back all fields. If I take that same query and remove the
"text_t" field from the returned field list, then all is well. This
indicates to me that it's a problem with the text_t field. This query
uses the default returned rows of 10.

2) So far so good. Then my next step is to find the document. So I
take my original query, remove the text_t from the field list to get
my result set.

3) I run a new query that JUST selects that document based on its Doc
ID (which I have from the first query). My thinking is that my
"broken" document HAS to be in that set, so I can just select it by ID
and then validate the response.

This is where it breaks down: I know one or more broken documents are
in my set, but if I iterate over each doc id and pull each one out
individually, its response is valid. It's only broken when I pull it
out in the first query. It's NOT broken when I pull it out by ID, even
though I am also pulling out the same "broken" field.

If you can read Ruby my script is here:

http://brockwine.com/solr_fetch.txt

In the first net/http call, if I include the "text_t" field in the
"fl" list then it breaks. If I remove it, get the doc ids and then
iterate over each one and get it back from Solr (including the
supposedly broken field "text_t") then it works just fine - the
exception is never raised. But it is raised in the first call if I
include it.

To me this makes absolutely no sense.
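For concreteness, the whole check boils down to something like this (a
plain-Java sketch of the same two-step procedure, not my actual script;
the host/port and the "doc_id" field name are placeholders):

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TruncationCheck {
    static final String SELECT = "http://localhost:8983/solr/select";

    // Fetch wt=xml and parse it; parse() throws if the response is truncated.
    static Document fetch(String q, String fl) throws Exception {
        String url = SELECT + "?wt=xml&rows=10"
            + "&q=" + URLEncoder.encode(q, "UTF-8")
            + "&fl=" + URLEncoder.encode(fl, "UTF-8");
        InputStream in = new URL(url).openStream();
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Steps 1-2: run the original query without text_t to collect ids.
        Document list = fetch("title_t:arthritis OR text_t:arthritis", "doc_id");
        NodeList strs = list.getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element e = (Element) strs.item(i);
            if (!"doc_id".equals(e.getAttribute("name"))) continue;
            String id = e.getTextContent();
            // Step 3: re-fetch each doc by id, this time including text_t.
            try {
                fetch("doc_id:" + id, "doc_id,text_t");
            } catch (Exception ex) {
                System.out.println("Broken response for doc " + id + ": " + ex);
            }
        }
    }
}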

Thanks
-Rupert

On Fri, Aug 28, 2009 at 2:14 PM, Joe Calderon<ca...@gmail.com> wrote:
> i had a similar issue with text from past requests showing up, this was on
> 1.3 nightly, i switched to using the lucid build of 1.3 and the problem went
> away, im using a nightly of 1.4 right now also without probs, then again
> your mileage may vary as i also made a bunch of schema changes that might
> have had some effect, it wouldnt hurt to try though
>
>
> On 08/28/2009 02:04 PM, Rupert Fiasco wrote:
>>
>> Firstly, to everyone who has been helping me, thank you very much. All
>> this feedback is helping me narrow down these issues.
>>
>> I deleted the index and re-indexed all the data from scratch and for a
>> couple of days we were OK, but now it seems to be erring again.
>>
>> It happens on different input documents so what was broken before now
>> works (documents that were having issues before are OK now, after a
>> fresh re-index).
>>
>> An issue we are seeing now is that an XML response from Solr will
>> contain the "tail" of an earlier response, for an example:
>>
>> http://brockwine.com/solr2.txt
>>
>> That is a response we are getting from Solr - using the web interface
>> for Solr in Firefox, Firefox freaks out because it tries to parse
>> that, and of course, its invalid XML, but I can retrieve that via
>> curl.
>>
>> Anyone seeing this before?
>>
>> In regards to earlier questions:
>>
>>
>>>
>>> i assume you are correct, but you listed several steps of transformation
>>> above, are you certian they all work correctly and produce valid UTF-8?
>>>
>>
>> Yes, I have looked at the source and contacted the author of the
>> conversion library we are using and have verified that if UTF8 goes in
>> then UTF8 will come out and UTF8 is definitely going in.
>>
>> I dont think sending over an actual input document would help because
>> it seems to change. Plus, this latest issue appears to be more an
>> issue of the last response buffer not clearing or something.
>>
>> Whats strange is that if I wait a few minutes and reload, then the
>> buffer is cleared and I get back a valid response, its intermittent,
>> but appears to be happening frequently.
>>
>> If it matters, we started using LucidGaze for Solr about 10 days ago,
>> approximately when these issues started happening (but its hard to say
>> if thats an issue because at this same time we switched from a PHP to
>> Java indexing client).
>>
>> Thanks for your patience
>>
>> -Rupert
>>
>> On Tue, Aug 25, 2009 at 8:33 PM, Chris
>> Hostetter<ho...@fucit.org>  wrote:
>>
>>>
>>> : We are running an instance of MediaWiki so the text goes through a
>>> : couple of transformations: wiki markup ->  html ->  plain text.
>>> : Its at this last step that I take a "snippet" and insert that into
>>> Solr.
>>>        ...
>>> : doc.addField("text_snippet_t", article.getSnippet(1000));
>>>
>>> ok, well first off: that's the not the field we're you are having
>>> problems
>>> is it?  if i remember correctly from your previous posts, wasn't the
>>> response getting aborted in the middle of the Contents field?
>>>
>>> : and a maximum of 1K chars if its bigger. I initialized this String
>>> : from the DB by using the String constructor where I pass in the
>>> : charset/collation
>>> :
>>> : text = new String(textFromDB, "UTF-8");
>>> :
>>> : So to the best of my knowledge, accessing a substring of a UTF-8
>>> : encoded string should not break up the UTF-8 code point. Is that an
>>>
>>> i assume you are correct, but you listed several steps of transformation
>>> above, are you certian they all work correctly and produce valid UTF-8?
>>>
>>> this leads back to my suggestion before....
>>>
>>> :>  Can you put the orriginal (pre solr, pre solrj, raw untouched,
>>> etc...)
>>> :>  file that this solr doc came from online somewhere?
>>> :>
>>> :>  What does your *indexing* code look like? ... Can you add some
>>> debuging to
>>> :>  the SolrJ client when you *add* this doc to print out exactly what
>>> those
>>> :>  1000 characters are?
>>>
>>>
>>> -Hoss
>>>
>>>
>
>

Re: Responses getting truncated

Posted by Joe Calderon <ca...@gmail.com>.
I had a similar issue with text from past requests showing up; this was
on a 1.3 nightly. I switched to using the Lucid build of 1.3 and the
problem went away. I'm using a nightly of 1.4 right now, also without
problems. Then again, your mileage may vary, as I also made a bunch of schema
changes that might have had some effect; it wouldn't hurt to try, though.


On 08/28/2009 02:04 PM, Rupert Fiasco wrote:
> Firstly, to everyone who has been helping me, thank you very much. All
> this feedback is helping me narrow down these issues.
>
> I deleted the index and re-indexed all the data from scratch and for a
> couple of days we were OK, but now it seems to be erring again.
>
> It happens on different input documents so what was broken before now
> works (documents that were having issues before are OK now, after a
> fresh re-index).
>
> An issue we are seeing now is that an XML response from Solr will
> contain the "tail" of an earlier response, for an example:
>
> http://brockwine.com/solr2.txt
>
> That is a response we are getting from Solr - using the web interface
> for Solr in Firefox, Firefox freaks out because it tries to parse
> that, and of course, its invalid XML, but I can retrieve that via
> curl.
>
> Anyone seeing this before?
>
> In regards to earlier questions:
>
>    
>> i assume you are correct, but you listed several steps of transformation
>> above, are you certian they all work correctly and produce valid UTF-8?
>>      
> Yes, I have looked at the source and contacted the author of the
> conversion library we are using and have verified that if UTF8 goes in
> then UTF8 will come out and UTF8 is definitely going in.
>
> I dont think sending over an actual input document would help because
> it seems to change. Plus, this latest issue appears to be more an
> issue of the last response buffer not clearing or something.
>
> Whats strange is that if I wait a few minutes and reload, then the
> buffer is cleared and I get back a valid response, its intermittent,
> but appears to be happening frequently.
>
> If it matters, we started using LucidGaze for Solr about 10 days ago,
> approximately when these issues started happening (but its hard to say
> if thats an issue because at this same time we switched from a PHP to
> Java indexing client).
>
> Thanks for your patience
>
> -Rupert
>
> On Tue, Aug 25, 2009 at 8:33 PM, Chris
> Hostetter<ho...@fucit.org>  wrote:
>    
>> : We are running an instance of MediaWiki so the text goes through a
>> : couple of transformations: wiki markup ->  html ->  plain text.
>> : Its at this last step that I take a "snippet" and insert that into Solr.
>>         ...
>> : doc.addField("text_snippet_t", article.getSnippet(1000));
>>
>> ok, well first off: that's the not the field we're you are having problems
>> is it?  if i remember correctly from your previous posts, wasn't the
>> response getting aborted in the middle of the Contents field?
>>
>> : and a maximum of 1K chars if its bigger. I initialized this String
>> : from the DB by using the String constructor where I pass in the
>> : charset/collation
>> :
>> : text = new String(textFromDB, "UTF-8");
>> :
>> : So to the best of my knowledge, accessing a substring of a UTF-8
>> : encoded string should not break up the UTF-8 code point. Is that an
>>
>> i assume you are correct, but you listed several steps of transformation
>> above, are you certian they all work correctly and produce valid UTF-8?
>>
>> this leads back to my suggestion before....
>>
>> :>  Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
>> :>  file that this solr doc came from online somewhere?
>> :>
>> :>  What does your *indexing* code look like? ... Can you add some debuging to
>> :>  the SolrJ client when you *add* this doc to print out exactly what those
>> :>  1000 characters are?
>>
>>
>> -Hoss
>>
>>      


Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
Firstly, to everyone who has been helping me, thank you very much. All
this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch and for a
couple of days we were OK, but now it seems to be erring again.

It happens on different input documents so what was broken before now
works (documents that were having issues before are OK now, after a
fresh re-index).

An issue we are seeing now is that an XML response from Solr will
contain the "tail" of an earlier response, for an example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr. Using the web interface
for Solr in Firefox, Firefox freaks out because it tries to parse
that (and of course, it's invalid XML), but I can retrieve it via
curl.

Has anyone seen this before?

In regards to earlier questions:

> i assume you are correct, but you listed several steps of transformation
> above, are you certian they all work correctly and produce valid UTF-8?

Yes, I have looked at the source and contacted the author of the
conversion library we are using, and have verified that if UTF-8 goes in
then UTF-8 will come out, and UTF-8 is definitely going in.

I don't think sending over an actual input document would help because
it seems to change. Plus, this latest issue appears to be more an
issue of the last response buffer not being cleared, or something.

What's strange is that if I wait a few minutes and reload, then the
buffer is cleared and I get back a valid response. It's intermittent,
but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago,
approximately when these issues started happening (but its hard to say
if thats an issue because at this same time we switched from a PHP to
Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris
Hostetter<ho...@fucit.org> wrote:
>
> : We are running an instance of MediaWiki so the text goes through a
> : couple of transformations: wiki markup -> html -> plain text.
> : Its at this last step that I take a "snippet" and insert that into Solr.
>        ...
> : doc.addField("text_snippet_t", article.getSnippet(1000));
>
> ok, well first off: that's the not the field we're you are having problems
> is it?  if i remember correctly from your previous posts, wasn't the
> response getting aborted in the middle of the Contents field?
>
> : and a maximum of 1K chars if its bigger. I initialized this String
> : from the DB by using the String constructor where I pass in the
> : charset/collation
> :
> : text = new String(textFromDB, "UTF-8");
> :
> : So to the best of my knowledge, accessing a substring of a UTF-8
> : encoded string should not break up the UTF-8 code point. Is that an
>
> i assume you are correct, but you listed several steps of transformation
> above, are you certian they all work correctly and produce valid UTF-8?
>
> this leads back to my suggestion before....
>
> : > Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
> : > file that this solr doc came from online somewhere?
> : >
> : > What does your *indexing* code look like? ... Can you add some debuging to
> : > the SolrJ client when you *add* this doc to print out exactly what those
> : > 1000 characters are?
>
>
> -Hoss
>

Re: Responses getting truncated

Posted by Chris Hostetter <ho...@fucit.org>.
: We are running an instance of MediaWiki so the text goes through a
: couple of transformations: wiki markup -> html -> plain text.
: Its at this last step that I take a "snippet" and insert that into Solr.
	...
: doc.addField("text_snippet_t", article.getSnippet(1000));

ok, well first off: that's not the field where you are having problems,
is it?  if i remember correctly from your previous posts, wasn't the
response getting aborted in the middle of the Contents field?

: and a maximum of 1K chars if its bigger. I initialized this String
: from the DB by using the String constructor where I pass in the
: charset/collation
: 
: text = new String(textFromDB, "UTF-8");
: 
: So to the best of my knowledge, accessing a substring of a UTF-8
: encoded string should not break up the UTF-8 code point. Is that an

i assume you are correct, but you listed several steps of transformation 
above, are you certain they all work correctly and produce valid UTF-8?

this leads back to my suggestion before....

: > Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
: > file that this solr doc came from online somewhere?
: >
: > What does your *indexing* code look like? ... Can you add some debuging to
: > the SolrJ client when you *add* this doc to print out exactly what those
: > 1000 characters are?


-Hoss

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
> 1.  Exactly which version of Solr / SolrJ are you using?

Solr Specification Version: 1.3.0
Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47
Latest SolrJ that I downloaded a couple of days ago.

> Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?

We are running an instance of MediaWiki so the text goes through a
couple of transformations: wiki markup -> html -> plain text.
It's at this last step that I take a "snippet" and insert that into Solr.

My snippet code is:

// article.java
public String getSnippet(int maxlen) {
    int length = getPlainText().length() >= maxlen ? maxlen : getPlainText().length();
    return getPlainText().substring(0, length);
}

// ... later on ... add to Solr
doc.addField("text_snippet_t", article.getSnippet(1000));

So in theory, I am getting the whole article if it's less than 1K chars,
and a maximum of 1K chars if it's bigger. I initialized this String
from the DB by using the String constructor where I pass in the
charset/collation:

text = new String(textFromDB, "UTF-8");

So to the best of my knowledge, taking a substring of a UTF-8
encoded string should not break up a UTF-8 code point. Is that an
incorrect assumption? If so, what is the best way to break up a UTF-8
encoded string and get approximately that many characters? Exactness
is not a requirement.
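For reference, a minimal sketch of a truncation that never splits a
surrogate pair, so the truncated value still serializes to valid UTF-8
(the method name is only illustrative, not the code we actually run):

public String safeSnippet(String text, int maxlen) {
    if (text.length() <= maxlen) {
        return text;
    }
    int end = maxlen;
    // Don't cut between a high surrogate and its low surrogate;
    // an unpaired surrogate cannot be encoded as valid UTF-8.
    if (Character.isHighSurrogate(text.charAt(end - 1))) {
        end--;
    }
    return text.substring(0, end);
}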

-Rupert

On Tue, Aug 25, 2009 at 5:37 PM, Chris
Hostetter<ho...@fucit.org> wrote:
>
> 1.  Exactly which version of Solr / SolrJ are you using?
>
> 2. ...
>
> : >>>> I am using the SolrJ client to add documents to in my index. My field
> : >>>> is a normal "text" field type and the text itself is the first 1000
> : >>>> characters of an article.
>
> Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
> file that this solr doc came from online somewhere?
>
> What does your *indexing* code look like? ... Can you add some debuging to
> the SolrJ client when you *add* this doc to print out exactly what those
> 1000 characters are?
>
> My hunch: when you are extracting the first 1000 characters, you're
> getting only the first half of a character ...or... you are getting docs
> with less them 1000 characters and winding up with a buffer (char[]?) that
> has garbage at the end; SolrJ isn't complaining on the way in, but
> something farther down (maybe before indexing, maybe after) is seeing that
> garbage and cutting the field off at that point.
>
>
>
> -Hoss
>
>

Re: Responses getting truncated

Posted by Chris Hostetter <ho...@fucit.org>.
1.  Exactly which version of Solr / SolrJ are you using?

2. ...

: >>>> I am using the SolrJ client to add documents to in my index. My field
: >>>> is a normal "text" field type and the text itself is the first 1000
: >>>> characters of an article.

Can you put the original (pre solr, pre solrj, raw untouched, etc...) 
file that this solr doc came from online somewhere?

What does your *indexing* code look like? ... Can you add some debugging to 
the SolrJ client when you *add* this doc to print out exactly what those 
1000 characters are?

My hunch: when you are extracting the first 1000 characters, you're 
getting only the first half of a character ...or... you are getting docs 
with less than 1000 characters and winding up with a buffer (char[]?) that 
has garbage at the end; SolrJ isn't complaining on the way in, but 
something farther down (maybe before indexing, maybe after) is seeing that 
garbage and cutting the field off at that point.
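To illustrate the second half of that hunch with a hypothetical snippet
(this is not code from the indexer, just the shape of bug being
described): copying into a fixed-size buffer and then building the
String from the whole buffer rather than only the part that was filled.

public class BufferGarbageDemo {
    public static void main(String[] args) {
        char[] buf = new char[1000];
        String article = "short text";                        // shorter than the buffer
        article.getChars(0, article.length(), buf, 0);
        String bad  = new String(buf);                        // keeps 990 trailing '\u0000' chars
        String good = new String(buf, 0, article.length());   // only the filled portion
        System.out.println(bad.length() + " vs " + good.length());  // 1000 vs 10
    }
}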



-Hoss


Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
So I whipped up a quick SolrJ client and ran it against the document
that I referenced earlier. When I retrieve the doc and just print its
field/value pairs to stdout it ends like this:

http://brockwine.com/images/output1.png

It appears to be some kind of garbage characters.
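(The printing itself is just the obvious loop - roughly the following,
given a SolrJ SolrDocument called doc:)

for (String name : doc.getFieldNames()) {
    System.out.println(name + " = " + doc.getFieldValue(name));
}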

-Rupert

On Tue, Aug 25, 2009 at 12:19 PM, Uri Boness<ub...@gmail.com> wrote:
> Hi,
>
> This is a very strange behavior and the fact that it is cause by one
> specific field, again, leads me to believe it's still a data issue. Did you
> try using SolrJ to query the data as well? If the same thing happens when
> using the binary protocol, then it's probably not a data issue. On the other
> hand, if it works fine, then at least you can inspect the data to see where
> things go wrong. Sorry for insisting on that, but I cannot think of anything
> else that can cause this problem.
>
> If anyone else have a better idea, I'm actually very curious to hear about
> it.
>
> Uri
>
> Rupert Fiasco wrote:
>>
>> The text file at:
>>
>> http://brockwine.com/solr.txt
>>
>> Represents one of these truncated responses (this one in XML). It
>> starts out great, then look at the bottom, boom, game over. :)
>>
>> I found this document by first running our bigger search which breaks
>> and then zeroing in a specific broken document by using the rows/start
>> parameters. But there are any unknown number of these "broken"
>> documents - a lot I presume.
>>
>> -Rupert
>>
>> On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh<av...@gmail.com> wrote:
>>
>>>
>>> Can you copy-paste the source data indexed in this field which causes the
>>> error?
>>>
>>> Cheers
>>> Avlesh
>>>
>>> On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco <ru...@gmail.com>
>>> wrote:
>>>
>>>
>>>>
>>>> Using wt=json also yields an invalid document. So after more
>>>> investigation it appears that I can always "break" the response by
>>>> pulling back a specific field via the "fl" parameter. If I leave off a
>>>> field then the response is valid, if I include it then Solr yields an
>>>> invalid document - a truncated document. This happens in any response
>>>> format (xml, json, ruby).
>>>>
>>>> I am using the SolrJ client to add documents to in my index. My field
>>>> is a normal "text" field type and the text itself is the first 1000
>>>> characters of an article.
>>>>
>>>>
>>>>>
>>>>> It can very well be an issue with the data itself. For example, if the
>>>>>
>>>>
>>>> data
>>>>
>>>>>
>>>>> contains un-escaped characters which invalidates the response
>>>>>
>>>>
>>>> When I look at the document in using wt=xml then all XML entities are
>>>> escaped. When I look at it under wt=ruby then all single quotes are
>>>> escaped, same for json, so it appears that all escaping it taking
>>>> place. The core problem seems to be that the document is just
>>>> truncated - it just plain end of files. Jetty's log says its sending
>>>> back an HTTP 200 so all is well.
>>>>
>>>> Any ideas on how I can dig deeper?
>>>>
>>>> Thanks
>>>> -Rupert
>>>>
>>>>
>>>> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness<ub...@gmail.com> wrote:
>>>>
>>>>>
>>>>> It can very well be an issue with the data itself. For example, if the
>>>>>
>>>>
>>>> data
>>>>
>>>>>
>>>>> contains un-escaped characters which invalidates the response. I don't
>>>>>
>>>>
>>>> know
>>>>
>>>>>
>>>>> much about ruby, but what do you get with wt=json?
>>>>>
>>>>> Rupert Fiasco wrote:
>>>>>
>>>>>>
>>>>>> I am seeing our responses getting truncated if and only if I search on
>>>>>> our main text field.
>>>>>>
>>>>>> E.g. I just do some basic like
>>>>>>
>>>>>> title_t:arthritis
>>>>>>
>>>>>> Then I get a valid document back. But if I add in our larger text
>>>>>> field:
>>>>>>
>>>>>> title_t:arthritis OR text_t:arthritis
>>>>>>
>>>>>> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
>>>>>> (using wt=ruby). If I run these through curl on the command its
>>>>>> truncated and if I run the search through the web-based admin panel
>>>>>> then I get an XML parse error.
>>>>>>
>>>>>> This appears to have just started recently and the only thing we have
>>>>>> done is change our indexer from a PHP one to a Java one, but
>>>>>> functionally they are identical.
>>>>>>
>>>>>> Any thoughts? Thanks in advance.
>>>>>>
>>>>>> - Rupert
>>>>>>
>>>>>>
>>>>>>
>>
>>
>

Re: Responses getting truncated

Posted by Uri Boness <ub...@gmail.com>.
Hi,

This is a very strange behavior, and the fact that it is caused by one 
specific field, again, leads me to believe it's still a data issue. Did 
you try using SolrJ to query the data as well? If the same thing happens 
when using the binary protocol, then it's probably not a data issue. On 
the other hand, if it works fine, then at least you can inspect the data 
to see where things go wrong. Sorry for insisting on that, but I cannot 
think of anything else that can cause this problem.
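For what it's worth, a minimal SolrJ sketch of that cross-check might
look like the following; the server URL and field names are
placeholders, and the setParser call just makes the javabin wire format
explicit in case it is not already the default:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.BinaryResponseParser;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BinaryProtocolCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setParser(new BinaryResponseParser());   // javabin, no XML/Ruby writer involved
        SolrQuery query = new SolrQuery("title_t:arthritis OR text_t:arthritis");
        query.setFields("doc_id", "text_t");
        query.setRows(10);
        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            Object text = doc.getFieldValue("text_t");
            System.out.println(doc.getFieldValue("doc_id") + " -> "
                + (text == null ? 0 : text.toString().length()) + " chars");
        }
    }
}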

If anyone else have a better idea, I'm actually very curious to hear 
about it.

Uri

Rupert Fiasco wrote:
> The text file at:
>
> http://brockwine.com/solr.txt
>
> Represents one of these truncated responses (this one in XML). It
> starts out great, then look at the bottom, boom, game over. :)
>
> I found this document by first running our bigger search which breaks
> and then zeroing in a specific broken document by using the rows/start
> parameters. But there are any unknown number of these "broken"
> documents - a lot I presume.
>
> -Rupert
>
> On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh<av...@gmail.com> wrote:
>   
>> Can you copy-paste the source data indexed in this field which causes the
>> error?
>>
>> Cheers
>> Avlesh
>>
>> On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco <ru...@gmail.com> wrote:
>>
>>     
>>> Using wt=json also yields an invalid document. So after more
>>> investigation it appears that I can always "break" the response by
>>> pulling back a specific field via the "fl" parameter. If I leave off a
>>> field then the response is valid, if I include it then Solr yields an
>>> invalid document - a truncated document. This happens in any response
>>> format (xml, json, ruby).
>>>
>>> I am using the SolrJ client to add documents to in my index. My field
>>> is a normal "text" field type and the text itself is the first 1000
>>> characters of an article.
>>>
>>>       
>>>> It can very well be an issue with the data itself. For example, if the
>>>>         
>>> data
>>>       
>>>> contains un-escaped characters which invalidates the response
>>>>         
>>> When I look at the document in using wt=xml then all XML entities are
>>> escaped. When I look at it under wt=ruby then all single quotes are
>>> escaped, same for json, so it appears that all escaping it taking
>>> place. The core problem seems to be that the document is just
>>> truncated - it just plain end of files. Jetty's log says its sending
>>> back an HTTP 200 so all is well.
>>>
>>> Any ideas on how I can dig deeper?
>>>
>>> Thanks
>>> -Rupert
>>>
>>>
>>> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness<ub...@gmail.com> wrote:
>>>       
>>>> It can very well be an issue with the data itself. For example, if the
>>>>         
>>> data
>>>       
>>>> contains un-escaped characters which invalidates the response. I don't
>>>>         
>>> know
>>>       
>>>> much about ruby, but what do you get with wt=json?
>>>>
>>>> Rupert Fiasco wrote:
>>>>         
>>>>> I am seeing our responses getting truncated if and only if I search on
>>>>> our main text field.
>>>>>
>>>>> E.g. I just do some basic like
>>>>>
>>>>> title_t:arthritis
>>>>>
>>>>> Then I get a valid document back. But if I add in our larger text field:
>>>>>
>>>>> title_t:arthritis OR text_t:arthritis
>>>>>
>>>>> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
>>>>> (using wt=ruby). If I run these through curl on the command its
>>>>> truncated and if I run the search through the web-based admin panel
>>>>> then I get an XML parse error.
>>>>>
>>>>> This appears to have just started recently and the only thing we have
>>>>> done is change our indexer from a PHP one to a Java one, but
>>>>> functionally they are identical.
>>>>>
>>>>> Any thoughts? Thanks in advance.
>>>>>
>>>>> - Rupert
>>>>>
>>>>>
>>>>>           
>
>   

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
The text file at:

http://brockwine.com/solr.txt

That file represents one of these truncated responses (this one in XML). It
starts out great, then look at the bottom, boom, game over. :)

I found this document by first running our bigger search (which breaks)
and then zeroing in on a specific broken document by using the rows/start
parameters. But there are an unknown number of these "broken"
documents - a lot, I presume.

-Rupert

On Tue, Aug 25, 2009 at 9:40 AM, Avlesh Singh<av...@gmail.com> wrote:
> Can you copy-paste the source data indexed in this field which causes the
> error?
>
> Cheers
> Avlesh
>
> On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco <ru...@gmail.com> wrote:
>
>> Using wt=json also yields an invalid document. So after more
>> investigation it appears that I can always "break" the response by
>> pulling back a specific field via the "fl" parameter. If I leave off a
>> field then the response is valid, if I include it then Solr yields an
>> invalid document - a truncated document. This happens in any response
>> format (xml, json, ruby).
>>
>> I am using the SolrJ client to add documents to in my index. My field
>> is a normal "text" field type and the text itself is the first 1000
>> characters of an article.
>>
>> > It can very well be an issue with the data itself. For example, if the
>> data
>> > contains un-escaped characters which invalidates the response
>>
>> When I look at the document in using wt=xml then all XML entities are
>> escaped. When I look at it under wt=ruby then all single quotes are
>> escaped, same for json, so it appears that all escaping it taking
>> place. The core problem seems to be that the document is just
>> truncated - it just plain end of files. Jetty's log says its sending
>> back an HTTP 200 so all is well.
>>
>> Any ideas on how I can dig deeper?
>>
>> Thanks
>> -Rupert
>>
>>
>> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness<ub...@gmail.com> wrote:
>> > It can very well be an issue with the data itself. For example, if the
>> data
>> > contains un-escaped characters which invalidates the response. I don't
>> know
>> > much about ruby, but what do you get with wt=json?
>> >
>> > Rupert Fiasco wrote:
>> >>
>> >> I am seeing our responses getting truncated if and only if I search on
>> >> our main text field.
>> >>
>> >> E.g. I just do some basic like
>> >>
>> >> title_t:arthritis
>> >>
>> >> Then I get a valid document back. But if I add in our larger text field:
>> >>
>> >> title_t:arthritis OR text_t:arthritis
>> >>
>> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
>> >> (using wt=ruby). If I run these through curl on the command its
>> >> truncated and if I run the search through the web-based admin panel
>> >> then I get an XML parse error.
>> >>
>> >> This appears to have just started recently and the only thing we have
>> >> done is change our indexer from a PHP one to a Java one, but
>> >> functionally they are identical.
>> >>
>> >> Any thoughts? Thanks in advance.
>> >>
>> >> - Rupert
>> >>
>> >>
>> >
>>
>

Re: Responses getting truncated

Posted by Avlesh Singh <av...@gmail.com>.
Can you copy-paste the source data indexed in this field which causes the
error?

Cheers
Avlesh

On Tue, Aug 25, 2009 at 10:01 PM, Rupert Fiasco <ru...@gmail.com> wrote:

> Using wt=json also yields an invalid document. So after more
> investigation it appears that I can always "break" the response by
> pulling back a specific field via the "fl" parameter. If I leave off a
> field then the response is valid, if I include it then Solr yields an
> invalid document - a truncated document. This happens in any response
> format (xml, json, ruby).
>
> I am using the SolrJ client to add documents to in my index. My field
> is a normal "text" field type and the text itself is the first 1000
> characters of an article.
>
> > It can very well be an issue with the data itself. For example, if the
> data
> > contains un-escaped characters which invalidates the response
>
> When I look at the document in using wt=xml then all XML entities are
> escaped. When I look at it under wt=ruby then all single quotes are
> escaped, same for json, so it appears that all escaping it taking
> place. The core problem seems to be that the document is just
> truncated - it just plain end of files. Jetty's log says its sending
> back an HTTP 200 so all is well.
>
> Any ideas on how I can dig deeper?
>
> Thanks
> -Rupert
>
>
> On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness<ub...@gmail.com> wrote:
> > It can very well be an issue with the data itself. For example, if the
> data
> > contains un-escaped characters which invalidates the response. I don't
> know
> > much about ruby, but what do you get with wt=json?
> >
> > Rupert Fiasco wrote:
> >>
> >> I am seeing our responses getting truncated if and only if I search on
> >> our main text field.
> >>
> >> E.g. I just do some basic like
> >>
> >> title_t:arthritis
> >>
> >> Then I get a valid document back. But if I add in our larger text field:
> >>
> >> title_t:arthritis OR text_t:arthritis
> >>
> >> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
> >> (using wt=ruby). If I run these through curl on the command its
> >> truncated and if I run the search through the web-based admin panel
> >> then I get an XML parse error.
> >>
> >> This appears to have just started recently and the only thing we have
> >> done is change our indexer from a PHP one to a Java one, but
> >> functionally they are identical.
> >>
> >> Any thoughts? Thanks in advance.
> >>
> >> - Rupert
> >>
> >>
> >
>

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
Using wt=json also yields an invalid document. So after more
investigation, it appears that I can always "break" the response by
pulling back a specific field via the "fl" parameter. If I leave off a
field then the response is valid; if I include it then Solr yields an
invalid document - a truncated document. This happens in any response
format (xml, json, ruby).
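Concretely, the pair of requests being compared looks roughly like this
(host, core path, and field names are only placeholders):

/solr/select?q=title_t:arthritis+OR+text_t:arthritis&fl=doc_id,title_t&wt=xml         -> well-formed
/solr/select?q=title_t:arthritis+OR+text_t:arthritis&fl=doc_id,title_t,text_t&wt=xml  -> truncated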

I am using the SolrJ client to add documents to my index. My field
is a normal "text" field type and the text itself is the first 1000
characters of an article.

> It can very well be an issue with the data itself. For example, if the data
> contains un-escaped characters which invalidates the response

When I look at the document using wt=xml, all XML entities are
escaped. When I look at it under wt=ruby, all single quotes are
escaped, and the same goes for json, so it appears that all escaping is
taking place. The core problem seems to be that the document is just
truncated - it just plain ends prematurely. Jetty's log says it's sending
back an HTTP 200, so as far as Jetty is concerned all is well.

Any ideas on how I can dig deeper?

Thanks
-Rupert


On Mon, Aug 24, 2009 at 4:31 PM, Uri Boness<ub...@gmail.com> wrote:
> It can very well be an issue with the data itself. For example, if the data
> contains un-escaped characters which invalidates the response. I don't know
> much about ruby, but what do you get with wt=json?
>
> Rupert Fiasco wrote:
>>
>> I am seeing our responses getting truncated if and only if I search on
>> our main text field.
>>
>> E.g. I just do some basic like
>>
>> title_t:arthritis
>>
>> Then I get a valid document back. But if I add in our larger text field:
>>
>> title_t:arthritis OR text_t:arthritis
>>
>> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
>> (using wt=ruby). If I run these through curl on the command its
>> truncated and if I run the search through the web-based admin panel
>> then I get an XML parse error.
>>
>> This appears to have just started recently and the only thing we have
>> done is change our indexer from a PHP one to a Java one, but
>> functionally they are identical.
>>
>> Any thoughts? Thanks in advance.
>>
>> - Rupert
>>
>>
>

Re: Responses getting truncated

Posted by Uri Boness <ub...@gmail.com>.
It can very well be an issue with the data itself - for example, if the 
data contains un-escaped characters which invalidate the response. I 
don't know much about Ruby, but what do you get with wt=json?

Rupert Fiasco wrote:
> I am seeing our responses getting truncated if and only if I search on
> our main text field.
>
> E.g. I just do some basic like
>
> title_t:arthritis
>
> Then I get a valid document back. But if I add in our larger text field:
>
> title_t:arthritis OR text_t:arthritis
>
> then the resultant document is NOT valid XML (if using wt=xml) or Ruby
> (using wt=ruby). If I run these through curl on the command its
> truncated and if I run the search through the web-based admin panel
> then I get an XML parse error.
>
> This appears to have just started recently and the only thing we have
> done is change our indexer from a PHP one to a Java one, but
> functionally they are identical.
>
> Any thoughts? Thanks in advance.
>
> - Rupert
>
>   

Re: Responses getting truncated

Posted by Joe Calderon <ca...@gmail.com>.
Yonik has a point; when I ran into this I also upgraded to the latest 
stable Jetty. I'm using Jetty 6.1.18.

On 08/28/2009 04:07 PM, Rupert Fiasco wrote:
> I deployed LucidWorks with my existing solrconfig / schema and
> re-indexed my data into it and pushed it out to production, we'll see
> how it stacks up over the weekend. Already queries that were breaking
> on the prior Jetty/stock Solr setup are now working - but I have seen
> it before where upon an initial re-index things work OK then a couple
> of days later they break.
>
> Keep y'all posted.
>
> Thanks
> -Rupert
>
> On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco<ru...@gmail.com>  wrote:
>    
>> Yes, I am hitting the Solr server directly (medsolr1.colo:9007)
>>
>> Versions / architectures:
>>
>> Jetty(6.1.3)
>>
>> ooga@medsolr1 ~ $ uname -a
>> Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
>> x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux
>>
>> ooga@medsolr1 ~ $ java -version
>> java version "1.6.0_11"
>> Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)
>>
>>
>> I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.
>>
>> -Rupert
>>
>> On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley<ys...@gmail.com>  wrote:
>>      
>>> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco<ru...@gmail.com>  wrote:
>>>        
>>>> If I run these through curl on the command its
>>>> truncated and if I run the search through the web-based admin panel
>>>> then I get an XML parse error.
>>>>          
>>> Are you running curl directly against the solr server, or going
>>> through a load balancer?  Cutting out the middle-men using curl was a
>>> great idea - just make sure to go all the way.
>>>
>>> At first I thought it could possibly be a FastWriter bug (internal
>>> Solr class), but that's only used on the TextWriter (JSON, Python,
>>> Ruby) based formats, not on the original XML format.
>>>
>>> It really looks like you're hitting a lower-level IO buffering bug
>>> (esp when you see a response starting off with the tail of another
>>> response).  That doesn't look like it could be a Solr bug... but
>>> rather smells like a thread safety bug in the servlet container.
>>>
>>> What type of machine are you running on?  What JVM?
>>> You could try upgrading your version of Jetty, the JVM, or try
>>> switching to Tomcat.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>        
>>>> This appears to have just started recently and the only thing we have
>>>> done is change our indexer from a PHP one to a Java one, but
>>>> functionally they are identical.
>>>>
>>>> Any thoughts? Thanks in advance.
>>>>
>>>> - Rupert
>>>>
>>>>          
>>>        
>>      


Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
So we have been running LucidWorks for Solr for about a week now and
have seen no problems, so I believe it was due to that buffering
issue in Jetty 6.1.3, as suggested here:

>>> It really looks like you're hitting a lower-level IO buffering bug
>>> (esp when you see a response starting off with the tail of another
>>> response).  That doesn't look like it could be a Solr bug... but
>>> rather smells like a thread safety bug in the servlet container.

Thanks for everyone's help and input. LucidWorks For The Win.

-Rupert

On Fri, Aug 28, 2009 at 4:07 PM, Rupert Fiasco<ru...@gmail.com> wrote:
> I deployed LucidWorks with my existing solrconfig / schema and
> re-indexed my data into it and pushed it out to production, we'll see
> how it stacks up over the weekend. Already queries that were breaking
> on the prior Jetty/stock Solr setup are now working - but I have seen
> it before where upon an initial re-index things work OK then a couple
> of days later they break.
>
> Keep y'all posted.
>
> Thanks
> -Rupert
>
> On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>> Yes, I am hitting the Solr server directly (medsolr1.colo:9007)
>>
>> Versions / architectures:
>>
>> Jetty(6.1.3)
>>
>> ooga@medsolr1 ~ $ uname -a
>> Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
>> x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux
>>
>> ooga@medsolr1 ~ $ java -version
>> java version "1.6.0_11"
>> Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
>> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)
>>
>>
>> I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.
>>
>> -Rupert
>>
>> On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley<ys...@gmail.com> wrote:
>>> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>>>> If I run these through curl on the command its
>>>> truncated and if I run the search through the web-based admin panel
>>>> then I get an XML parse error.
>>>
>>> Are you running curl directly against the solr server, or going
>>> through a load balancer?  Cutting out the middle-men using curl was a
>>> great idea - just make sure to go all the way.
>>>
>>> At first I thought it could possibly be a FastWriter bug (internal
>>> Solr class), but that's only used on the TextWriter (JSON, Python,
>>> Ruby) based formats, not on the original XML format.
>>>
>>> It really looks like you're hitting a lower-level IO buffering bug
>>> (esp when you see a response starting off with the tail of another
>>> response).  That doesn't look like it could be a Solr bug... but
>>> rather smells like a thread safety bug in the servlet container.
>>>
>>> What type of machine are you running on?  What JVM?
>>> You could try upgrading your version of Jetty, the JVM, or try
>>> switching to Tomcat.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>> This appears to have just started recently and the only thing we have
>>>> done is change our indexer from a PHP one to a Java one, but
>>>> functionally they are identical.
>>>>
>>>> Any thoughts? Thanks in advance.
>>>>
>>>> - Rupert
>>>>
>>>
>>
>

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
I deployed LucidWorks with my existing solrconfig / schema, re-indexed
my data into it, and pushed it out to production; we'll see
how it stacks up over the weekend. Queries that were breaking
on the prior Jetty/stock Solr setup are already working - but I have seen
it before where things work OK upon an initial re-index and then break a
couple of days later.

Keep y'all posted.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco<ru...@gmail.com> wrote:
> Yes, I am hitting the Solr server directly (medsolr1.colo:9007)
>
> Versions / architectures:
>
> Jetty(6.1.3)
>
> ooga@medsolr1 ~ $ uname -a
> Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
> x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux
>
> ooga@medsolr1 ~ $ java -version
> java version "1.6.0_11"
> Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)
>
>
> I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.
>
> -Rupert
>
> On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley<ys...@gmail.com> wrote:
>> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>>> If I run these through curl on the command its
>>> truncated and if I run the search through the web-based admin panel
>>> then I get an XML parse error.
>>
>> Are you running curl directly against the solr server, or going
>> through a load balancer?  Cutting out the middle-men using curl was a
>> great idea - just make sure to go all the way.
>>
>> At first I thought it could possibly be a FastWriter bug (internal
>> Solr class), but that's only used on the TextWriter (JSON, Python,
>> Ruby) based formats, not on the original XML format.
>>
>> It really looks like you're hitting a lower-level IO buffering bug
>> (esp when you see a response starting off with the tail of another
>> response).  That doesn't look like it could be a Solr bug... but
>> rather smells like a thread safety bug in the servlet container.
>>
>> What type of machine are you running on?  What JVM?
>> You could try upgrading your version of Jetty, the JVM, or try
>> switching to Tomcat.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>>> This appears to have just started recently and the only thing we have
>>> done is change our indexer from a PHP one to a Java one, but
>>> functionally they are identical.
>>>
>>> Any thoughts? Thanks in advance.
>>>
>>> - Rupert
>>>
>>
>

Re: Responses getting truncated

Posted by Rupert Fiasco <ru...@gmail.com>.
Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

Versions / architectures:

Jetty(6.1.3)

ooga@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

ooga@medsolr1 ~ $ java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley<ys...@gmail.com> wrote:
> On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco<ru...@gmail.com> wrote:
>> If I run these through curl on the command its
>> truncated and if I run the search through the web-based admin panel
>> then I get an XML parse error.
>
> Are you running curl directly against the solr server, or going
> through a load balancer?  Cutting out the middle-men using curl was a
> great idea - just make sure to go all the way.
>
> At first I thought it could possibly be a FastWriter bug (internal
> Solr class), but that's only used on the TextWriter (JSON, Python,
> Ruby) based formats, not on the original XML format.
>
> It really looks like you're hitting a lower-level IO buffering bug
> (esp when you see a response starting off with the tail of another
> response).  That doesn't look like it could be a Solr bug... but
> rather smells like a thread safety bug in the servlet container.
>
> What type of machine are you running on?  What JVM?
> You could try upgrading your version of Jetty, the JVM, or try
> switching to Tomcat.
>
> -Yonik
> http://www.lucidimagination.com
>
>
>> This appears to have just started recently and the only thing we have
>> done is change our indexer from a PHP one to a Java one, but
>> functionally they are identical.
>>
>> Any thoughts? Thanks in advance.
>>
>> - Rupert
>>
>

Re: Responses getting truncated

Posted by Yonik Seeley <ys...@gmail.com>.
On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco<ru...@gmail.com> wrote:
> If I run these through curl on the command its
> truncated and if I run the search through the web-based admin panel
> then I get an XML parse error.

Are you running curl directly against the solr server, or going
through a load balancer?  Cutting out the middle-men using curl was a
great idea - just make sure to go all the way.

At first I thought it could possibly be a FastWriter bug (internal
Solr class), but that's only used on the TextWriter (JSON, Python,
Ruby) based formats, not on the original XML format.

It really looks like you're hitting a lower-level IO buffering bug
(esp when you see a response starting off with the tail of another
response).  That doesn't look like it could be a Solr bug... but
rather smells like a thread safety bug in the servlet container.

What type of machine are you running on?  What JVM?
You could try upgrading your version of Jetty, the JVM, or try
switching to Tomcat.

-Yonik
http://www.lucidimagination.com


> This appears to have just started recently and the only thing we have
> done is change our indexer from a PHP one to a Java one, but
> functionally they are identical.
>
> Any thoughts? Thanks in advance.
>
> - Rupert
>