Posted to solr-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2018/09/26 19:23:23 UTC

Making Solr Indexing Errors Visible

I'm pretty sure this was covered earlier.  But I can't find references
to it.  The question is how to make indexing errors clear and obvious. 
(I find that there are maybe 10% more files in a directory than end up
in the index.  I presume they were indexing errors, but I have no idea
which ones or what might have caused the error.)  As I recall, Solr's
post tool doesn't give any errors when indexing.  I (vaguely) recall
that there's a way (through the logs?) to overcome this and show the
errors.  Or maybe it's that you have to do the indexing outside of Solr?

Terry Steichen

Re: Making Solr Indexing Errors Visible

Posted by Terry Steichen <te...@net-frame.com>.
Alex,

Please look at my embedded responses to your questions.

Terry


On 09/26/2018 04:57 PM, Alexandre Rafalovitch wrote:
> The challenge here is to figure out exactly what you are doing,
> because the original description could have been 10 different things.
>
> So:
> 1) You are using bin/post command (we just found this out)
No, I said that at the outset.  And repeated it.
> 2) You are indexing a bunch of files (what format? all same or different?)
I also said I was indexing a mixture of PDF and DOC files.
> 3) You are indexing them into a Schema supposedly ready for those
> files (which one?)
I'm using the managed-schema, the data-driven approach
> 4) You think some of them are not in Solr (how do you know that?
> how do you know that some are? why do you not know _which_ of the
> files are not indexed?)
I thought I made it very clear (twice) that I find that the list of
indexed files is 10% fewer than those in the directory holding the files
being indexed.  And I said that I don't know which are not getting
indexed because I am not getting error messages.
> 5) You are asking whether the error message should have told you if
> there is a problem with indexing (normally yes, but maybe there are
> some edge cases).
That's my question - why am I not getting error messages?  That's the
whole point of my query to the list.
>
> I've put the questions in brackets. I would focus on looking at
> questions in 4) first as they roughly bisect the problem. But other
> things are important too.
>
> I hope this helps,
>     Alex.
>
>
> On 26 September 2018 at 16:39, Terry Steichen <te...@net-frame.com> wrote:
>> Shawn,
>>
>> To the best of my knowledge, I'm not using SolrJ at all.  Just
>> Solr-out-of-the-box.  In this case, if I understand you below, it
>> "should indicate an error status"
>>
>> But it doesn't.
>>
>> Let me try to clarify a bit - I'm just using bin/post to index the files
>> in a directory.  That indexing process produces a lengthy screen display
>> of files that were indexed.  (I realize this isn't production-quality,
>> but I'm not ready for production just yet, so that should be OK.)
>>
>> But no errors are shown (even though there have to be because the total
>> indexed is less than the directory total).
>>
>> Are you saying I can't use post (to verify correct indexing), but that I
>> have to write custom software to accomplish that?
>>
>> And that there's no Solr variable I can define that will do a kind of
>> "verbose" to show that?
>>
>> And that such errors will not show up in any of Solr's log files?
>>
>> Hard to believe (but what is, is, I guess).
>>
>> Terry
>>
>> On 09/26/2018 03:49 PM, Shawn Heisey wrote:
>>> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>>>> I'm pretty sure this was covered earlier.  But I can't find references
>>>> to it.  The question is how to make indexing errors clear and obvious.
>>> If there's an indexing error and you're NOT using the concurrent
>>> client in SolrJ, the response that Solr returns should indicate an
>>> error status.  ConcurrentUpdateSolrClient gets those errors and
>>> swallows them so the calling program never knows they occurred.
>>>
>>>> (I find that there are maybe 10% more files in a directory than end up
>>>> in the index.  I presume they were indexing errors, but I have no idea
>>>> which ones or what might have caused the error.)  As I recall, Solr's
>>>> post tool doesn't give any errors when indexing.  I (vaguely) recall
>>>> that there's a way (through the logs?) to overcome this and show the
>>>> errors.  Or maybe it's that you have to do the indexing outside of Solr?
>>> The simple post tool is not really meant for production use.  It is a
>>> simple tool for interactive testing.
>>>
>>> I don't see anything in SimplePostTool for changing the program's exit
>>> status when an error is encountered during program operation.  If an
>>> error is encountered during the upload, a message would be logged to
>>> stderr, but you wouldn't be able to rely on the program's exit status
>>> to indicate an error.  To get that, you will need to write the
>>> indexing software.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>


Re: Making Solr Indexing Errors Visible

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
The challenge here is to figure out exactly what you are doing,
because the original description could have been 10 different things.

So:
1) You are using bin/post command (we just found this out)
2) You are indexing a bunch of files (what format? all same or different?)
3) You are indexing them into a Schema supposedly ready for those
files (which one?)
4) You think some of them are not in Solr (how do you know that?
how do you know that some are? why do you not know _which_ of the
files are not indexed?)
5) You are asking whether the error message should have told you if
there is a problem with indexing (normally yes, but maybe there are
some edge cases).

I've put the questions in brackets. I would focus on looking at
questions in 4) first as they roughly bisect the problem. But other
things are important too.

I hope this helps,
    Alex.



Re: Making Solr Indexing Errors Visible

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 2:39 PM, Terry Steichen wrote:
> To the best of my knowledge, I'm not using SolrJ at all.  Just
> Solr-out-of-the-box.  In this case, if I understand you below, it
> "should indicate an error status"

I think you'd know if you were using SolrJ directly.  You'd have written 
the indexing program, or whoever DID write it would likely indicate that 
they used SolrJ to talk to Solr.  I was surprised to learn that 
SimplePostTool does NOT use SolrJ ... it uses the HTTP capability built 
into Java.

> Let me try to clarify a bit - I'm just using bin/post to index the files
> in a directory.  That indexing process produces a lengthy screen display
> of files that were indexed.  (I realize this isn't production-quality,
> but I'm not ready for production just yet, so that should be OK.)

If you check your index, are you missing files that bin/post said were 
indexed?  Have you looked in that kind of detail?

The post tool should indicate that an error occurred, and if there was 
any text in the response about the error, it should be displayed.  I was 
looking at the 7.4 code branch.  I didn't see anything about which Solr 
version you're running.

I have not spent any real time using bin/post.  It was part of a class 
that I attended as part of Lucene Revolution in 2010, but I do not 
recall what the output was.  It was all pre-designed and tested so it 
was known to work before I received it.  No errors occurred when I ran 
the script included with the class materials.

> But no errors are shown (even though there have to be because the total
> indexed is less than the directory total).
>
> Are you saying I can't use post (to verify correct indexing), but that I
> have to write custom software to accomplish that?

If you want errors detected programmatically, you'll need to write the 
indexing program.  The simple post tool won't report errors to anything 
that calls it; it will just log them.
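
A minimal sketch of what such an indexing program might look like, assuming
the files go to the extracting handler at /update/extract (the caveats about
that handler elsewhere in this thread still apply).  The base URL, the
collection name "mycollection", and the function names are hypothetical;
only the standard library is used.  The point is that every file ends up in
either the succeeded list or the failed map, so nothing disappears silently.

```python
import urllib.parse
import urllib.request
from pathlib import Path


def send_to_extract_handler(path, base_url="http://localhost:8983/solr/mycollection"):
    """POST one file to the extracting handler; raises on HTTP or network errors."""
    # base_url and the collection name are placeholders, not taken from the thread.
    # commit=true on every request is slow; it only keeps the sketch simple.
    query = urllib.parse.urlencode({"literal.id": str(path), "commit": "true"})
    request = urllib.request.Request(
        f"{base_url}/update/extract?{query}",
        data=Path(path).read_bytes(),
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError (a subclass of OSError) on 4xx/5xx.
    with urllib.request.urlopen(request) as response:
        response.read()


def index_files(paths, send=send_to_extract_handler):
    """Send each file; return (succeeded, failed) where failed maps path -> error text."""
    succeeded, failed = [], {}
    for path in paths:
        try:
            send(path)
            succeeded.append(path)
        except OSError as exc:  # covers unreadable files, network and HTTP errors
            failed[path] = str(exc)
    return succeeded, failed
```

Because the sender is passed in as a parameter, the loop can be tried out
with a stand-in function before pointing it at a real server.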

> And that there's no Solr variable I can define that will do a kind of
> "verbose" to show that?

If Solr returned errors during the indexing, then they will show up in 
the solr.log file, or possibly one of the rotated versions of that 
logfile.  You can also see them in the admin UI Logging tab if Solr 
hasn't been restarted, but the logfile is generally a better way to find 
them.  If you're not seeing errors there, then maybe something went 
wrong with bin/post.
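
One way to search solr.log and its rotated versions is a short script like
the sketch below.  The log directory is whatever your installation uses, and
the " ERROR " marker assumes the log lines carry the level as a separate
word; adjust both if your logging configuration differs.

```python
from pathlib import Path


def find_error_lines(log_dir, marker=" ERROR "):
    """Return (file name, line number, text) for each log line containing marker."""
    hits = []
    # The pattern solr.log* also matches rotated files such as solr.log.1
    for log_file in sorted(Path(log_dir).glob("solr.log*")):
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits
```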

I notice in a later message you indicate that you're indexing PDF and 
DOC files.  When those kinds of files are sent with bin/post, they will 
normally end up in the Extracting Request Handler, also known as SolrCell.

It is highly recommended that the Extracting Request Handler never be 
used in production.  That software embeds Tika inside Solr.  Tika is 
known to explode spectacularly when it gets a file it doesn't know how 
to handle.  PDF files in particular seem to trigger this behavior, but 
other formats can cause it as well.  If Tika is running inside Solr when 
that happens, Solr will also explode, and then you no longer have a 
search engine on that machine.  A better option is to include Tika in an 
indexing program that you write, so if it explodes, Solr stays running.

Thanks,
Shawn


Re: Making Solr Indexing Errors Visible

Posted by Jason Gerlowski <ge...@gmail.com>.
Hi,

Also worth mentioning that bin/post only handles certain file
extensions, and AFAIR it doesn't mention specifically when it skips
over a file because of the extension. You mentioned you're trying to
index Word docs and PDFs.  Are there any other formats in the
directory that might be messing up your counts?
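
A quick way to check is to tally the directory by extension.  This is only a
sketch: the function name is made up, and it walks subdirectories too, which
may or may not match how bin/post was invoked.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(directory):
    """Count regular files under directory, keyed by lowercased extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Any extension in the result other than the PDF and Word ones is a candidate
for files that were skipped rather than failed.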

I also second Shawn's suggestion that you post the "bin/post" output
and a directory listing.  Additionally, if you're able to clean up the
output a bit, you might be able to diff the two lists of files and see
if the ones missing have anything particular in common.
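
The diff can be done with a few lines of script.  The sketch below relies on
the "POSTing file" log line Shawn described, and it assumes (not verified
here) that the file name follows that prefix directly, optionally followed
by a space and more detail.  The function names are made up.

```python
from pathlib import Path

PREFIX = "POSTing file "


def posted_names(output_lines):
    """Collect the text following the "POSTing file " prefix on each matching line."""
    return [line.strip()[len(PREFIX):] for line in output_lines
            if line.strip().startswith(PREFIX)]


def missing_files(directory, output_lines):
    """Return names of files in directory that never appear in a "POSTing file" line."""
    posted = posted_names(output_lines)
    missing = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        name = path.name
        # Assumption: the name comes right after the prefix, then a space or end of line.
        if not any(rest == name or rest.startswith(name + " ") for rest in posted):
            missing.append(name)
    return missing
```

Feed it the saved bin/post output as a list of lines; whatever it returns is
the set of files to look at more closely.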

Good luck,

Jason
On Thu, Sep 27, 2018 at 9:58 AM Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 9/26/2018 2:39 PM, Terry Steichen wrote:
> > Let me try to clarify a bit - I'm just using bin/post to index the files
> > in a directory.  That indexing process produces a lengthy screen display
> > of files that were indexed.  (I realize this isn't production-quality,
> > but I'm not ready for production just yet, so that should be OK.)
>
> I see a previous message on the list from you indicating Solr 6.6.0.
> FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release
> is 6.6.5.  I don't see any fixes related to the post tool, but maybe one
> of the problems that did get fixed might help your server behave better.
>
> Switching my source checkout to the 6.6.0 tag and checking that version...
>
> Each time a file is sent, you should get a log line starting with
> "POSTing file".
>
> The error detection in SimplePostTool has a bunch of parts.  It seems
> that *most* errors will abort the tool entirely, skipping any files that
> have not yet been processed, and logging a message with "FATAL" included.
>
> Can you show us a directory listing and all the output that you get from
> bin/post when processing that directory?
>
> Thanks,
> Shawn
>

Re: Making Solr Indexing Errors Visible

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 2:39 PM, Terry Steichen wrote:
> Let me try to clarify a bit - I'm just using bin/post to index the files
> in a directory.  That indexing process produces a lengthy screen display
> of files that were indexed.  (I realize this isn't production-quality,
> but I'm not ready for production just yet, so that should be OK.)

I see a previous message on the list from you indicating Solr 6.6.0.  
FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release 
is 6.6.5.  I don't see any fixes related to the post tool, but maybe one 
of the problems that did get fixed might help your server behave better.

Switching my source checkout to the 6.6.0 tag and checking that version...

Each time a file is sent, you should get a log line starting with 
"POSTing file".

The error detection in SimplePostTool has a bunch of parts.  It seems 
that *most* errors will abort the tool entirely, skipping any files that 
have not yet been processed, and logging a message with "FATAL" included.

Can you show us a directory listing and all the output that you get from 
bin/post when processing that directory?

Thanks,
Shawn


Re: Making Solr Indexing Errors Visible

Posted by Terry Steichen <te...@net-frame.com>.
Shawn,

To the best of my knowledge, I'm not using SolrJ at all.  Just
Solr-out-of-the-box.  In this case, if I understand you below, it
"should indicate an error status" 

But it doesn't.

Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory.  That indexing process produces a lengthy screen display
of files that were indexed.  (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)

But no errors are shown (even though there have to be because the total
indexed is less than the directory total).

Are you saying I can't use post (to verify correct indexing), but that I
have to write custom software to accomplish that?

And that there's no Solr variable I can define that will do a kind of
"verbose" to show that?

And that such errors will not show up in any of Solr's log files?

Hard to believe (but what is, is, I guess).

Terry



Re: Making Solr Indexing Errors Visible

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 1:23 PM, Terry Steichen wrote:
> I'm pretty sure this was covered earlier.  But I can't find references
> to it.  The question is how to make indexing errors clear and obvious.

If there's an indexing error and you're NOT using the concurrent client 
in SolrJ, the response that Solr returns should indicate an error 
status.  ConcurrentUpdateSolrClient gets those errors and swallows them 
so the calling program never knows they occurred.

> (I find that there are maybe 10% more files in a directory than end up
> in the index.  I presume they were indexing errors, but I have no idea
> which ones or what might have caused the error.)  As I recall, Solr's
> post tool doesn't give any errors when indexing.  I (vaguely) recall
> that there's a way (through the logs?) to overcome this and show the
> errors.  Or maybe it's that you have to do the indexing outside of Solr?

The simple post tool is not really meant for production use.  It is a 
simple tool for interactive testing.

I don't see anything in SimplePostTool for changing the program's exit 
status when an error is encountered during program operation.  If an 
error is encountered during the upload, a message would be logged to 
stderr, but you wouldn't be able to rely on the program's exit status to 
indicate an error.  To get that, you will need to write the indexing 
software.
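
Short of a full indexing program, a small wrapper can at least stop relying
on the exit status and look at what the tool printed instead.  This is a
rough sketch: the marker strings are guesses at what an error line might
contain, and the collection name and path in the usage comment are
placeholders.

```python
import subprocess


def run_and_collect_problems(command, markers=("FATAL", "WARNING")):
    """Run command; return (exit code, output lines containing any marker)."""
    result = subprocess.run(command, capture_output=True, text=True)
    # Check both streams, since errors may be written to stderr.
    lines = result.stdout.splitlines() + result.stderr.splitlines()
    problems = [line for line in lines if any(m in line for m in markers)]
    return result.returncode, problems


# Hypothetical usage; "mycollection" and the directory are placeholders:
# code, problems = run_and_collect_problems(
#     ["bin/post", "-c", "mycollection", "/path/to/docs"])
```

The exit code is still returned, but the list of flagged lines is what the
caller should act on.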

Thanks,
Shawn