Posted to solr-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2018/09/26 19:23:23 UTC
Making Solr Indexing Errors Visible
I'm pretty sure this was covered earlier. But I can't find references
to it. The question is how to make indexing errors clear and obvious.
(I find that there are maybe 10% more files in a directory than end up
in the index. I presume they were indexing errors, but I have no idea
which ones or what might have caused the error.) As I recall, Solr's
post tool doesn't give any errors when indexing. I (vaguely) recall
that there's a way (through the logs?) to overcome this and show the
errors. Or maybe it's that you have to do the indexing outside of Solr?
Terry Steichen
Re: Making Solr Indexing Errors Visible
Posted by Terry Steichen <te...@net-frame.com>.
Alex,
Please look at my embedded responses to your questions.
Terry
On 09/26/2018 04:57 PM, Alexandre Rafalovitch wrote:
> The challenge here is to figure out exactly what you are doing,
> because the original description could have been 10 different things.
>
> So:
> 1) You are using bin/post command (we just found this out)
No, I said that at the outset. And repeated it.
> 2) You are indexing a bunch of files (what format? all same or different?)
I also said I was indexing a mixture of PDF and DOC files.
> 3) You are indexing them into a Schema supposedly ready for those
> files (which one?)
I'm using the managed-schema (the data-driven approach).
> 4) You think some of them are not in Solr (how do you know that?
> how do you know that some are? why do you not know _which_ of the
> files are not indexed?)
I thought I made it very clear (twice) that the index ends up with about
10% fewer files than the directory being indexed holds. And I said that
I don't know which files are not getting indexed, because I am not
getting error messages.
> 5) You are asking whether the error message should have told you if
> there is a problem with indexing (normally yes, but maybe there are
> some edge cases).
That's my question - why am I not getting error messages? That's the
whole point of my query to the list.
>
> I've put the questions in brackets. I would focus on looking at
> questions in 4) first as they roughly bisect the problem. But other
> things are important too.
>
> I hope this helps,
> Alex.
Re: Making Solr Indexing Errors Visible
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
The challenge here is to figure out exactly what you are doing,
because the original description could have been 10 different things.
So:
1) You are using bin/post command (we just found this out)
2) You are indexing a bunch of files (what format? all same or different?)
3) You are indexing them into a Schema supposedly ready for those
files (which one?)
4) You think some of them are not in Solr (how do you know that?
how do you know that some are? why do you not know _which_ of the
files are not indexed?)
5) You are asking whether the error message should have told you if
there is a problem with indexing (normally yes, but maybe there are
some edge cases).
I've put the questions in brackets. I would focus on looking at
questions in 4) first as they roughly bisect the problem. But other
things are important too.
I hope this helps,
Alex.
Re: Making Solr Indexing Errors Visible
Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 2:39 PM, Terry Steichen wrote:
> To the best of my knowledge, I'm not using SolrJ at all. Just
> Solr-out-of-the-box. In this case, if I understand you below, it
> "should indicate an error status"
I think you'd know if you were using SolrJ directly. You'd have written
the indexing program, or whoever DID write it would likely indicate that
they used SolrJ to talk to Solr. I was surprised to learn that
SimplePostTool does NOT use SolrJ ... it uses the HTTP capability built
into Java.
> Let me try to clarify a bit - I'm just using bin/post to index the files
> in a directory. That indexing process produces a lengthy screen display
> of files that were indexed. (I realize this isn't production-quality,
> but I'm not ready for production just yet, so that should be OK.)
If you check your index, are you missing files that bin/post said were
indexed? Have you looked in that kind of detail?
The post tool should indicate that an error occurred, and if there was
any text in the response about the error, it should be displayed. I was
looking at the 7.4 code branch. I didn't see anything about which Solr
version you're running.
I have not spent any real time using bin/post. It was part of a class
that I attended as part of Lucene Revolution in 2010, but I do not
recall what the output was. It was all pre-designed and tested so it
was known to work before I received it. No errors occurred when I ran
the script included with the class materials.
> But no errors are shown (even though there have to be, because the total
> indexed is less than the directory total).
>
> Are you saying I can't use post (to verify correct indexing), but that I
> have to write custom software to accomplish that?
If you want errors detected programmatically, you'll need to write the
indexing program. The simple post tool won't report errors to anything
that calls it; it will just log them.
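A minimal sketch of what such a program could look like, in Python with
only the standard library. It is an illustration under assumptions, not a
replacement for bin/post: the Solr address, the core name "docs", and the
function names are hypothetical, and it assumes the files go to the
/update/extract handler with literal.id set to the file path. It records
every file whose response status is not 200.

```python
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: placeholder Solr address and core name, not taken from this thread.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/docs/update/extract"


def send_file(path, url=SOLR_EXTRACT_URL):
    """POST one file to the extracting handler; return (http_status, body_text)."""
    params = urllib.parse.urlencode(
        {"literal.id": str(path), "resource.name": path.name}
    )
    request = urllib.request.Request(
        url + "?" + params,
        data=path.read_bytes(),
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # A failed update comes back as a 4xx/5xx status with an error body.
        return err.code, err.read().decode("utf-8", "replace")


def index_directory(directory, send=send_file):
    """Send every regular file under `directory`.

    Returns a list of (path, status, message) tuples, one per failed file.
    """
    failures = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        try:
            status, body = send(path)
        except OSError as err:  # connection refused, unreadable file, ...
            failures.append((path, None, str(err)))
            continue
        if status != 200:
            failures.append((path, status, body[:200]))
    return failures


def main(argv):
    """Print failures to stderr; return 1 if any file failed, else 0."""
    failed = index_directory(argv[1])
    for path, status, message in failed:
        print(f"FAILED {path} status={status} {message}", file=sys.stderr)
    return 1 if failed else 0
```

Calling sys.exit(main(sys.argv)) from a script gives the caller a
non-zero exit status whenever a file fails. The sketch does not send a
commit; it assumes autoCommit is configured or that a commit follows.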
> And that there's no solr variable I can define that will do a kind of
> "verbose" to show that?
If Solr returned errors during the indexing, then they will show up in
the solr.log file, or possibly one of the rotated versions of that
logfile. You can also see them in the admin UI Logging tab if Solr
hasn't been restarted, but the logfile is generally a better way to find
them. If you're not seeing errors there, then maybe something went
wrong with bin/post.
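As a rough illustration of checking the logfiles, the sketch below scans
solr.log and its rotated copies for ERROR and WARN entries. The directory
to pass in (often server/logs), the "solr.log*" naming of rotated files,
and the assumption that the level appears as a space-delimited word on
each line all depend on the logging configuration, so treat them as
guesses to verify. The function name is made up for this example.

```python
from pathlib import Path


def find_error_lines(log_dir, levels=("ERROR", "WARN")):
    """Return (file_name, line_number, text) for lines mentioning one of `levels`."""
    hits = []
    # Assumption: rotated logs sit next to solr.log and share its prefix.
    for log_file in sorted(Path(log_dir).glob("solr.log*")):
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                # Match the level as a separate word, e.g. "... ERROR (qtp...) ...".
                if any(f" {level} " in line for level in levels):
                    hits.append((log_file.name, number, line.rstrip()))
    return hits
```

Stack traces span several lines, so the matching line is only a pointer;
the lines after it in the logfile usually name the document and cause.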
I notice in a later message you indicate that you're indexing PDF and
DOC files. When those kinds of files are sent with bin/post, they will
normally end up in the Extracting Request Handler, also known as SolrCell.
It is highly recommended that the Extracting Request Handler never be
used in production. That software embeds Tika inside Solr. Tika is
known to explode spectacularly when it gets a file it doesn't know how
to handle. PDF files in particular seem to trigger this behavior, but
other formats can cause it as well. If Tika is running inside Solr when
that happens, Solr will also explode, and then you no longer have a
search engine on that machine. A better option is to include Tika in an
indexing program that you write, so if it explodes, Solr stays running.
Thanks,
Shawn
Re: Making Solr Indexing Errors Visible
Posted by Jason Gerlowski <ge...@gmail.com>.
Hi,
Also worth mentioning that bin/post only handles certain file
extensions, and AFAIR it doesn't mention specifically when it skips
over a file because of the extension. You mentioned you're trying to
index Word docs and PDFs. Are there any other formats in the
directory that might be messing up your counts?
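One way to check this is to list the files whose extension is not in the
set the tool accepts. The sketch below is only an illustration: the
extension list is an assumption about the tool's defaults and should be
compared with the help output of the bin/post version in use, the
comparison is case-insensitive by assumption, and the function names are
hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumption: mirrors the default file types of bin/post; verify for your version.
DEFAULT_TYPES = set(
    "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,"
    "odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log".split(",")
)


def files_likely_skipped(directory, accepted=DEFAULT_TYPES):
    """Return files under `directory` whose extension is not in `accepted`."""
    skipped = [
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.suffix.lower().lstrip(".") not in accepted
    ]
    return sorted(skipped)


def count_by_extension(paths):
    """Count paths per lower-cased extension; files without one count as '(none)'."""
    return Counter(path.suffix.lower() or "(none)" for path in paths)
```

If the number of files reported here is close to the 10% gap, the
missing documents may have been skipped rather than rejected.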
I also second Shawn's suggestion that you post the "bin/post" output
and a directory listing. Additionally, if you're able to clean up the
output a bit, you might be able to diff the two lists of files and see
if the ones missing have anything particular in common.
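The diff itself can be sketched in a few lines of Python. The example
assumes two text files prepared by hand, one with the directory listing
and one with the names that were reported as indexed, each holding one
file name per line; both files and the function names are hypothetical.

```python
from pathlib import Path


def read_names(list_file):
    """Read one file name per line, ignoring blank lines and directory parts."""
    names = set()
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            names.add(Path(line).name)
    return names


def missing_from_index(directory_list, indexed_list):
    """Names present in the directory listing but absent from the indexed list."""
    return sorted(read_names(directory_list) - read_names(indexed_list))
```

Comparing base names only is a simplification: two files with the same
name in different subdirectories collapse into one entry, so compare full
paths instead if the directory tree contains such duplicates.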
Good luck,
Jason
On Thu, Sep 27, 2018 at 9:58 AM Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 9/26/2018 2:39 PM, Terry Steichen wrote:
> > Let me try to clarify a bit - I'm just using bin/post to index the files
> > in a directory. That indexing process produces a lengthy screen display
> > of files that were indexed. (I realize this isn't production-quality,
> > but I'm not ready for production just yet, so that should be OK.)
>
> I see a previous message on the list from you indicating solr 6.6.0.
> FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release
> is 6.6.5. I don't see any fixes related to the post tool, but maybe one
> of the problems that did get fixed might help your server behave better.
>
> Switching my source checkout to the 6.6.0 tag and checking that version...
>
> Each time a file is sent, you should get a log line starting with
> "POSTing file".
>
> The error detection in SimplePostTool has a bunch of parts. It seems
> that *most* errors will abort the tool entirely, skipping any files that
> have not yet been processed, and logging a message with "FATAL" included.
>
> Can you show us a directory listing and all the output that you get from
> bin/post when processing that directory?
>
> Thanks,
> Shawn
>
Re: Making Solr Indexing Errors Visible
Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 2:39 PM, Terry Steichen wrote:
> Let me try to clarify a bit - I'm just using bin/post to index the files
> in a directory. That indexing process produces a lengthy screen display
> of files that were indexed. (I realize this isn't production-quality,
> but I'm not ready for production just yet, so that should be OK.)
I see a previous message on the list from you indicating Solr 6.6.0.
FYI there are five bugfix releases after 6.6.0 -- the latest 6.x release
is 6.6.5. I don't see any fixes related to the post tool, but maybe one
of the problems that did get fixed might help your server behave better.
Switching my source checkout to the 6.6.0 tag and checking that version...
Each time a file is sent, you should get a log line starting with
"POSTing file".
The error detection in SimplePostTool has a bunch of parts. It seems
that *most* errors will abort the tool entirely, skipping any files that
have not yet been processed, and logging a message with "FATAL" included.
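To make those messages easier to spot in a long run, the captured output
can be summarized with a short script. This is a sketch under
assumptions: it expects the output (stdout and stderr together) in one
string, assumes lines shaped like "POSTing file NAME (TYPE) to
[base]/...", and treats any line containing "FATAL" or "WARNING" as a
problem report. The exact message formats may differ between versions,
and the function name is made up for the example.

```python
import re

# Assumed line shape: "POSTing file report.pdf (application/pdf) to [base]/extract"
POSTING = re.compile(r"^POSTing file (?P<name>.+?)(?: \([^()]*\))? to \[base\]")


def summarize_post_output(text):
    """Return (posted_names, problem_lines) found in captured bin/post output."""
    posted, problems = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = POSTING.match(line)
        if match:
            posted.append(match.group("name"))
        elif "FATAL" in line or "WARNING" in line:
            problems.append(line)
    return posted, problems
```

If the count of posted names is lower than the number of files in the
directory and a FATAL line is present, the run probably stopped early;
if the counts match, the files were sent and the cause lies elsewhere.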
Can you show us a directory listing and all the output that you get from
bin/post when processing that directory?
Thanks,
Shawn
Re: Making Solr Indexing Errors Visible
Posted by Terry Steichen <te...@net-frame.com>.
Shawn,
To the best of my knowledge, I'm not using SolrJ at all. Just
Solr-out-of-the-box. In this case, if I understand you below, it
"should indicate an error status"
But it doesn't.
Let me try to clarify a bit - I'm just using bin/post to index the files
in a directory. That indexing process produces a lengthy screen display
of files that were indexed. (I realize this isn't production-quality,
but I'm not ready for production just yet, so that should be OK.)
But no errors are shown (even though there have to be, because the total
indexed is less than the directory total).
Are you saying I can't use post (to verify correct indexing), but that I
have to write custom software to accomplish that?
And that there's no Solr variable I can define that will do a kind of
"verbose" to show that?
And that such errors will not show up in any of Solr's log files?
Hard to believe (but what is, is, I guess).
Terry
On 09/26/2018 03:49 PM, Shawn Heisey wrote:
> On 9/26/2018 1:23 PM, Terry Steichen wrote:
>> I'm pretty sure this was covered earlier. But I can't find references
>> to it. The question is how to make indexing errors clear and obvious.
>
> If there's an indexing error and you're NOT using the concurrent
> client in SolrJ, the response that Solr returns should indicate an
> error status. ConcurrentUpdateSolrClient gets those errors and
> swallows them so the calling program never knows they occurred.
>
>> (I find that there are maybe 10% more files in a directory than end up
>> in the index. I presume they were indexing errors, but I have no idea
>> which ones or what might have caused the error.) As I recall, Solr's
>> post tool doesn't give any errors when indexing. I (vaguely) recall
>> that there's a way (through the logs?) to overcome this and show the
>> errors. Or maybe it's that you have to do the indexing outside of Solr?
>
> The simple post tool is not really meant for production use. It is a
> simple tool for interactive testing.
>
> I don't see anything in SimplePostTool for changing the program's exit
> status when an error is encountered during program operation. If an
> error is encountered during the upload, a message would be logged to
> stderr, but you wouldn't be able to rely on the program's exit status
> to indicate an error. To get that, you will need to write the
> indexing software.
>
> Thanks,
> Shawn
>
>
Re: Making Solr Indexing Errors Visible
Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/26/2018 1:23 PM, Terry Steichen wrote:
> I'm pretty sure this was covered earlier. But I can't find references
> to it. The question is how to make indexing errors clear and obvious.
If there's an indexing error and you're NOT using the concurrent client
in SolrJ, the response that Solr returns should indicate an error
status. ConcurrentUpdateSolrClient gets those errors and swallows them
so the calling program never knows they occurred.
> (I find that there are maybe 10% more files in a directory than end up
> in the index. I presume they were indexing errors, but I have no idea
> which ones or what might have caused the error.) As I recall, Solr's
> post tool doesn't give any errors when indexing. I (vaguely) recall
> that there's a way (through the logs?) to overcome this and show the
> errors. Or maybe it's that you have to do the indexing outside of Solr?
The simple post tool is not really meant for production use. It is a
simple tool for interactive testing.
I don't see anything in SimplePostTool for changing the program's exit
status when an error is encountered during program operation. If an
error is encountered during the upload, a message would be logged to
stderr, but you wouldn't be able to rely on the program's exit status to
indicate an error. To get that, you will need to write the indexing
software.
Thanks,
Shawn