Posted to solr-user@lucene.apache.org by SolrUser1543 <os...@gmail.com> on 2015/01/01 21:59:40 UTC

ignoring bad documents during index

Suppose I need to index a batch of several documents ( D1 D2 D3 D4 ) - 4
documents in one request.

If, e.g., D3 is invalid, an exception will be thrown and an HTTP response
with 400 Bad Request will be returned.

Documents D1 and D2 will be indexed, but D4 will not. Also, no indication
of this is returned.

1. Is it possible to ignore such an error and continue indexing D4?
2. What would be the best way to add information about failed documents? I
thought about an update processor, with a try / catch in processAdd that in
case of an exception adds the doc ID to the response.
Or would it be better to implement a component or response writer to add
the info?




--
View this message in context: http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
What I tried is to make an update processor with a try / catch inside
processAdd. This update processor was the last one in the update chain.
In the catch block I tried to add the id of the failed item to the
response. This information ( about failed items ) is lost somewhere when
the request is redirected from the shard that got the initial request to
another.

What I am looking for is a place that looks like a foreach statement, which
iterates over all shards and can aggregate a response from each one,
including the ability to handle the case when some shard is down.

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
We are working with the following configuration:

There is an Indexer service that prepares a batch of XMLs.
Those XMLs are received by a shard that is used only for distributing the
request among the shards (let's call it GW).
Some shards could return OK, some 400 (wrong field value), and some 500
(because they were down).

I want to return a detailed status from this GW to the Indexer, and to know
exactly which items failed.

Re: ignoring bad documents during index

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Can you use CloudSolrServer to submit the XMLs? That omits the intermediate
relay and might make it simpler to respond with additional info.

Regarding experimenting with DistributingUpdateProcessor: you can copy it,
make its factory implement DistributingUpdateProcessorFactory, and add your
processor factory into a certain update chain.
https://wiki.apache.org/solr/UpdateRequestProcessor#Distributed_Updates
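
A hypothetical solrconfig.xml chain for that setup might look like the
fragment below (the custom factory class name is purely illustrative; the
two solr.* factories are the standard ones):

```xml
<updateRequestProcessorChain name="tolerant">
  <!-- copy of DistributedUpdateProcessor whose factory implements
       DistributingUpdateProcessorFactory (class name is hypothetical) -->
  <processor class="com.example.TolerantDistributingUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```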


On Sun, Feb 22, 2015 at 3:12 PM, SolrUser1543 <os...@gmail.com> wrote:

> I'm not using replicas. Is this class relevant anyway?
>
> Is there any way to not change this class, but inherit from it and do the
> try / catch in processAdd?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4188008.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
I'm not using replicas. Is this class relevant anyway?

Is there any way to not change this class, but inherit from it and do the
try / catch in processAdd?

Re: ignoring bad documents during index

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Sun, Feb 22, 2015 at 12:20 PM, SolrUser1543 <os...@gmail.com> wrote:

> Does anyone know where this happens?


The local update on the leader happens first (assuming you use
CloudSolrServer):
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L704

If an update exception such as a wrong field format occurs, it gives up on
distributing to the replicas.
Perhaps you need to catch it there and respond back with the failed ids.
What kind of bad docs are you aiming at: broken doc/xml format, or wrong
field values?


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
I think you did not understand the question.

The problem is indexing via the cloud.
When one shard gets a request, it distributes it among the others, and in
case of an error on one of them, this information is not passed back to the
request initiator.

Does anyone know where this happens?


Re: ignoring bad documents during index

Posted by Michael Della Bitta <mi...@appinions.com>.
At the layer right before you send that XML out, add a fallback option on
error: if there's a failure with the batch, send each document one at a
time.
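
A client-side sketch of that fallback, in plain Java with a hypothetical
`Sender` interface standing in for the actual HTTP call (no Solr APIs are
used here):

```java
import java.util.ArrayList;
import java.util.List;

/** Batch indexing with a per-document fallback: send the whole batch, and
 *  if the batch request fails, retry one document at a time so the bad
 *  document(s) can be identified and reported. */
public class BatchWithFallback {

    /** Stand-in for the real indexing call (e.g. an HTTP POST of the XML
     *  to Solr); assumed to throw on a 400/500-style error response. */
    public interface Sender {
        void send(List<String> docs) throws Exception;
    }

    /** Returns the documents that still failed when sent individually. */
    public static List<String> index(List<String> docs, Sender sender) {
        List<String> failed = new ArrayList<>();
        try {
            sender.send(docs);          // fast path: one request for the whole batch
        } catch (Exception batchError) {
            for (String doc : docs) {   // slow path: isolate the bad document(s)
                try {
                    sender.send(List.of(doc));
                } catch (Exception docError) {
                    failed.add(doc);
                }
            }
        }
        return failed;
    }
}
```

With SolrJ the same shape applies: catch the batch add's exception, then
re-add the documents one by one, collecting the ids that still fail.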

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Fri, Feb 20, 2015 at 10:26 AM, SolrUser1543 <os...@gmail.com> wrote:

> I am sending a batch of XML via an HTTP request.
>
> The same way as indexing via "Documents" in the Solr admin interface.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4187632.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
I am sending a batch of XML via an HTTP request.

The same way as indexing via "Documents" in the Solr admin interface.

Re: ignoring bad documents during index

Posted by Gora Mohanty <go...@mimirtech.com>.
On 20 February 2015 at 15:31, SolrUser1543 <os...@gmail.com> wrote:
>
> I want to experiment with this issue; where exactly should I take a look?
> I want to try to fix this missing aggregation.
>
> What class is responsible for that?

Are you indexing through SolrJ, DIH, or what?

Regards,

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
I want to experiment with this issue; where exactly should I take a look?
I want to try to fix this missing aggregation.

What class is responsible for that?


Re: ignoring bad documents during index

Posted by Erick Erickson <er...@gmail.com>.
There are some significant throughput improvements when you batch up
a bunch of docs to Solr (assuming SolrJ). You can go ahead and send, say,
1,000 docs in a batch and if the batch fails, re-process the list to find the
bad doc.

But as Jack says, Solr could do better here.

Best,
Erick

On Sat, Jan 10, 2015 at 3:46 AM, Jack Krupansky
<ja...@gmail.com> wrote:
> Sending individual documents will give you absolute control - just make
> sure not to "commit" on each document sent since that would really slow
> down indexing.
>
> You could also send smaller batches, like 5 to 20 documents, to balance
> between fine control and performance. It also depends on your document size
> - small documents should be collected into larger batches, but large
> documents should be sent in smaller batches. Sending a total of 2K to 20K
> of bytes of data at a time is probably a good target. Smaller than 2K
> incurs more overhead, and more than 50K or 100K may simply overload the
> server rather than optimize performance.
>
> -- Jack Krupansky
>
> On Sat, Jan 10, 2015 at 6:02 AM, SolrUser1543 <os...@gmail.com> wrote:
>
>> Would it be a good solution to index single document instead of bulk ?
>> In this case I will know about the status of each message .
>>
>> What is recommendation in this case : Bulk vs Single ?
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4178546.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>

Re: ignoring bad documents during index

Posted by Jack Krupansky <ja...@gmail.com>.
Sending individual documents will give you absolute control - just make
sure not to "commit" on each document sent since that would really slow
down indexing.

You could also send smaller batches, like 5 to 20 documents, to balance
fine control against performance. It also depends on your document size:
small documents should be collected into larger batches, but large
documents should be sent in smaller batches. Sending a total of 2K to 20K
bytes of data at a time is probably a good target. Smaller than 2K incurs
more overhead, and more than 50K or 100K may simply overload the server
rather than optimize performance.

-- Jack Krupansky

On Sat, Jan 10, 2015 at 6:02 AM, SolrUser1543 <os...@gmail.com> wrote:

> Would it be a good solution to index single document instead of bulk ?
> In this case I will know about the status of each message .
>
> What is recommendation in this case : Bulk vs Single ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4178546.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
Would it be a good solution to index single documents instead of a bulk?
In this case I will know the status of each message.

What is the recommendation in this case: bulk vs. single?


Re: ignoring bad documents during index

Posted by Jack Krupansky <ja...@gmail.com>.
Correct, Solr clearly needs improvement in this area. Feel free to comment
on the Jira about what options you would like to see supported.

-- Jack Krupansky

On Sat, Jan 10, 2015 at 5:49 AM, SolrUser1543 <os...@gmail.com> wrote:

> From reading this (https://issues.apache.org/jira/browse/SOLR-445) I see
> that
> there is no solution provided for the issue of aggregating responses from
> several solr instances is available .
>
>
> Solr is not able to do that ?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4178544.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
From reading this (https://issues.apache.org/jira/browse/SOLR-445) I see
that no solution is provided for the issue of aggregating responses from
several Solr instances.

Is Solr not able to do that?


Re: ignoring bad documents during index

Posted by Chris Hostetter <ho...@fucit.org>.
I don't have specific answers to all of your questions, but you should
probably look at SOLR-445, where a lot of this has already been discussed
and multiple patches with different approaches have been started...

https://issues.apache.org/jira/browse/SOLR-445

: Date: Wed, 7 Jan 2015 12:38:47 -0700 (MST)
: From: SolrUser1543 <os...@gmail.com>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: ignoring bad documents during index
: 
: I have implemented an update processor as  described above. 
: 
: On single solr instance it works fine. 
: 
: When I testing it on solr cloud with several nodes and trying to index few
: documents , when some of them are incorrect , each instance is creating its
: response, but it is not aggregated by the instance which got a request . 
: 
: I also tried to use QueryReponseWriter , but it is also was not aggregated . 
: 
: The questions are : 
: 1.  how to make it be aggregated ? 
: 2. what kind of update processor it should be : UpdateRequestProcessor or
: DistributedUpdateRequestProcessor ? 
: 
: 
: 
: 
: --
: View this message in context: http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947p4177911.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss
http://www.lucidworks.com/

Re: ignoring bad documents during index

Posted by SolrUser1543 <os...@gmail.com>.
I have implemented an update processor as described above.

On a single Solr instance it works fine.

When I test it on Solr cloud with several nodes and try to index a few
documents, some of which are incorrect, each instance creates its own
response, but it is not aggregated by the instance that got the request.

I also tried to use a QueryResponseWriter, but it was also not aggregated.

The questions are:
1. How to make it aggregated?
2. What kind of update processor should it be: UpdateRequestProcessor or
DistributedUpdateRequestProcessor?



Re: ignoring bad documents during index

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,
Please find answers inline below.

On Thu, Jan 1, 2015 at 11:59 PM, SolrUser1543 <os...@gmail.com> wrote:

> 1. Is it possible to ignore such an error and continue indexing D4?
>
This can be done by catching and swallowing the exception in a custom
UpdateRequestProcessor.


> 2. What would be the best way to add information about failed documents? I
> thought about an update processor, with a try / catch in processAdd that in
> case of an exception adds the doc ID to the response.
> Or would it be better to implement a component or response writer to add
> the info?
>
It turns out that you can add this info into SolrQueryResponse.getValues()
even in a custom UpdateRequestProcessor, and it should be sent back in the
response.
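
A plain-Java sketch of that idea (in a real Solr processor the class would
extend UpdateRequestProcessor, the method would be
processAdd(AddUpdateCommand), and the failed-id list would be put into
SolrQueryResponse.getValues(); the names below are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of a tolerant processAdd: swallow per-document failures and
 *  collect the failed ids so they can be added to the response. */
public class TolerantAddSketch {
    private final List<String> failedIds = new ArrayList<>();

    /** Stands in for processAdd(AddUpdateCommand); the delegate plays the
     *  role of super.processAdd(cmd), i.e. the actual indexing step. */
    public void processAdd(String docId, Consumer<String> delegate) {
        try {
            delegate.accept(docId);   // index the document
        } catch (RuntimeException e) {
            failedIds.add(docId);     // swallow the error, keep indexing
        }
    }

    /** In a real processor this list would go into
     *  SolrQueryResponse.getValues(), e.g. rsp.add("failedIds", ids). */
    public List<String> failedIds() { return failedIds; }
}
```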


>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ignoring-bad-documents-during-index-tp4176947.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>