Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2015/12/17 01:31:47 UTC

[jira] [Updated] (SOLR-445) Update Handlers abort with bad documents

     [ https://issues.apache.org/jira/browse/SOLR-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-445:
--------------------------
    Attachment: SOLR-445.patch


I started playing around with this patch a bit to see if I could help move it forward.  I'm a little out of my depth with a lot of the details of how distributed updates work, but the more I tried to make sense of it, the more convinced I became that there were a lot of things that just weren't very well accounted for in the existing tests (which were consistently failing, but the failures themselves weren't consistent between runs).

Here's a summary of what's new/different in the patch I'm attaching...


* DistributedUpdateProcessor.DistribPhase
** not sure why this enum was made non-static in earlier patches ... I reverted this unneeded change.
* TolerantUpdateProcessor
** processDelete
*** The method has a couple of glaringly obvious bugs that apparently don't trip under the current tests
*** added several nocommits for things that jumped out at me
* DistribTolerantUpdateProcessorTest
** beefed up assertion msgs in assertUSucceedsWithErrors
** fixed testValidAdds so it's not dead code
** testInvalidAdds
*** sanity check code wasn't passing reliably
**** details of what failed are lost depending on how the update is routed (random seed)
**** relaxed this check so it's reliable, with a nocommit comment to see if we can tighten it up
*** assuming the sanity check passes, assertUSucceedsWithErrors (still) fails on some seeds with a null error list
**** I'm guessing this is what Anshum alluded to in his last comment: "Node2 as of now return an HTTP OK and doesn't throw an exception, the StreamingSolrClient used but the Distributed Updated Processor doesn't realize the error that was consumed by the leader of shard 1"
* TestTolerantUpdateProcessorCloud
** New MiniSolrCloudCluster-based test to try to demonstrate all the possible distrib code paths I could think of (see below)

TestTolerantUpdateProcessorCloud is the real meat of what I've added here.  Starting with the basic behavior/assertions currently tested in TolerantUpdateProcessorTest, I built it up to try and exercise every possible distributed update code path I could imagine (updates with docs all on one shard some of which fail, updates with docs for diff shards and some from each shard fail, updates with docs for diff shards but only one shard fails, etc...) -- but only tested against a MiniSolrCloudCluster collection that actually had 1 node, 1 shard, 1 replica and an HttpSolrClient talking directly to that node.  Once all those assertions were passing, I changed it to use 5 nodes, 2 shards, 2 replicas and started testing all of those scenarios against 5 HttpSolrClients pointed at every individual node (one of which hosts no replicas) as well as a ZK aware CloudSolrClient.  All 6 tests against all 6 clients currently fail (reliably) at some point in these scenarios.
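
To make those scenarios concrete, here's a rough sketch of the kind of batch each scenario sends through the tolerant chain via SolrJ.  This is not code from the patch; the collection name is made up, and the field/doc values just mirror the curl example further down:

{code}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// sketch only: a mixed batch where one doc fails schema conversion
public class TolerantBatchSketch {
  static UpdateResponse sendMixedBatch(SolrClient client) throws Exception {
    UpdateRequest req = new UpdateRequest();
    req.setParam("update.chain", "tolerant");   // route through the tolerant chain

    SolrInputDocument good = new SolrInputDocument();
    good.addField("id", "hoss1");
    good.addField("foo_i", 42);                 // valid int

    SolrInputDocument bad = new SolrInputDocument();
    bad.addField("id", "bogus1");
    bad.addField("foo_i", "bogus");             // fails int conversion when indexed

    req.add(good);
    req.add(bad);

    // the interesting variable is which client this is: an HttpSolrClient pointed at a
    // node that does (or doesn't) host a replica of the target shard, or a CloudSolrClient
    return req.process(client, "test_collection");
  }
}
{code}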

----

Independent of all the things I still need to make sense of in the existing code to try and help get these tests passing, I still have one big question about what the desired/expected behavior should be for clients when maxErrors is exceeded -- at the moment, in single node setups, the client gets a 400 error with the top level "error" section corresponding to whatever error caused it to exceed maxErrors, but the responseHeader is still populated with the individual errors and the appropriate numAdds & numErrors, for example...

{code}
$ curl -v -X POST 'http://localhost:8983/solr/techproducts/update?indent=true&commit=true&update.chain=tolerant' -H 'Content-Type: application/json' --data-binary '[{"id":"hoss1","foo_i":42},{"id":"bogus1","foo_i":"bogus"},{"id":"hoss2","foo_i":66},{"id":"bogus2","foo_i":"bogus"},{"id":"bogus3","foo_i":"bogus"},{"id":"hoss3","foo_i":42}]'
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8983 (#0)
> POST /solr/techproducts/update?indent=true&commit=true&update.chain=tolerant HTTP/1.1
> User-Agent: curl/7.38.0
> Host: localhost:8983
> Accept: */*
> Content-Type: application/json
> Content-Length: 175
> 
* upload completely sent off: 175 out of 175 bytes
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain;charset=utf-8
< Transfer-Encoding: chunked
< 
{
  "responseHeader":{
    "numErrors":3,
    "errors":{
      "bogus1":{
        "message":"ERROR: [doc=bogus1] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
      "bogus2":{
        "message":"ERROR: [doc=bogus2] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""},
      "bogus3":{
        "message":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\""}},
    "numAdds":2,
    "status":400,
    "QTime":4},
  "error":{
    "msg":"ERROR: [doc=bogus3] Error adding field 'foo_i'='bogus' msg=For input string: \"bogus\"",
    "code":400}}
* Connection #0 to host localhost left intact
{code}
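
For comparison, when a batch stays under maxErrors the request succeeds, and a SolrJ client can read those same per-document details back out of the response header.  A rough sketch (the key names are taken from the JSON above; req and client are the same as in the earlier sketch):

{code}
// sketch: reading the tolerant processor's bookkeeping out of a successful response
UpdateResponse rsp = req.process(client, "techproducts");
NamedList<?> header = rsp.getResponseHeader();   // org.apache.solr.common.util.NamedList

Object numAdds   = header.get("numAdds");    // docs that were indexed
Object numErrors = header.get("numErrors");  // docs that were skipped
NamedList<?> errors = (NamedList<?>) header.get("errors");  // per-doc messages, keyed by uniqueKey
{code}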

...but because this is a 400 error, that means that if you use HttpSolrClient, you're not going to get access to any of that detailed error information at all -- you'll just get a RemoteSolrException with the bare details.
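
Concretely, that failure mode looks roughly like this on the client side (again just a sketch, with httpClient standing in for a plain HttpSolrClient):

{code}
// sketch: with a 400 response, HttpSolrClient surfaces only the bare exception;
// the structured "errors" list from the responseHeader shown above is unreachable
try {
  req.process(httpClient, "techproducts");
} catch (HttpSolrClient.RemoteSolrException e) {
  int status = e.code();        // 400
  String msg = e.getMessage();  // just the single top-level error message
  // no accessor for the parsed responseHeader / per-doc error details
}
{code}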

* Should the use of this processor force *all* "error" responses to be rewritten as HTTP 200s?
* Should the solrj clients be updated so that RemoteSolrException still provides an accessor to get the parsed/structured SolrResponse (assuming the HTTP response body can be parsed without any other errors)?

Thoughts?


> Update Handlers abort with bad documents
> ----------------------------------------
>
>                 Key: SOLR-445
>                 URL: https://issues.apache.org/jira/browse/SOLR-445
>             Project: Solr
>          Issue Type: Improvement
>          Components: update
>    Affects Versions: 1.3
>            Reporter: Will Johnson
>            Assignee: Anshum Gupta
>         Attachments: SOLR-445-3_x.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445-alternative.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445.patch, SOLR-445_3x.patch, solr-445.xml
>
>
> Has anyone run into the problem of handling bad documents / failures mid-batch?  I.e.:
> <add>
>   <doc>
>     <field name="id">1</field>
>   </doc>
>   <doc>
>     <field name="id">2</field>
>     <field name="myDateField">I_AM_A_BAD_DATE</field>
>   </doc>
>   <doc>
>     <field name="id">3</field>
>   </doc>
> </add>
> Right now Solr adds the first doc and then aborts.  It would seem like it should either fail the entire batch, or log a message/return a code and then continue on to add doc 3.  Option 1 would seem to be much harder to accomplish and possibly require more memory, while Option 2 would require more information to come back from the API.  I'm about to dig into this but I thought I'd ask to see if anyone had any suggestions, thoughts or comments.


