Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2015/10/09 06:21:26 UTC

[jira] [Commented] (SOLR-8129) HdfsChaosMonkeyNothingIsSafeTest failures

    [ https://issues.apache.org/jira/browse/SOLR-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949872#comment-14949872 ] 

Yonik Seeley commented on SOLR-8129:
------------------------------------

This seems to be a case of extreme reordering of updates due to thread scheduling or similar timing effects.
The leader is selected by the chaos monkey to be killed, and a bunch of updates are sent to one replica but not the other (i.e. they may have been queued to be forwarded, but didn't make it out before the leader was killed).
A bunch more updates then flood in, and when a leader election is held and PeerSync runs, both replicas conclude they are up-to-date because their most recent versions do match.
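To make the failure mode concrete, here's a minimal standalone sketch of a PeerSync-style check that compares only the most recent N versions (an illustrative model only, not Solr's actual PeerSync code; the class name, version values, and gap size are made up to mirror the log below):

{code}
import java.util.ArrayList;
import java.util.List;

public class PeerSyncGapDemo {
    static final int N = 100; // size of the recent-versions window compared

    // Newest n versions from a list kept in ascending order.
    static List<Long> recentVersions(List<Long> applied, int n) {
        return applied.subList(Math.max(0, applied.size() - n), applied.size());
    }

    public static void main(String[] args) {
        List<Long> replica36605 = new ArrayList<>();
        List<Long> replica43897 = new ArrayList<>();

        // Both replicas apply updates 1..1000 forwarded by the old leader.
        for (long v = 1; v <= 1000; v++) { replica36605.add(v); replica43897.add(v); }

        // Leader is killed mid-flight: versions 1001..1027 reach only one
        // replica (like ft1-1700 reaching 36605 but never 43897).
        for (long v = 1001; v <= 1027; v++) replica36605.add(v);

        // A flood of later updates, forwarded before the kill but delivered
        // late, reaches both replicas and buries the gap.
        for (long v = 1028; v <= 2000; v++) { replica36605.add(v); replica43897.add(v); }

        // PeerSync-style check: only the newest N versions are compared,
        // so the replicas look identical despite the 27 missing updates.
        boolean looksInSync = recentVersions(replica36605, N)
                .equals(recentVersions(replica43897, N));
        System.out.println("recent " + N + " versions match: " + looksInSync); // true
        System.out.println("sizes: " + replica36605.size() + " vs " + replica43897.size()); // 2000 vs 1973
    }
}
{code}

Real PeerSync works against the update log rather than a flat list, but the blind spot is the same: as long as the newest 100 versions agree, an older hole goes undetected.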

{code}
43897 has fewer docs than 36605: missing: {_version_=1514192462261780480, id=ft1-1715} ... {_version_=1514192462057308160, id=ft1-1700} ... {_version_=1514192462056259585, id=ft1-1698}

// Port 43897 is recovering and begins buffering updates
  2> 31048 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Starting recovery process. recoveringAfterStartup=true
  2> 31049 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy ###### startupVersions=[]

// going "active"
  2> 38552 INFO  (RecoveryThread-collection1) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.c.RecoveryStrategy Registering as Active after recovery.

// example of a normal add
  2> 55692 INFO  (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-48 (1514192447743197184)]} 0 188
  
// The leader is going to be killed
  2> 68281 INFO  (Thread-378) [    ] o.a.s.c.ChaosMonkey monkey: stop shard! 49240
 
// This is presumably where 43897 sees the leader marked as "down" in the clusterstate
  2> 68347 INFO  (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) [n:127.0.0.1:43897_    ] o.a.s.c.c.ZkStateReader A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json for collection collection1 has occurred - updating... (live nodes size: 9)

// updates are still being processed by the leader
  2> 68743 DEBUG (qtp139680011-202) [n:127.0.0.1:49240_ c:collection1 s:shard2 r:core_node1 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-1700} {wt=javabin&version=2}

// Replica 36605 receives and processes the update, but 43897 never does (this may be normal as the leader is being killed)
  2> 69010 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor PRE_UPDATE add{,id=ft1-1700} {update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2}
  2> 69015 INFO  (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 r:core_node7 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-1656 (1514192461610614784), ft1-1670 (1514192462015365120), ft1-1679 (1514192462052065280), ft1-1682 (1514192462053113856), ft1-1688 (1514192462055211008), ft1-1695 (1514192462056259584), ft1-1698 (1514192462056259585), ft1-1700 (1514192462057308160), ft1-1702 (1514192462058356736)]} 0 224

// a ton more updates flood in
  2> 69710 INFO  (qtp1287419856-654) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-1516 (1514192460689965056), ft1-1518 (1514192460829425664), ft1-1519 (1514192460867174400), ft1-1539 (1514192460868222976), ft1-1540 (1514192460868222977), ft1-1542 (1514192460868222978), ft1-1546 (1514192460869271552), ft1-1552 (1514192460869271553), ft1-1563 (1514192460871368704), ft1-1564 (1514192460871368705), ... (41 adds)]} 0 992
  2> 70571 INFO  (qtp1287419856-655) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-1790 (1514192462017462272), ft1-1791 (1514192462035288064), ft1-1803 (1514192462036336640), ft1-1810 (1514192462037385216), ft1-1813 (1514192462042628096), ft1-1814 (1514192462043676672), ft1-1815 (1514192462043676673), ft1-1818 (1514192462044725248), ft1-1819 (1514192462044725249), ft1-1820 (1514192462303723520), ... (45 adds)]} 0 856
  2> 70911 INFO  (qtp1287419856-309) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-1998 (1514192463049261056), ft1-1999 (1514192463084912640), ft1-2001 (1514192463085961216), ft1-2003 (1514192463085961217), ft1-2011 (1514192463087009792), ft1-2012 (1514192463088058368), ft1-2016 (1514192463099592704), ft1-2020 (1514192463101689856), ft1-2022 (1514192463102738432), ft1-2023 (1514192463392145408), ... (35 adds)]} 0 325
  2> 71386 INFO  (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.p.LogUpdateProcessor [collection1] webapp= path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:49240/collection1/&wt=javabin&version=2} {add=[ft1-2203 (1514192463788507136), ft1-2206 (1514192464060088320), ft1-2217 (1514192464074768384), ft1-2219 (1514192464074768385), ft1-2224 (1514192464075816960), ft1-2229 (1514192464075816961), ft1-2231 (1514192464075816962), ft1-2239 (1514192464076865536), ft1-2249 (1514192464077914112), ft1-2268 (1514192464077914113), ... (27 adds)]} 0 471
  

// the most recent 100 versions match on both replicas
  2> 76191 DEBUG (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:43897  sorted versions from http://127.0.0.1:36605/collection1/ = [1514192465046798337, 1514192465046798336, ... , 1514192463538946050, 1514192463538946049]
  2> 76241 DEBUG (qtp33344819-419) [n:127.0.0.1:36605_ c:collection1 s:shard2 r:core_node7 x:collection1] o.a.s.u.PeerSync PeerSync: core=collection1 url=http://127.0.0.1:36605  sorted versions from http://127.0.0.1:43897/collection1/ = [1514192465046798337, 1514192465046798336, ... , 1514192463538946050, 1514192463538946049]

// 43897 becomes the new leader
  2> 76242 INFO  (zkCallback-25-thread-1-processing-n:127.0.0.1:43897_) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.c.ShardLeaderElectionContext I am the new leader: http://127.0.0.1:43897/collection1/ shard2


// end of test
  2> ######shard2 is not consistent.  Got 2076 from http://127.0.0.1:43897/collection1 (previous client) and got 2103 from http://127.0.0.1:36605/collection1
  2> 129545 INFO  (qtp1287419856-653) [n:127.0.0.1:43897_ c:collection1 s:shard2 r:core_node4 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2} hits=2076 status=0 QTime=114 
  2> 129656 INFO  (qtp33344819-422) [n:127.0.0.1:36605_ c:collection1 s:shard2 r:core_node7 x:collection1] o.a.s.c.S.Request [collection1] webapp= path=/select params={q=*:*&distrib=false&tests=checkShardConsistency/showDiff&fl=id,_version_&sort=id+asc&rows=100000&wt=javabin&version=2} hits=2103 status=0 QTime=4 
  2> ######http://127.0.0.1:43897/collection1: SolrDocumentList[sz=2076]=[SolrDocument{id=0-0, _version_=1514192448210862080}, SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000, _version_=1514192493627834368}, SolrDocument{id=0-1006, _version_=1514192493742129152}, SolrDocument{id=0-1008, _version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984, _version_=1514192458787848193}, SolrDocument{id=ft1-991, _version_=1514192459077255168}, SolrDocument{id=ft1-992, _version_=1514192459148558336}, SolrDocument{id=ft1-994, _version_=1514192459149606912}, SolrDocument{id=ft1-997, _version_=1514192459150655488}]
  2> ######http://127.0.0.1:36605/collection1: SolrDocumentList[sz=2103]=[SolrDocument{id=0-0, _version_=1514192448210862080}, SolrDocument{id=0-100, _version_=1514192469996077056}, SolrDocument{id=0-1000, _version_=1514192493627834368}, SolrDocument{id=0-1006, _version_=1514192493742129152}, SolrDocument{id=0-1008, _version_=1514192493795606528}] , [...] , [SolrDocument{id=ft1-984, _version_=1514192458787848193}, SolrDocument{id=ft1-991, _version_=1514192459077255168}, SolrDocument{id=ft1-992, _version_=1514192459148558336}, SolrDocument{id=ft1-994, _version_=1514192459149606912}, SolrDocument{id=ft1-997, _version_=1514192459150655488}]
  2> ###### sizes=2076,2103
  2> ###### Only in http://127.0.0.1:36605/collection1: [{_version_=1514192462261780480, id=ft1-1715}, {_version_=1514192462260731904, id=ft1-1711}, {_version_=1514192462263877632, id=ft1-1719}, {_version_=1514192462262829056, id=ft1-1718}, {_version_=1514192462264926208, id=ft1-1720}, {_version_=1514192462015365120, id=ft1-1670}, {_version_=1514192462276460544, id=ft1-1722}, {_version_=1514192462415921152, id=ft1-1844}, {_version_=1514192462413824000, id=ft1-1840}, {_version_=1514192462230323200, id=ft1-1709}, {_version_=1514192462414872576, id=ft1-1841}, {_version_=1514192462412775424, id=ft1-1838}, {_version_=1514192462412775425, id=ft1-1839}, {_version_=1514192462411726848, id=ft1-1837}, {_version_=1514192462230323201, id=ft1-1710}, {_version_=1514192462057308160, id=ft1-1700}, {_version_=1514192462058356737, id=ft1-1703}, {_version_=1514192462058356736, id=ft1-1702}, {_version_=1514192462375026688, id=ft1-1796}, {_version_=1514192462400192512, id=ft1-1798}, {_version_=1514192462399143936, id=ft1-1797}, {_version_=1514192462193623040, id=ft1-1707}, {_version_=1514192462052065280, id=ft1-1679}, {_version_=1514192462055211008, id=ft1-1688}, {_version_=1514192462053113856, id=ft1-1682}, {_version_=1514192462056259584, id=ft1-1695}, {_version_=1514192462056259585, id=ft1-1698}]
{code}
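
Side note for anyone reading these version numbers: Solr's _version_ values are clock values with a millisecond timestamp shifted into the high bits (the low 20 bits order updates within the same millisecond), so the versions in the diff above can be placed in time. A quick sketch of the decode (treat the exact bit layout as an internal detail of Solr's VersionInfo that could change):

{code}
public class VersionDecode {
    public static void main(String[] args) {
        // ft1-1715 (one of the missing docs) and ft1-2268 (from a later batch
        // that did arrive on 43897)
        long[] versions = { 1514192462261780480L, 1514192464077914113L };
        for (long v : versions) {
            long millis = v >>> 20; // drop the low within-millisecond ordering bits
            System.out.println(v + " -> " + java.time.Instant.ofEpochMilli(millis));
        }
    }
}
{code}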

> HdfsChaosMonkeyNothingIsSafeTest failures
> -----------------------------------------
>
>                 Key: SOLR-8129
>                 URL: https://issues.apache.org/jira/browse/SOLR-8129
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Yonik Seeley
>         Attachments: fail.151005_080319
>
>
> New HDFS chaos test in SOLR-8123 hits a number of types of failures, including shard inconsistency.


