You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Hoss Man (Jira)" <ji...@apache.org> on 2019/09/25 16:48:00 UTC

[jira] [Updated] (SOLR-13792) SolrZkClient should include more MDC info when zkCallback threads process a WatchedEvent

     [ https://issues.apache.org/jira/browse/SOLR-13792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13792:
----------------------------
    Attachment: SOLR-13792.patch
        Status: Open  (was: Open)


I initially set out to see what it would take just to ensure that the {{MDCLoggingContext.setNode(...)}} information was set when {{SolrZkClient}} submitted tasks to the  {{zkCallback}} {{MDCAwareThreadPoolExecutor}}, but the more I looked a it the more I started to think what might make a lot more sense in general is if {{SolrZkClient}} ensured that the MDC context used when submitting a Runnable (to the {{zkCallback}} {{MDCAwareThreadPoolExecutor}}) contained all the MDC information of the thread that *registered* the {{Watcher}} in the first place.

This could be useful to external solrj based {{SolrZkClient}} users that might use MDC when registering their own {{Watcher}} instances, but would also be useful in Solr not only to ensure that "node level" Watchers (the most common) would at least have the {{n:host_port}} dteails in their log messages, but also to help ensure that if any "individual requests or SolrCore level code registered a {{Watcher}} we'd have that collection/core level detail from that request in the log messages as well.

The attached patch takes things in this direction by making a copy of the MDC context info in the {{ProcessWatchWithExecutor}} that {{SolrZkClient.wrapWatcher}} creates for every {{Watcher}} provided by it's caller.  This MDC context copy is then used when the Watcher is processed, but the code ensures that any MDC values set at that time can override them (just in case down the road ZooKeeper ever uses MDC).

On the whole I think it's an improvement -- but one thing I'm not sure about is if/how thse MDC values should be accounted for in {{ProcessWatchWithExecutor.hashCode()}} and {{ProcessWatchWithExecutor.equals()}} given the comments in that class.  (i've included a nocommit regarding this)

----

Some sample log output from an arbitrary cloud test (TestMiniSolrCloudClusterSSL) w/o the patch...

{noformat}
   [junit4]   2> 8511 INFO  (qtp666514211-77) [n:127.0.0.1:43335_solr c:first_collection s:shard1 r:core_node4 x:first_collection_shard1_replica_n1 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&collection.configName=org.apache.solr.cloud.TestMiniSolrCloudClusterSSL&newCollection=true&collection=first_collection&version=2&replicaType=NRT&property.config=solrconfig-tlog.xml&coreNodeName=core_node4&name=first_collection_shard1_replica_n1&action=CREATE&numShards=3&shard=shard1&wt=javabin} status=0 QTime=2840
   [junit4]   2> 8514 INFO  (qtp1436582928-73) [n:127.0.0.1:39513_solr c:first_collection s:shard3 r:core_node6 x:first_collection_shard3_replica_n3 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&collection.configName=org.apache.solr.cloud.TestMiniSolrCloudClusterSSL&newCollection=true&collection=first_collection&version=2&replicaType=NRT&property.config=solrconfig-tlog.xml&coreNodeName=core_node6&name=first_collection_shard3_replica_n3&action=CREATE&numShards=3&shard=shard3&wt=javabin} status=0 QTime=2851
   [junit4]   2> 8530 INFO  (qtp1436582928-75) [n:127.0.0.1:39513_solr     ] o.a.s.h.a.CollectionsHandler Wait for new collection to be active for at most 45 seconds. Check all shard replicas
   [junit4]   2> 8611 INFO  (zkCallback-36-thread-1) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8611 INFO  (zkCallback-36-thread-2) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8611 INFO  (zkCallback-40-thread-3) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8611 INFO  (zkCallback-36-thread-3) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8611 INFO  (zkCallback-44-thread-3) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8612 INFO  (zkCallback-44-thread-1) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8612 INFO  (zkCallback-40-thread-2) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8612 INFO  (zkCallback-44-thread-2) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8611 INFO  (zkCallback-40-thread-4) [     ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8614 INFO  (qtp1436582928-75) [n:127.0.0.1:39513_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={property.config=solrconfig-tlog.xml&collection.
{noformat}

And the same log messages from the same test w/this patch...

{noformat}
   [junit4]   2> 8838 INFO  (qtp1528846265-78) [n:127.0.0.1:34197_solr c:first_collection s:shard2 r:core_node5 x:first_collection_shard2_
replica_n2 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&collection.configName=org.apache.solr.clou
d.TestMiniSolrCloudClusterSSL&newCollection=true&collection=first_collection&version=2&replicaType=NRT&property.config=solrconfig-tlog.xml&coreNodeName=core_node5&name=first_collection_shard2_replica_n2&action=CREATE&numShards=3&shard=shard2&wt=javabin} status=0 QTime=2768
   [junit4]   2> 8840 INFO  (qtp1649430949-72) [n:127.0.0.1:35181_solr c:first_collection s:shard3 r:core_node6 x:first_collection_shard3_replica_n3 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores&collection.configName=org.apache.solr.cloud.TestMiniSolrCloudClusterSSL&newCollection=true&collection=first_collection&version=2&replicaType=NRT&property.config=solrconfig-tlog.xml&coreNodeName=core_node6&name=first_collection_shard3_replica_n3&action=CREATE&numShards=3&shard=shard3&wt=javabin} status=0 QTime=2781
   [junit4]   2> 8854 INFO  (qtp1649430949-76) [n:127.0.0.1:35181_solr     ] o.a.s.h.a.CollectionsHandler Wait for new collection to be active for at most 45 seconds. Check all shard replicas
   [junit4]   2> 8940 INFO  (zkCallback-44-thread-1-processing-n:127.0.0.1:34197_solr x:first_collection_shard2_replica_n2 c:first_collection s:shard2 r:core_node5) [n:127.0.0.1:34197_solr c:first_collection s:shard2 r:core_node5 x:first_collection_shard2_replica_n2 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-44-thread-2-processing-n:127.0.0.1:34197_solr x:first_collection_shard2_replica_n2 c:first_collection s:shard2 r:core_node5) [n:127.0.0.1:34197_solr c:first_collection s:shard2 r:core_node5 x:first_collection_shard2_replica_n2 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-36-thread-2-processing-n:127.0.0.1:33433_solr x:first_collection_shard1_replica_n1 c:first_collection s:shard1 r:core_node4) [n:127.0.0.1:33433_solr c:first_collection s:shard1 r:core_node4 x:first_collection_shard1_replica_n1 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-40-thread-1-processing-n:127.0.0.1:35181_solr x:first_collection_shard3_replica_n3 c:first_collection s:shard3 r:core_node6) [n:127.0.0.1:35181_solr c:first_collection s:shard3 r:core_node6 x:first_collection_shard3_replica_n3 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-36-thread-3-processing-n:127.0.0.1:33433_solr x:first_collection_shard1_replica_n1 c:first_collection s:shard1 r:core_node4) [n:127.0.0.1:33433_solr c:first_collection s:shard1 r:core_node4 x:first_collection_shard1_replica_n1 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-36-thread-1-processing-n:127.0.0.1:33433_solr x:first_collection_shard1_replica_n1 c:first_collection s:shard1 r:core_node4) [n:127.0.0.1:33433_solr c:first_collection s:shard1 r:core_node4 x:first_collection_shard1_replica_n1 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-40-thread-2-processing-n:127.0.0.1:35181_solr x:first_collection_shard3_replica_n3 c:first_collection s:shard3 r:core_node6) [n:127.0.0.1:35181_solr c:first_collection s:shard3 r:core_node6 x:first_collection_shard3_replica_n3 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8940 INFO  (zkCallback-40-thread-3-processing-n:127.0.0.1:35181_solr x:first_collection_shard3_replica_n3 c:first_collection s:shard3 r:core_node6) [n:127.0.0.1:35181_solr c:first_collection s:shard3 r:core_node6 x:first_collection_shard3_replica_n3 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8942 INFO  (zkCallback-44-thread-3-processing-n:127.0.0.1:34197_solr x:first_collection_shard2_replica_n2 c:first_collection s:shard2 r:core_node5) [n:127.0.0.1:34197_solr c:first_collection s:shard2 r:core_node5 x:first_collection_shard2_replica_n2 ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/first_collection/state.json] for collection [first_collection] has occurred - updating... (live nodes size: [3])
   [junit4]   2> 8944 INFO  (qtp1649430949-76) [n:127.0.0.1:35181_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={property.config=solrconfig-tlog.xml&collection.configName=org.apache.solr.cloud.TestMiniSolrCloudClusterSSL&name=first_collection&nrtReplicas=1&action=CREATE&numShards=3&wt=javabin&version=2} status=0 QTime=3373
{noformat}

(FWIW: The logging info w/MDC values populated includes some redundent info because the MDCAwareThreadPoolExecutor putting all MDC values into the thread name -- see GIT:382f9e45e07a384 ... i'm not a fan of this redundency, but it's a bigger issue then the change i'm suggesting, and already affects all other uses of MDCAwareThreadPoolExecutor, like searcherExecutor, OverseerThreadFactory, coreZkRegister, coreLoadExecutor, etc...)


> SolrZkClient should include more MDC info when zkCallback threads process a WatchedEvent
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-13792
>                 URL: https://issues.apache.org/jira/browse/SOLR-13792
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13792.patch
>
>
> One of the biggest headaches when debigging multi-node cloud tests is disambiguiating the log messages and what nodes they came from.
> For many threads, the MDC context info makes this a non-issue, but in the case of "zkCallback" threads it can be virtually impossible to tell which "node" of the cluster each of the zkCallback threads belongs to, because they don't have MDC info ({{SolrZkClient}} already uses an {{MDCAwareThreadPoolExecutor}} to process the ZK {{WatchEvent}} callbacks, and {{MDCAwareThreadPoolExecutor}} ensures that the _submitter's_ MDC values are used in the Thread that executes the Runnable -- but in this case the "submitter" is the ZooKeeper Thread.
> From a test debugging standpoint, it would be very useful if more MDC context info about the *node* existed when {{zkCallback}} thread's execute.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org