You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2019/10/21 22:24:00 UTC

[jira] [Updated] (SOLR-13856) 8.x HdfsWriteToMultipleCollectionsTest jenkins failures due to TImeoutException

     [ https://issues.apache.org/jira/browse/SOLR-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-13856:
--------------------------------------
    Attachment: apache_Lucene-Solr-NightlyTests-8.3_25.log.txt
                apache_Lucene-Solr-repro_3681.log.txt
                8.x.fail3.log.txt
                8.x.fail2.log.txt
                8.x.fail1.log.txt
                8.3.fail3.log.txt
                8.3.fail2.log.txt
                8.3.fail1.log.txt
        Status: Open  (was: Open)


I'm attaching some logs from jenkins, as well as from from my own local runs that show the same failure behavior on 8x and 83 with a variety of seeds.

The specifics of the failure looks like this...

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=HdfsWriteToMultipleCollectionsTest -Dtests.method=test -Dtests.seed=7DB11BE46316786B -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-8.3/test-data/enwiki.random.lines.txt -Dtests.locale=el-CY -Dtests.timezone=Asia/Yekaterinburg -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE  287s J0 | HdfsWriteToMultipleCollectionsTest.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<680> but was:<595>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([7DB11BE46316786B:F5E5243ECDEA1593]:0)
   [junit4]    >        at org.apache.solr.cloud.hdfs.HdfsWriteToMultipleCollectionsTest.test(HdfsWriteToMultipleCollectionsTest.java:129)
   [junit4]    >        at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1082)
   [junit4]    >        at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1054)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
{noformat}

...the exact number "expected" will vary by seed, and the exact number of "actual" docs found will vary by run.  This assertion happens after all of the (parallel) indexing threads have finished, when the test is simply trying to count documents across all of hte various collections it's indexed to, before it makes any HDFS specific assertions.

The root cause of the failure to get the expected number of docs seems to be due to some failures in forwarding updates between nodes -- evidently due to a lock in the HDFS layer?

We'll see lots of log messages showing that that updates are screaming along, and then suddenly there will be a big gap in time where only "metrics" related requests are being logged -- followed by several updateExecutor's logging that they (eventually) timed out waiting for a response from remote nodes to the forwarded pdate requests (sometimes TOLEADER, sometimes FROMLEADER).

These are accompanied by WARNing messages from the HDFS DataStreamer class indicating it caught an InterruptedException while writting to some index file...

{noformat}
   [junit4]   2> 1918294 ERROR (updateExecutor-6515-thread-1-processing-n:127.0.0.1:37445_awn%2Fps x:acollection0_shard2_replica_n3 c:acollection0 s:shard2 r:core_node8) [n:127.0.0.1:37445_awn%2Fps c:acollection0 s:shard2 r:core_node8 x:acollection0_shard2_replica_n3 ] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:34527/awn/ps/acollection0_shard1_replica_n1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F127.0.0.1%3A37445%2Fawn%2Fps%2Facollection0_shard2_replica_n3%2F
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:408)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:754)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:364)
   [junit4]   2>        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
   [junit4]   2>        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:342)
   [junit4]   2>        at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:331)
   [junit4]   2>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   [junit4]   2>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   [junit4]   2>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   [junit4]   2>        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
   [junit4]   2>        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> Caused by: java.util.concurrent.TimeoutException
   [junit4]   2>        at org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:399)
   [junit4]   2>        ... 13 more
   [junit4]   2> 
   [junit4]   2> 1918294 INFO  (qtp1716164178-22854) [n:127.0.0.1:37445_awn%2Fps c:acollection0 s:shard2 r:core_node8 x:acollection0_shard2_replica_n3 ] o.a.s.u.SolrCmdDistributor SolrCmdDistributor found 1 errors
   [junit4]   2> 1918294 INFO  (qtp1716164178-22854) [n:127.0.0.1:37445_awn%2Fps c:acollection0 s:shard2 r:core_node8 x:acollection0_shard2_replica_n3 ] o.a.s.u.DirectUpdateHandler2 start commit{_version_=1647570631572586496,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2> 1918294 INFO  (qtp1716164178-22854) [n:127.0.0.1:37445_awn%2Fps c:acollection0 s:shard2 r:core_node8 x:acollection0_shard2_replica_n3 ] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@34262628 commitCommandVersion:1647570631572586496
   [junit4]   2> 1918296 ERROR (updateExecutor-6546-thread-2-processing-n:127.0.0.1:34527_awn%2Fps x:acollection0_shard1_replica_n1 c:acollection0 s:shard1 r:core_node4) [n:127.0.0.1:34527_awn%2Fps c:acollection0 s:shard1 r:core_node4 x:acollection0_shard1_replica_n1 ] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:45357/awn/ps/acollection0_shard1_replica_n2/update?update.distrib=FROMLEADER&distrib.from=http%3A%2F%2F127.0.0.1%3A34527%2Fawn%2Fps%2Facollection0_shard1_replica_n1%2F
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:408)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:754)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:364)
   [junit4]   2>        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
   [junit4]   2>        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:342)
   [junit4]   2>        at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:331)
   [junit4]   2>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   [junit4]   2>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   [junit4]   2>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   [junit4]   2>        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
   [junit4]   2>        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> Caused by: java.util.concurrent.TimeoutException
   [junit4]   2>        at org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
   [junit4]   2>        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:399)
   [junit4]   2>        ... 13 more
   [junit4]   2> 
   [junit4]   2> 1918297 INFO  (qtp1628872495-22943) [n:127.0.0.1:34595_awn%2Fps c:acollection0 s:shard2 r:core_node7 x:acollection0_shard2_replica_n5 ] o.a.s.u.DirectUpdateHandler2 start commit{_version_=1647570631575732224,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
   [junit4]   2> 1918297 INFO  (qtp1628872495-22943) [n:127.0.0.1:34595_awn%2Fps c:acollection0 s:shard2 r:core_node7 x:acollection0_shard2_replica_n5 ] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@6850bd7f commitCommandVersion:1647570631575732224
   [junit4]   2> 1927547 WARN  (DataStreamer for file /solr_hdfs_home/acollection0/core_node6/data/index/_0_LuceneFixedGap_0.pos) [     ] o.a.h.h.DataStreamer Caught exception
   [junit4]   2>           => java.lang.InterruptedException
   [junit4]   2>        at java.lang.Object.wait(Native Method)
   [junit4]   2> java.lang.InterruptedException: null
   [junit4]   2>        at java.lang.Object.wait(Native Method) ~[?:1.8.0_191]
   [junit4]   2>        at java.lang.Thread.join(Thread.java:1252) ~[?:1.8.0_191]
   [junit4]   2>        at java.lang.Thread.join(Thread.java:1326) [?:1.8.0_191]
   [junit4]   2>        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:986) ~[hadoop-hdfs-client-3.2.0.jar:?]
   [junit4]   2>        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:640) ~[hadoop-hdfs-client-3.2.0.jar:?]
   [junit4]   2>        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:810) ~[hadoop-hdfs-client-3.2.0.jar:?]
   [junit4]   2> 1942300 INFO  (qtp1194399847-22933) [n:127.0.0.1:45357_awn%2Fps c:acollection0 s:shard1 r:core_node6 x:acollection0_shard1_replica_n2 ] o.a.s.s.SolrIndexSearcher Opening [Searcher@30effb28[acollection0_shard1_replica_n2] main]
{noformat}

----

I currently have no idea WTF is going on here.





> 8.x HdfsWriteToMultipleCollectionsTest jenkins failures due to TImeoutException
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-13856
>                 URL: https://issues.apache.org/jira/browse/SOLR-13856
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 8.3.fail1.log.txt, 8.3.fail2.log.txt, 8.3.fail3.log.txt, 8.x.fail1.log.txt, 8.x.fail2.log.txt, 8.x.fail3.log.txt, apache_Lucene-Solr-NightlyTests-8.3_25.log.txt, apache_Lucene-Solr-repro_3681.log.txt
>
>
> I've noticed a trend in jenkins failures where HdfsWriteToMultipleCollectionsTest...
> * does _NOT_ ever seem to fail on master even w/heavy beasting
> * fails on 8.x (28c1049a258bbd060a80803c72e1c6cadc784dab) and 8.3 (25968e3b75e5e9a4f2a64de10500aae10a257bdd) easily
> ** failing seeds frequently reproduce, but not 100%
> ** seeds reproduce even when tested using newer (ie: java11) JVMs
> ** doesn't fail when commenting out HDFS aspects of test
> *** suggests failure cause is somehow specific to HDFS, not differences in the 8x/master HTTP/solr indexing stack...
> *However:* There are currently zero differences between the *.hdfs.* packaged solr code (src or test) on branch_8x vs master; likewise 8x and master also use the exact same hadoop jars.
> So what the hell is different?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org