Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2019/11/12 18:20:00 UTC

[jira] [Commented] (SOLR-13924) MoveReplica failures when using HDFS (NullPointerException)

    [ https://issues.apache.org/jira/browse/SOLR-13924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972677#comment-16972677 ] 

Chris M. Hostetter commented on SOLR-13924:
-------------------------------------------

Reviewing the logs for {{MoveReplicaHDFSTest}} jenkins failures from the past 7 days shows a similar pattern in every failure...

* Test ultimately fails with...{noformat}
   [junit4] FAILURE 19.4s J0 | MoveReplicaHDFSTest.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected not same
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([7631F50DBFD47AFA:FE65CAD711281702]:0)
   [junit4]    >        at org.apache.solr.cloud.MoveReplicaTest.test(MoveReplicaTest.java:147)
{noformat}
** This assertion comes from a loop checking the async status of a {{MoveReplica}} request...{code}
      assertNotSame(rsp.getRequestStatus(), RequestStatusState.FAILED);
{code}
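This isn't the actual {{MoveReplicaTest}} source, but a self-contained sketch of the polling pattern that assertion sits in: repeatedly fetch the async request status and fail fast if it ever reports {{FAILED}}. The enum values mirror SolrJ's {{RequestStatusState}}; the status iterator is a stand-in for successive {{REQUESTSTATUS}} responses.

```java
import java.util.Iterator;
import java.util.List;

public class AsyncStatusPoll {
  // Mirrors org.apache.solr.client.solrj.response.RequestStatusState values.
  enum RequestStatusState { SUBMITTED, RUNNING, COMPLETED, FAILED, NOT_FOUND }

  /**
   * Polls until a terminal state is seen; throws AssertionError on FAILED,
   * analogous to the assertNotSame(...) in the test.
   */
  static RequestStatusState awaitNotFailed(Iterator<RequestStatusState> statuses) {
    RequestStatusState state = RequestStatusState.SUBMITTED;
    while (statuses.hasNext()) {
      state = statuses.next();
      if (state == RequestStatusState.FAILED) {
        throw new AssertionError("expected not same"); // the failure seen in the logs
      }
      if (state == RequestStatusState.COMPLETED) {
        break; // request finished successfully
      }
    }
    return state;
  }

  public static void main(String[] args) {
    // Simulated status sequence for a request that eventually completes.
    List<RequestStatusState> ok = List.of(
        RequestStatusState.RUNNING, RequestStatusState.RUNNING, RequestStatusState.COMPLETED);
    System.out.println(awaitNotFailed(ok.iterator())); // prints COMPLETED
  }
}
```

So when the overseer marks the movereplica operation failed (as in the ERROR below), the very next status poll trips this assertion.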
* Looking back in the logs we see...{noformat}
   [junit4]   2> 1325710 ERROR (OverseerThreadFactory-2562-thread-3-processing-n:127.0.0.1:37772_solr) [n:127.0.0.1:37772_solr c:MoveReplicaHDFSTest_coll_true  r:core_node8  ] o.a.s.c.a.c.O
verseerCollectionMessageHandler Collection: MoveReplicaHDFSTest_coll_true operation: movereplica failed:java.lang.NullPointerException
   [junit4]   2>        at org.apache.solr.cloud.api.collections.MoveReplicaCmd.moveHdfsReplica(MoveReplicaCmd.java:220)
   [junit4]   2>        at org.apache.solr.cloud.api.collections.MoveReplicaCmd.moveReplica(MoveReplicaCmd.java:160)
   [junit4]   2>        at org.apache.solr.cloud.api.collections.MoveReplicaCmd.call(MoveReplicaCmd.java:70)
   [junit4]   2>        at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:263)
{noformat}
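The trace pinpoints {{MoveReplicaCmd.moveHdfsReplica}} line 220 but not which reference is null there. Purely as an illustration of the kind of hardening that would turn this into a diagnosable error rather than a raw NPE, here is a generic guard pattern; the property names ({{ulogDir}}, etc.) and the {{requireProp}} helper are hypothetical, not Solr's actual code.

```java
import java.util.Map;
import java.util.Objects;

public class NullGuardSketch {
  /**
   * Looks up a required replica property, converting a silent null into a
   * descriptive error message (hypothetical helper, not from MoveReplicaCmd).
   */
  static String requireProp(Map<String, String> replicaProps, String key) {
    return Objects.requireNonNull(
        replicaProps.get(key),
        () -> "movereplica failed: replica is missing required property '" + key + "'");
  }

  public static void main(String[] args) {
    // Example replica properties; 'ulogDir' here is an illustrative key only.
    Map<String, String> props = Map.of("ulogDir", "hdfs://nn/solr/ulog");
    System.out.println(requireProp(props, "ulogDir"));
  }
}
```

With a guard like this, the overseer log would name the missing value instead of just printing the NullPointerException stack above.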

If we look back at the long term stats, we see a significant uptick in jenkins failures from this test starting around 2019-10-18.  Note the number of jenkins failures from this test every day in 2019 (skipping days where there were '0' failures)...

{noformat}
~/jenkins-reports/output/html/reports/archive/daily$ zgrep -c MoveReplicaHDFSTest 2019-*method-failures.csv.gz | grep -v ':0$'
2019-03-13.method-failures.csv.gz:1
2019-10-18.method-failures.csv.gz:1
2019-10-19.method-failures.csv.gz:1
2019-10-20.method-failures.csv.gz:8
2019-10-21.method-failures.csv.gz:5
2019-10-23.method-failures.csv.gz:1
2019-10-24.method-failures.csv.gz:2
2019-10-25.method-failures.csv.gz:1
2019-10-26.method-failures.csv.gz:1
2019-10-27.method-failures.csv.gz:2
2019-10-28.method-failures.csv.gz:6
2019-10-29.method-failures.csv.gz:4
2019-10-30.method-failures.csv.gz:2
2019-10-31.method-failures.csv.gz:7
2019-11-01.method-failures.csv.gz:3
2019-11-02.method-failures.csv.gz:2
2019-11-03.method-failures.csv.gz:3
2019-11-04.method-failures.csv.gz:2
2019-11-05.method-failures.csv.gz:7
2019-11-06.method-failures.csv.gz:6
2019-11-08.method-failures.csv.gz:6
2019-11-09.method-failures.csv.gz:6
2019-11-10.method-failures.csv.gz:1
2019-11-11.method-failures.csv.gz:9
2019-11-12.method-failures.csv.gz:10
{noformat}

That date corresponds closely to when SOLR-13843 was committed (86a40c1cd5691ce8c9c233c9a8186a4f50aa4f5f on master).

If I compare {{git co master}}, {{git co 86a40c1cd5691ce8c9c233c9a8186a4f50aa4f5f}} and {{git co 86a40c1cd5691ce8c9c233c9a8186a4f50aa4f5f~1}}, I see that beasting this seed just a few times fails reliably on {{HEAD}} and @ {{86a40c1cd56}}, but passes reliably one SHA prior to the SOLR-13843 commit....

{noformat}
hossman@slate:~/lucene/dev/solr/core [j11] [master] $ ant beast -Dbeast.iters=10  -Dtestcase=MoveReplicaHDFSTest -Dtests.nightly=true -Dtests.slow=true -Dtests.seed=7631F50DBFD47AFA
...
  [beaster] Tests with failures [seed: 7631F50DBFD47AFA]:
  [beaster]   - org.apache.solr.cloud.MoveReplicaHDFSTest.test
{noformat}


{noformat}
hossman@slate:~/lucene/dev/solr/core [j11] [86a40c1cd56] $ ant beast -Dbeast.iters=10  -Dtestcase=MoveReplicaHDFSTest -Dtests.nightly=true -Dtests.slow=true -Dtests.seed=7631F50DBFD47AFA
...
  [beaster] Tests with failures [seed: 7631F50DBFD47AFA]:
  [beaster]   - org.apache.solr.cloud.MoveReplicaHDFSTest.test
{noformat}


{noformat}
hossman@slate:~/lucene/dev/solr/core [j11] [63e9bcf5d15] $ ant beast -Dbeast.iters=10  -Dtestcase=MoveReplicaHDFSTest -Dtests.nightly=true -Dtests.slow=true -Dtests.seed=7631F50DBFD47AFA
...
  [beaster] Beast round 10 results: /home/hossman/lucene/dev/solr/build/solr-core/test/10
  [beaster] Beasting finished Successfully.
{noformat}




> MoveReplica failures when using HDFS (NullPointerException)
> -----------------------------------------------------------
>
>                 Key: SOLR-13924
>                 URL: https://issues.apache.org/jira/browse/SOLR-13924
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.3
>            Reporter: Chris M. Hostetter
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>
> Based on recent jenkins test failures, it appears that attempting to use the "MoveReplica" command on HDFS has a high chance of failure due to an underlying NPE.
> I'm not sure if this bug *only* affects HDFS, or if it's just more likely to occur when using HDFS due to some timing quirks.
> It's also possible that the bug impacts non-HDFS users just as much as HDFS users, but only manifests in our tests due to some quirk of our {{cloud-hdfs}} test configs.
> The problem appears to be new in 8.3 as a result of changes made in SOLR-13843.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
