Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/10/12 13:25:00 UTC

[jira] [Created] (HDDS-4336) ContainerInfo does not persist BCSID leading to failed replicas reports

Stephen O'Donnell created HDDS-4336:
---------------------------------------

             Summary: ContainerInfo does not persist BCSID leading to failed replicas reports
                 Key: HDDS-4336
                 URL: https://issues.apache.org/jira/browse/HDDS-4336
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: SCM
    Affects Versions: 1.1.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


If you create a container and then close it, the BCSID (block commit sequence ID) is synced on the datanodes, and SCM then records the value by setting the "sequenceId" field on the container's ContainerInfo object.

If you later restart just SCM, the sequenceID becomes null, and then container reports for the replica fail with a stack trace like:

{code}
Exception in thread "EventQueue-ContainerReportForContainerReportHandler" java.lang.AssertionError
	at org.apache.hadoop.hdds.scm.container.ContainerInfo.updateSequenceId(ContainerInfo.java:176)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerStats(AbstractContainerReportHandler.java:108)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:83)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:162)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:130)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
	at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

The failing assertion is the one in updateSequenceId, which does not allow the sequenceId to be changed on a CLOSED container:

{code}
  public void updateSequenceId(long sequenceID) {
    assert (isOpen() || state == HddsProtos.LifeCycleState.QUASI_CLOSED);
    sequenceId = max(sequenceID, sequenceId);
  }
{code}
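
To make the failure mode concrete, here is a minimal, self-contained sketch of the same guard. The class and enum names are hypothetical stand-ins, not the real Ozone types, and the model assumes "open" means only the OPEN state. It uses an explicit exception instead of `assert` so that the check fires even when the JVM runs without `-ea`.

```java
/** Hypothetical, simplified model of the guard in updateSequenceId; not Ozone code. */
final class SequenceGuardDemo {
  enum State { OPEN, QUASI_CLOSED, CLOSED }

  static final class Container {
    private final State state;
    private long sequenceId;

    Container(State state, long sequenceId) {
      this.state = state;
      this.sequenceId = sequenceId;
    }

    /** Accepts updates only for OPEN or QUASI_CLOSED containers, like the quoted assertion. */
    void updateSequenceId(long reported) {
      if (state != State.OPEN && state != State.QUASI_CLOSED) {
        throw new IllegalStateException("sequenceId update rejected in state " + state);
      }
      sequenceId = Math.max(reported, sequenceId);
    }

    long getSequenceId() {
      return sequenceId;
    }
  }

  public static void main(String[] args) {
    // While the container is open, a reported BCSID is accepted.
    Container open = new Container(State.OPEN, 0L);
    open.updateSequenceId(42L);
    System.out.println("open container sequenceId = " + open.getSequenceId());

    // After a restart that lost the value, the container is CLOSED with sequenceId 0,
    // so a replica reporting 42 triggers an update that the guard rejects.
    Container closedAfterRestart = new Container(State.CLOSED, 0L);
    try {
      closedAfterRestart.updateSequenceId(42L);
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In this sketch the second call is rejected only because the stored value was reset. Had the CLOSED container kept its sequenceId, no update would have been needed in the first place.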

The issue seems to be caused by the serialisation and deserialisation of the containerInfo object to protobuf, as sequenceId is never persisted or restored.
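
The suspected round-trip loss can be illustrated with a small sketch. Everything here is an assumption for illustration: `ContainerRecord`, its field map, and the method names are hypothetical, and a plain `Map` stands in for the protobuf message. The point is only that a field omitted on write comes back as its default on read.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for ContainerInfo; a Map plays the role of the protobuf message. */
final class ContainerRecord {
  enum State { OPEN, QUASI_CLOSED, CLOSED }

  final long containerId;
  final State state;
  final long sequenceId;

  ContainerRecord(long containerId, State state, long sequenceId) {
    this.containerId = containerId;
    this.state = state;
    this.sequenceId = sequenceId;
  }

  /** Lossy serialisation: sequenceId is never written, mirroring the suspected bug. */
  Map<String, String> toFieldsLossy() {
    Map<String, String> fields = new HashMap<>();
    fields.put("containerId", Long.toString(containerId));
    fields.put("state", state.name());
    return fields;
  }

  /** Serialisation that also persists sequenceId. */
  Map<String, String> toFields() {
    Map<String, String> fields = toFieldsLossy();
    fields.put("sequenceId", Long.toString(sequenceId));
    return fields;
  }

  /** Deserialisation: a missing sequenceId falls back to 0, as an unset numeric field would. */
  static ContainerRecord fromFields(Map<String, String> fields) {
    return new ContainerRecord(
        Long.parseLong(fields.get("containerId")),
        State.valueOf(fields.get("state")),
        Long.parseLong(fields.getOrDefault("sequenceId", "0")));
  }
}

class SequenceIdRoundTrip {
  public static void main(String[] args) {
    ContainerRecord closed = new ContainerRecord(1L, ContainerRecord.State.CLOSED, 42L);

    ContainerRecord lossy = ContainerRecord.fromFields(closed.toFieldsLossy());
    ContainerRecord full = ContainerRecord.fromFields(closed.toFields());

    System.out.println("after lossy round trip: sequenceId = " + lossy.sequenceId); // 0
    System.out.println("after full round trip:  sequenceId = " + full.sequenceId);  // 42
  }
}
```

A round-trip check of this shape, written against the real serialisation methods, would be one way to confirm whether sequenceId survives an SCM restart.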

However, I am also confused about how this ever worked, as this is a pretty significant problem.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
