Posted to issues@ozone.apache.org by "Shashikant Banerjee (Jira)" <ji...@apache.org> on 2020/03/18 14:32:00 UTC

[jira] [Commented] (HDDS-3229) 2-way commit did not happen when WRITE failure injected in one of the datanodes of a piepeline

    [ https://issues.apache.org/jira/browse/HDDS-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061769#comment-17061769 ] 

Shashikant Banerjee commented on HDDS-3229:
-------------------------------------------

The client did try a 3-way commit first, which failed, but by the time it tried the 2-way commit, the pipeline had already been destroyed.
{code:java}
2020-03-16 04:05:22,537|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|20/03/16 04:05:22 WARN scm.XceiverClientRatis:* 3 way commit failed on pipeline Pipeline*[ Id: 3a772023-bb6e-4b22-806c-e79bc42429df, Nodes: 88c993a5-18ca-45c7-b854-e72cc5012709{ip: 172.27.79.65, host: quasar-elfnqw-8.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}8442f7b4-a498-41ed-86b8-9ed4f0207f4f{ip: 172.27.25.64, host: quasar-elfnqw-1.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}bbc2192c-382e-45c9-979b-912108b7e915{ip: 172.27.86.128, host: quasar-elfnqw-3.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:bbc2192c-382e-45c9-979b-912108b7e915, CreationTimestamp2020-03-16T03:43:10.514Z]

2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|java.util.concurrent.ExecutionException: org.apache.ratis.protocol.GroupMismatchException: 8442f7b4-a498-41ed-86b8-9ed4f0207f4f: group-E79BC42429DF not found.
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:262)
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:190)
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnFirstIndex(CommitWatcher.java:133)
2020-03-16 04:05:22,538|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:345)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFullBuffer(BlockOutputStream.java:322)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.writeOnRetry(BlockOutputStream.java:300)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.writeOnRetry(BlockOutputStreamEntry.java:201)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.ozone.client.io.KeyOutputStream.writeToOutputStream(KeyOutputStream.java:238)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:218)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleRetry(KeyOutputStream.java:401)
2020-03-16 04:05:22,539|INFO|MainThread|machine.py:180 - run()||GUID=fef08720-0e0c-40e4-b285-c0bac009bb14|at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleException(KeyOutputStream.

{code}
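
To make the race concrete, here is a minimal sketch in Java of the fallback order described above. It is a toy model under stated assumptions, not the XceiverClientRatis code: the class CommitFallbackSketch, the Level enum, the Pipeline stand-in, GroupGoneException and the betweenAttempts hook are all hypothetical names introduced for illustration. The hook models the window between the failed 3-way watch and the 2-way watch.

```java
/** Hypothetical sketch: every name here is illustrative, not the Ozone or Ratis API. */
final class CommitFallbackSketch {

  /** Replication level a watch request waits for (assumed names). */
  enum Level { ALL_COMMITTED, MAJORITY_COMMITTED }

  /** Stand-in for the "group not found" failure seen once a pipeline is gone. */
  static final class GroupGoneException extends RuntimeException {
    GroupGoneException(String message) { super(message); }
  }

  /** Toy pipeline: replica count, replicas that acknowledged the write, destroyed flag. */
  static final class Pipeline {
    final int replicas;
    final int acked;
    boolean destroyed;

    Pipeline(int replicas, int acked) {
      this.replicas = replicas;
      this.acked = acked;
    }
  }

  /** Returns true if enough replicas acknowledged; throws if the pipeline no longer exists. */
  static boolean watch(Pipeline p, Level level) {
    if (p.destroyed) {
      throw new GroupGoneException("group not found");
    }
    int needed = (level == Level.ALL_COMMITTED) ? p.replicas : p.replicas / 2 + 1;
    return p.acked >= needed;
  }

  /**
   * Tries the 3-way (all replicas) watch first, then falls back to 2-way (majority).
   * betweenAttempts models whatever happens in the window between the two calls.
   */
  static Level watchForCommit(Pipeline p, Runnable betweenAttempts) {
    if (watch(p, Level.ALL_COMMITTED)) {
      return Level.ALL_COMMITTED;
    }
    betweenAttempts.run();
    if (watch(p, Level.MAJORITY_COMMITTED)) {
      return Level.MAJORITY_COMMITTED;
    }
    throw new IllegalStateException("write not committed on a majority");
  }

  public static void main(String[] args) {
    // One of three replicas fails its write; the pipeline survives, so 2-way succeeds.
    Pipeline stable = new Pipeline(3, 2);
    System.out.println(watchForCommit(stable, () -> { }));

    // Same write, but the pipeline is destroyed before the fallback runs.
    Pipeline racing = new Pipeline(3, 2);
    try {
      watchForCommit(racing, () -> racing.destroyed = true);
    } catch (GroupGoneException e) {
      System.out.println("2-way watch failed: " + e.getMessage());
    }
  }
}
```

In this model the fallback logic itself is fine; the 2-way watch only fails because the pipeline disappears in between, which is the situation the log shows.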
 

All the pipelines were continuously getting destroyed because, on each datanode, a pipeline action is initiated to destroy the pipeline on any Ratis write/apply failure. In this system, one node in each pipeline is injecting errors in the write path itself.
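
A rough way to picture this, as a sketch only: the Java model below treats a pipeline as a list of datanode ids and closes it as soon as any member's write fails. PipelineTeardownSketch, injectWriteFault and write are hypothetical names; the real pipeline action is reported by the datanode and handled elsewhere, which this toy model collapses into a single flag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model; class and method names are illustrative, not Ozone's. */
final class PipelineTeardownSketch {

  /** A pipeline is a list of datanode ids; it stays open until a member reports a failure. */
  static final class Pipeline {
    final String id;
    final List<String> nodes;
    boolean open = true;

    Pipeline(String id, List<String> nodes) {
      this.id = id;
      this.nodes = nodes;
    }
  }

  /** Datanodes whose write path has an injected fault. */
  private final Set<String> faultyNodes = new HashSet<>();

  void injectWriteFault(String node) {
    faultyNodes.add(node);
  }

  /**
   * Replicates one write to every member. Any member whose write fails
   * triggers the destroy action for the whole pipeline.
   * Returns true only if the pipeline is still open afterwards.
   */
  boolean write(Pipeline p) {
    if (!p.open) {
      return false;
    }
    for (String node : p.nodes) {
      if (faultyNodes.contains(node)) {
        p.open = false; // assumed behaviour: destroy on any write/apply failure
      }
    }
    return p.open;
  }

  public static void main(String[] args) {
    PipelineTeardownSketch cluster = new PipelineTeardownSketch();
    Pipeline p1 = new Pipeline("p1", Arrays.asList("dn1", "dn2", "dn3"));
    Pipeline p2 = new Pipeline("p2", Arrays.asList("dn4", "dn5", "dn6"));
    // One faulty datanode per pipeline, as in the test setup.
    cluster.injectWriteFault("dn1");
    cluster.injectWriteFault("dn5");
    System.out.println("p1 open after write: " + cluster.write(p1));
    System.out.println("p2 open after write: " + cluster.write(p2));
  }
}
```

With one faulty member in every pipeline, no pipeline in this model survives its first write, so there is never a healthy pipeline left to fall back to.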

The client ultimately gives up after retrying 100 times over multiple pipelines, as a full key write never survived because all the pipelines were destroyed.

 
{code:java}
20/03/16 06:38:10 WARN io.KeyOutputStream: Encountered exception java.io.IOException: Unexpected Storage Container Exception: java.io.IOException: Unexpected Storage Container Exception: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.GroupMismatchException: 1f06b734-97a7-4462-8d4c-52eb13be1075: group-4AC604E07728 not found. on the pipeline Pipeline[ Id: dea3d845-66d8-4380-8d09-4ac604e07728, Nodes: 045c0770-8df3-40f6-945d-190f72ea80e1{ip: 172.27.100.192, host: quasar-elfnqw-4.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}1f06b734-97a7-4462-8d4c-52eb13be1075{ip: 172.27.13.128, host: quasar-elfnqw-10.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}df437a1c-2dd6-412f-9c1a-06c6be63167b{ip: 172.27.22.64, host: quasar-elfnqw-6.quasar-elfnqw.root.hwx.site, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:OPEN, leaderId:1f06b734-97a7-4462-8d4c-52eb13be1075, CreationTimestamp2020-03-16T06:19:11.016Z]. The last committed block length is 0, uncommitted data length is 33554432 retry count 100..{code}
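
The give-up behaviour can be sketched as a bounded retry loop. This is an illustration under assumptions, not the KeyOutputStream implementation: KeyWriteRetrySketch, Result and writeKey are hypothetical names, the limit of 100 is taken from "retry count 100" in the log above, and the model assumes each retry moves to a fresh pipeline and that nothing is committed unless the whole key is written.

```java
import java.util.function.IntPredicate;

/** Hypothetical sketch of a bounded retry loop; not the KeyOutputStream implementation. */
final class KeyWriteRetrySketch {

  /** Taken from "retry count 100" in the log; the real configuration key is not shown there. */
  static final int MAX_RETRIES = 100;

  /** Outcome of a key write in this toy model. */
  static final class Result {
    final boolean success;
    final int retries;
    final long committedLength;
    final long uncommittedLength;

    Result(boolean success, int retries, long committedLength, long uncommittedLength) {
      this.success = success;
      this.retries = retries;
      this.committedLength = committedLength;
      this.uncommittedLength = uncommittedLength;
    }
  }

  /**
   * Attempts the write on successive pipelines. writeOnPipeline.test(n) returns true
   * if the full key commits on the pipeline used after n retries.
   */
  static Result writeKey(long keyLength, IntPredicate writeOnPipeline) {
    for (int retries = 0; ; retries++) {
      if (writeOnPipeline.test(retries)) {
        return new Result(true, retries, keyLength, 0L);
      }
      if (retries == MAX_RETRIES) {
        // Retry budget exhausted: nothing committed, the whole key is still pending.
        return new Result(false, retries, 0L, keyLength);
      }
    }
  }

  public static void main(String[] args) {
    long keyLength = 32L * 1024 * 1024; // the 32 MB key from the test
    Result r = writeKey(keyLength, n -> false); // every pipeline gets destroyed
    System.out.println("success=" + r.success
        + " retries=" + r.retries
        + " committed=" + r.committedLength
        + " uncommitted=" + r.uncommittedLength);
  }
}
```

When every attempt fails, the model ends in the same state the log reports: no committed data, the full 32 MB key still uncommitted, and the retry budget used up.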
 
 
 

> 2-way commit did not happen when WRITE failure injected in one of the datanodes of a piepeline
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDDS-3229
>                 URL: https://issues.apache.org/jira/browse/HDDS-3229
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Nilotpal Nandi
>            Priority: Major
>              Labels: fault_injection
>
> This is an extension of bug HDDS-3214.
> Steps taken:
> 1) Mounted noise injection FUSE on all datanodes
> 2) Selected 1 datanode from each open pipeline (factor=3)
> 3) Injected WRITE FAILURE noise with error code ENOENT on the "hdds.datanode.dir" path of the datanodes selected in step 2)
> 4) Started a PUT key operation of size 32 MB.
>  
> Observation :
> ----------------
> PUT key operation failed. 
> As there is a WRITE failure on one of the datanodes in the pipeline, the 3-way commit should fail.
> But the client should then proceed with a 2-way commit, and the operation should have succeeded.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
