Posted to issues@ozone.apache.org by "Uma Maheswara Rao G (Jira)" <ji...@apache.org> on 2022/02/28 17:24:00 UTC

[jira] [Resolved] (HDDS-6373) EC: Exclude pipeline upon container close instead of exclude DNs.

     [ https://issues.apache.org/jira/browse/HDDS-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G resolved HDDS-6373.
---------------------------------------
    Fix Version/s: EC-Branch
       Resolution: Fixed

> EC: Exclude pipeline upon container close instead of exclude DNs.
> -----------------------------------------------------------------
>
>                 Key: HDDS-6373
>                 URL: https://issues.apache.org/jira/browse/HDDS-6373
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Mark Gui
>            Assignee: Mark Gui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: EC-Branch
>
>
> When a container is closed because it is full, the DN replies to the client with a ContainerNotOpenException. This does not mean that the DN has failed, so the DN should not be excluded from new block group allocation. Otherwise many HEALTHY DNs may end up excluded, and new block group allocation may fail in a small cluster.
> E.g.
> 45 DNs (docker simulated), ozone-site.xml:
>   <property>
>     <name>ozone.scm.container.size</name>
>     <value>256MB</value>
>   </property>
>   <property>
>     <name>ozone.scm.block.size</name>
>     <value>16MB</value>
>   </property>
> Testing with Freon ockg:
> ./bin/ozone freon ockg --type=EC --replication=rs-10-4-1024k -p test -n 10 -t 10 -s $((4 * 1024 * 1024 * 1024))
> results in 5-8 failures with HDDS-6364 patched.
> {code:java}
> INTERNAL_ERROR org.apache.hadoop.ozone.om.exceptions.OMException: Allocated 0 blocks. Requested 1 blocks
>         at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:660)
>         at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.allocateBlock(OzoneManagerProtocolClientSideTranslatorPB.java:695)
>         at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateNewBlock(BlockOutputStreamEntryPool.java:309)
>         at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.allocateBlockIfNeeded(BlockOutputStreamEntryPool.java:371)
>         at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.rewriteStripeToNewBlockGroup(ECKeyOutputStream.java:244)
>         at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.handleStripeFailure(ECKeyOutputStream.java:586)
>         at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.checkAndWriteParityCells(ECKeyOutputStream.java:306)
>         at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.write(ECKeyOutputStream.java:192)
>         at org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:50)
>         at org.apache.hadoop.ozone.freon.ContentGenerator.write(ContentGenerator.java:76)
>         at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:146)
>         at com.codahale.metrics.Timer.time(Timer.java:101)
>         at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:143)
>         at org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:183)
>         at org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:163)
>         at org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$1(BaseFreonGenerator.java:146)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>         Suppressed: java.lang.IllegalArgumentException: Expected writeOffset= 1069543424 Expected offset=1059061760
>                 at com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
>                 at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:564)
>                 at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:61)
>                 at org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$36(OzoneClientKeyGenerator.java:151)
>                 ... 8 more
> One ore more freon test is failed.
> 2022-02-24 08:41:44,272 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=313491.661668, max=577254.304029, mean=563762.9508485134, stddev=44787.24799551536, median=575542.093982, p75=577254.304029, p95=577254.304029, p98=577254.304029, p99=577254.304029, p999=577254.304029, mean_rate=0.017322637056902915, m1=0.029562618662863496, m5=0.014855802773079099, m15=0.007191674083204336, rate_unit=events/second, duration_unit=milliseconds
> 2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 578
> 2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 6
> 2022-02-24 08:41:44,273 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 4 {code}
> But with this fix and HDDS-6364 together, all 10 executions succeed across many rounds.
> {code:java}
> 2022-02-24 10:56:45,013 [Thread-4] INFO freon.ProgressBar: Progress: 90.00 % (9 out of 10)
> 2022-02-24 10:56:46,013 [Thread-4] INFO freon.ProgressBar: Progress: 100.00 % (10 out of 10)
> 2022-02-24 10:56:46,257 [shutdown-hook-0] INFO metrics: type=TIMER, name=key-create, count=10, min=958022.893372, max=1038271.448129, mean=1018238.201558835, stddev=22083.604143242464, median=1029968.020144, p75=1034239.403617, p95=1038271.448129, p98=1038271.448129, p99=1038271.448129, p999=1038271.448129, mean_rate=0.009623163938983789, m1=0.09995782091693355, m5=0.02731461121892791, m15=0.009684867189776935, rate_unit=events/second, duration_unit=milliseconds
> 2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Total execution time (sec): 1040
> 2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Failures: 0
> 2022-02-24 10:56:46,258 [shutdown-hook-0] INFO freon.BaseFreonGenerator: Successful executions: 10 {code}
>  
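
The exclusion behaviour described in the report above can be sketched roughly as follows. This is a minimal illustration and not the actual Ozone client code: ExcludeSketch, ExcludeList, recordContainerClosed, canAllocate and simulate are hypothetical names, and the sketch assumes for simplicity that each closed container's block group spans 14 distinct DNs (10 data + 4 parity for rs-10-4) and that only excluded DNs, not excluded pipelines, shrink the set of candidate nodes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only; these are not the real Ozone classes.
class ExcludeSketch {

    // Minimal stand-in for a client-side exclude list.
    static final class ExcludeList {
        final Set<String> datanodes = new HashSet<>();
        final Set<String> pipelines = new HashSet<>();
    }

    // Records a ContainerNotOpenException-style reply from one DN.
    // excludePipeline == false models the behaviour the report describes as
    // problematic (exclude the DN); true models the proposed behaviour.
    static void recordContainerClosed(ExcludeList ex, boolean excludePipeline,
                                      String pipelineId, String dn) {
        if (excludePipeline) {
            ex.pipelines.add(pipelineId);
        } else {
            ex.datanodes.add(dn);
        }
    }

    // Assumption: a new block group needs groupSize non-excluded DNs.
    static boolean canAllocate(int totalDns, ExcludeList ex, int groupSize) {
        return totalDns - ex.datanodes.size() >= groupSize;
    }

    // Closes closedGroups containers, each on its own set of groupSize DNs,
    // then reports whether one more block group could still be allocated.
    static boolean simulate(boolean excludePipeline, int totalDns,
                            int groupSize, int closedGroups) {
        ExcludeList ex = new ExcludeList();
        for (int g = 0; g < closedGroups; g++) {
            for (int i = 0; i < groupSize; i++) {
                String dn = "dn-" + ((g * groupSize + i) % totalDns);
                recordContainerClosed(ex, excludePipeline, "pipeline-" + g, dn);
            }
        }
        return canAllocate(totalDns, ex, groupSize);
    }

    public static void main(String[] args) {
        System.out.println("exclude DNs:       " + simulate(false, 45, 14, 3));
        System.out.println("exclude pipelines: " + simulate(true, 45, 14, 3));
    }
}
```

Under these assumptions, with 45 DNs and three closed block groups, excluding DNs leaves 45 - 42 = 3 candidates, fewer than the 14 required, while excluding pipelines keeps all 45 DNs eligible.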


