Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2022/08/31 23:16:00 UTC

[jira] [Commented] (HDDS-7196) Disk space used by failed job(teragen here) is not reclaimable

    [ https://issues.apache.org/jira/browse/HDDS-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598661#comment-17598661 ] 

Ethan Rose commented on HDDS-7196:
----------------------------------

Hi [~frnklnsm]. Since the data is not showing up in the committed namespace, it looks like the clients were stopped after they had written data to the datanodes but before they could commit it to OM, which is the step that makes it visible in the namespace. This means the corresponding keys remain as open keys in the Ozone Manager. Ozone's master branch and the upcoming 1.3.0 release include an open key cleanup service. It scans the open key table for keys that have been open for over a week (configurable using om.open.key.expire.threshold) and moves them to the deleted key table. In prior versions, like the 1.0.0 listed here, open keys remain in the system indefinitely, which appears to be what you observed.
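
As a rough illustration of that expiry rule, here is a minimal Python sketch. It is not Ozone's implementation: the two dictionaries are hypothetical stand-ins for the open key and deleted key tables, the key names are made up, and the one-week threshold is hard-coded rather than read from configuration.

```python
from datetime import datetime, timedelta

def expire_open_keys(open_keys, deleted_keys, now, threshold=timedelta(days=7)):
    """Move keys that have been open longer than `threshold` to `deleted_keys`.

    `open_keys` maps a key name to the time the key was opened.
    Both dictionaries are hypothetical stand-ins for OM tables.
    """
    expired = [name for name, opened_at in open_keys.items()
               if now - opened_at > threshold]
    for name in expired:
        deleted_keys[name] = open_keys.pop(name)
    return expired

# Made-up example: one key opened 11 days ago, one opened yesterday.
open_keys = {
    "attempt_0/part-m-00000": datetime(2022, 8, 20),
    "attempt_1/part-m-00001": datetime(2022, 8, 30),
}
deleted_keys = {}
expired = expire_open_keys(open_keys, deleted_keys, now=datetime(2022, 8, 31))
```

Only the key opened on 2022-08-20 is past the threshold, so it alone moves to the deleted table; the newer key stays open until a later scan.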

Ozone's normal key deletion flow will take effect after that; it is the same flow used when keys are explicitly deleted. Every minute, the OM scans the deleted key table and sends the blocks of up to 20,000 keys to SCM for deletion. Every minute, SCM sends up to 20,000 blocks to their corresponding datanodes for deletion. Every minute, datanodes scan containers for blocks to delete, eventually removing them from the system. Note that blocks in an open container are not deleted until the container is closed.
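
To get a feel for how long that pipeline can take to drain, here is a toy Python model. It is only a sketch under simplifying assumptions that are not from Ozone itself: every key has exactly one block, all three stages run in lockstep once per minute, datanodes delete everything they received in the previous round, and the open container caveat is ignored. The function name and parameters are hypothetical.

```python
def minutes_to_drain(deleted_keys, om_batch=20_000, scm_batch=20_000):
    """Count one-minute rounds until all blocks are removed by datanodes.

    Toy model: one block per key, and a block advances at most one
    stage (OM -> SCM -> datanode -> removed) per round.
    """
    at_om, at_scm, at_dn, removed = deleted_keys, 0, 0, 0
    minutes = 0
    while removed < deleted_keys:
        minutes += 1
        # Datanodes delete the blocks they received in earlier rounds.
        removed += at_dn
        at_dn = 0
        # SCM forwards a bounded batch of blocks to the datanodes.
        moved = min(at_scm, scm_batch)
        at_scm -= moved
        at_dn += moved
        # OM forwards the blocks of a bounded batch of keys to SCM.
        moved = min(at_om, om_batch)
        at_om -= moved
        at_scm += moved
    return minutes

print(minutes_to_drain(50_000))
```

In this model even a single deleted key needs three rounds to pass through all stages, and a large backlog is limited by the per-minute batch size, so space is reclaimed gradually rather than immediately after the delete.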

 

 

> Disk space used by failed job(teragen here) is not reclaimable
> --------------------------------------------------------------
>
>                 Key: HDDS-7196
>                 URL: https://issues.apache.org/jira/browse/HDDS-7196
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode
>         Environment: |Apache Ozone|1.0.0|
>            Reporter: Franklinsam Paul
>            Priority: Major
>         Attachments: Ozone usage_ after_failing_cleanup.png, Ozone usage_ fresh_install.png
>
>
> On a fresh Ozone cluster, I ran a teragen job and killed it at around 25% completion. This left Ozone with about 74.4GB used, but none of the written files are listed.
> The issue can be reproduced with the steps below (snapshots from the Recon UI are attached for usage reference).
> {code:java}
>  ozone sh volume create  o3://ozonefrankserviceid/testvol/
>  ozone sh bucket create o3://ozonefrankserviceid/testvol/testbucket
>  
>  yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen -Dmapreduce.job.maps=2 1000000000 ofs://ozonefrankserviceid/testvol/testbucket/teragentest1
>  
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary/1
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary/1/_temporary
>  ozone fs -du -s -h ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary/1/_temporary
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary/1/_temporary/attempt_1661777485132_0001_m_000000_2 --> no files/object
>  ozone fs -ls ofs://ozonefrankserviceid/testvol/testbucket/teragentest1/_temporary/1/_temporary/attempt_1661777485132_0001_m_000001_2 --> no files/object
>  
>  Ozone usage is increased in the recon UI as 75GB
>  
>  hdfs dfs -rm -r -skipTrash ofs://ozonefrankserviceid/testvol/testbucket/teragentest1
>  ozone sh bucket delete o3://ozonefrankserviceid/testvol/testbucket
>  
>  [root@DNHOST1 ozone-conf]# grep -A1 'hdds.datanode.dir' ozone-site.xml
>     <name>hdds.datanode.dir</name>
>     <value>/var/lib/hadoop-ozone/datanode/data</value>
> [root@DNHOST1 ozone-conf]#
> [root@DNHOST1 containerDir0]# du -sh /var/lib/hadoop-ozone/datanode/data/hdds/a9461a7f-ef81-4942-a278-15ff7602df14/current/containerDir0/
> 26G    /var/lib/hadoop-ozone/datanode/data/hdds/a9461a7f-ef81-4942-a278-15ff7602df14/current/containerDir0/
> [root@DNHOST1 containerDir0]#
> [root@DNHOST1 chunks]# ozone sh volume list  o3://ozonefrankserviceid/ -a |egrep 'name|usedNamespace'
>   "name" : "s3v",
>   "usedNamespace" : 0,
>     "name" : "om",
>   "name" : "testvol",
>   "usedNamespace" : 0,
>     "name" : "hive/HMSHOST.example.com@SUPPORT.COM",
>     "name" : "hive",
> [root@DNHOST1 chunks]# {code}


