You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/07/13 09:56:00 UTC

[jira] [Updated] (HDDS-5005) Multipart Upload fails due to partName mismatch

     [ https://issues.apache.org/jira/browse/HDDS-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDDS-5005:
---------------------------------
    Labels: pull-request-available  (was: )

> Multipart Upload fails due to partName mismatch
> -----------------------------------------------
>
>                 Key: HDDS-5005
>                 URL: https://issues.apache.org/jira/browse/HDDS-5005
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, S3
>    Affects Versions: 1.0.0
>         Environment: h4. {color:#172b4d}Ozone 1.0.0{color}
>  * version = 1.0.0
>  *  build = https://github.com/apache/hadoop-ozone.git/28d372ca903b4741131bace09e0339e9161257bb ; compiled by 'sammi' on 2020-08-25T13:05Z
>  * java = 1.8.0_162
> h4. OS
>  * Ubuntu 18.04.5 LTS
>  * Linux kernel 5.4.0-64-generic
>            Reporter: Kiyoshi Mizumaru
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: om-audit-HOSTNAME-2021-03-14-19-53-09-1.log.gz
>
>
> We're running the official Ozone 1.0.0 release and facing S3 Multipart Upload failures with large files. The error message looks similar to that is reported in HDDS-3554 but we'd like to report what we've found so far to help the further investigation of this issue.
> h1. The error message recorded in OM log
> Please find the following error message excerpted from our OM. Forgive us we redacted some sensitive information such as username and keyname which imply our project's topic.
> {code:java}
> 2021-03-14 07:48:41,947 [IPC Server handler 88 on default port 9862] ERROR org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest: MultipartUpload Complete request failed for Key: <REDACTED_KEYNAME> in Volume/Bucket s3v/<BUCKETNAME>
> INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s3v bucket: <BUCKETNAME> key: <REDACTED_KEYNAME>. Provided Part info is { /s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884795658268282, 4}, whereas OM has partName /s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791629180406
>         at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest.validateAndUpdateCache(S3MultipartUploadCompleteRequest.java:199)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:227)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:224)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
>         at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:74)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
>         at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:999)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:927)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2915){code}
> Anyway, OM thinks the partName for the partNumber 4 is /s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791629180406 but COMPLETE_MULTIPART_UPLOAD request think it must be /s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884795658268282. This discrepancy is the immediate cause for this error.
> h1. OM audit log says both are correct
> Please find attached om-audit-HOSTNAME-2021-03-14-19-53-09-1.log.gz (also redacted, sorry), it contains filtered output of our OM audit log, the lines which include <REDACTED_KEYNAME> and multipartList entry are remain.
> Interestingly, according to the OM audit log, there're two COMMIT_MULTIPART_UPLOAD_PARTKEY operations exist for partNumber=4 and both operations were succeeded:
>  
> {code:java}
> % zgrep partNumber=4, om-audit-HOSTNAME-2021-03-14-19-53-09-1.log.gz
> 2021-03-14 07:16:04,992 | INFO  | OMAudit | user=<REDACTED_UPN> | ip=10.192.17.172 | op=COMMIT_MULTIPART_UPLOAD_PARTKEY {volume=s3v, bucket=<BUCKETNAME>, key=<REDACTED_KEYNAME>, dataSize=8388608, replicationType=RATIS, replicationFactor=ONE, partNumber=4, partName=/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884795658268282} | ret=SUCCESS | 
> 2021-03-14 07:18:11,828 | INFO  | OMAudit | user=<REDACTED_UPN> | ip=10.192.17.172 | op=COMMIT_MULTIPART_UPLOAD_PARTKEY {volume=s3v, bucket=<BUCKETNAME>, key=<REDACTED_KEYNAME>, dataSize=8388608, replicationType=RATIS, replicationFactor=ONE, partNumber=4, partName=/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791629180406} | ret=SUCCESS | 
> %
> {code}
>  
> OM seemed to have accepted both partName ending with 105884795658268282 and  105884791629180406 for partNumber 4. And COMPLETE_MULTIPART_UPLOAD operation was called with the prior partName but OM believed it had the latter partName for partNumber 4.
>  
> {code:java}
> 2021-03-14 07:48:41,947 | ERROR | OMAudit | user=<REDACTED_UPN> | ip=10.192.17.172 | op=COMPLETE_MULTIPART_UPLOAD {volume=s3v, bucket=<BUCKETNAME>, key=<REDACTED_KEYNAME>, dataSize=0, replicationType=
> RATIS, replicationFactor=ONE, multipartList=[partNumber: 1
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791631605244"
> , partNumber: 2
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791631539707"
> , partNumber: 3
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791628262900"
> , partNumber: 4
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884795658268282"
> , partNumber: 5
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791629245944"
> , partNumber: 6
> partName: "/s3v/<BUCKETNAME>/<REDACTED_KEYNAME>105884791629245943"
> {code}
> We can also find there're multiple COMMIT_MULTIPART_UPLOAD_PARTKEY operations for several partNumbers, such as partNumber 4, 13, 20, 45, 57, 67, 73, ... some partNumbers like 172 have more than three COMMIT_MULTIPART_UPLOAD_PARTKEY operations they're all succeeded.
>  
> h1. How to solve this issue?
> At first we thought this issue is caused by race condition, but noticed that there're enough time between each COMMIT_MULTIPART_UPLOAD_PARKEY operation. We're not sure but noticed that write operations to OmMetadataManager are isolated with omMetadataManager.getLock().acquireWriteLock(BUCKET_LOCK, volumeName, bucketName);
>  
> {code:java}
>      multipartKey = omMetadataManager.getMultipartKey(volumeName,
>          bucketName, keyName, uploadID);
>      // TODO to support S3 ACL later.
>      acquiredLock = omMetadataManager.getLock().acquireWriteLock(BUCKET_LOCK,
>          volumeName, bucketName);
>      validateBucketAndVolume(omMetadataManager, volumeName, bucketName);
>      String ozoneKey = omMetadataManager.getOzoneKey(
>          volumeName, bucketName, keyName);
>      OmMultipartKeyInfo multipartKeyInfo = omMetadataManager
>          .getMultipartInfoTable().get(multipartKey);
> {code}
>  
> So our question is, is it normal to have multiple COMMIT_MULTIPART_UPLOAD_PARTKEY operations for a partNumber, with different partNames?
> h1. Other findings
> This issue occurs less frequently with aws configure set default.s3.multipart_chunksize 256MB. Almost always fails with multipart_chunksize 8MB, 1GB in our environment.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org