Posted to common-issues@hadoop.apache.org by "Sean Mackrory (JIRA)" <ji...@apache.org> on 2017/03/15 01:01:53 UTC

[jira] [Comment Edited] (HADOOP-14036) S3Guard: intermittent duplicate item keys failure

    [ https://issues.apache.org/jira/browse/HADOOP-14036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925351#comment-15925351 ] 

Sean Mackrory edited comment on HADOOP-14036 at 3/15/17 12:59 AM:
------------------------------------------------------------------

I'm not proposing this for inclusion just yet (although it's possible this is precisely the correct solution); it's just a proof of concept of the problem. I see paths getting added to the containers of objects to move in two places: in the loop I'm modifying, and again further down, below the comment "We moved all the children, now move the top-level dir."
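
To make that concrete, here's a hypothetical sketch of the pattern I'm describing; the names are illustrative only, not the actual S3AFileSystem / DynamoDBMetadataStore internals:

{code}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of the double-add described above; names are
 * illustrative, not the real S3AFileSystem code.
 */
public class DuplicateMoveEntrySketch {

  static List<String> collectPathsToMove(List<String> listing, String topLevelDir) {
    List<String> pathsToMove = new ArrayList<>();
    // The loop over the listObjects() results: if the listing happens to
    // include a marker for the top-level directory itself, the directory
    // is added to the collection here once...
    for (String key : listing) {
      pathsToMove.add(key);
    }
    // ...and then a second time here ("We moved all the children, now
    // move the top-level dir"). The collection is later broken into
    // DynamoDB batches, and a batch containing the same item key twice
    // fails with "Provided list of item keys contains duplicates".
    pathsToMove.add(topLevelDir);
    return pathsToMove;
  }

  public static void main(String[] args) {
    List<String> listing = List.of("/test", "/test/new", "/test/new/file");
    List<String> paths = collectPathsToMove(listing, "/test");
    Set<String> unique = new LinkedHashSet<>(paths);
    // Prints "duplicate keys: true" -- /test appears twice.
    System.out.println("duplicate keys: " + (paths.size() != unique.size()));
  }
}
{code}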

I should dig a bit into the listObjects call, as I'm curious why we don't hit this problem in many more tests / workloads that involve renames. I'm also not entirely sure we actually have to move the top-level dir last (although my current fix does ensure it is added last). If the move isn't atomic, the invariant that parent paths always exist will be violated at some point for either the new path or the old path, and this particular operation just adds entries to the collection that later gets broken into batches. It seems cleaner, IMO, to do it last as we do now, but I want to think it through a bit more. Speak up if you have any insight or opinions there...

After applying this fix (checking whether the directory we're adding in the loop matches the top-level directory we add separately at the end, and skipping it if so; a sketch of the check follows the failure listing below), I was able to run that test over and over without problems, and after reverting the fix I reproduced the issue at least 50% of the time. On one run I got the batch of failures listed below, and I'm positive no other workload was using that bucket at the time, yet I've since been able to run each of those tests successfully and complete several more full runs without a problem:

{code}
Failed tests: 
  ITestS3GuardToolDynamoDB>S3GuardToolTestBase.testPruneCommandConf:157->S3GuardToolTestBase.testPruneCommand:135->Assert.assertEquals:542->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListLocatedStatusFiltering:499->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$AllPathsFilter@69b9805a ) expected:<2> but was:<1>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testListStatusFiltering:466->AbstractContractGetFileStatusTest.verifyListStatus:534->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 length of listStatus(s3a://mackrory/test, org.apache.hadoop.fs.contract.AbstractContractGetFileStatusTest$MatchesNameFilter@4ce8f437 ) expected:<1> but was:<0>
  ITestS3AContractGetFileStatus>AbstractContractGetFileStatusTest.testComplexDirActions:143->AbstractContractGetFileStatusTest.checkListStatusStatusComplexDir:162->Assert.assertEquals:555->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88 listStatus(): file count in 1 directory and 0 files expected:<4> but was:<0>

Tests in error: 
  ITestS3AEncryptionSSEKMSDefaultKey>AbstractTestS3AEncryption.testEncryptionOverRename:71 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testReadSmallFile:531 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testNegativeSeek:181 » FileNotFound
  ITestS3AContractSeek>AbstractContractSeekTest.testSeekFile:207 » FileNotFound ...
  ITestS3AContractSeek>AbstractContractSeekTest.testReadFullyPastEOF:467 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.deepDirectoryStructureToRemote:90->AbstractContractDistCpTest.deepDirectoryStructure:139 » FileNotFound
  ITestS3AContractDistCp>AbstractContractDistCpTest.largeFilesToRemote:96->AbstractContractDistCpTest.largeFiles:174 » FileNotFound
{code}
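
For reference, the shape of the check I applied is roughly the following; again a hypothetical sketch under the same illustrative names as above, not the actual patch:

{code}
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the fix; names are illustrative, not the actual patch. */
public class SkipTopLevelDirSketch {

  static List<String> collectPathsToMove(List<String> listing, String topLevelDir) {
    List<String> pathsToMove = new ArrayList<>();
    for (String key : listing) {
      // Skip the top-level directory here: it is appended separately
      // (and last) below, and adding it in the loop as well would put
      // the same item key into one DynamoDB batch twice.
      if (key.equals(topLevelDir)) {
        continue;
      }
      pathsToMove.add(key);
    }
    // "We moved all the children, now move the top-level dir."
    pathsToMove.add(topLevelDir);
    return pathsToMove;
  }

  public static void main(String[] args) {
    List<String> listing = List.of("/test", "/test/new", "/test/new/file");
    // Prints [/test/new, /test/new/file, /test] -- no duplicates, and the
    // top-level dir is still added last.
    System.out.println(collectPathsToMove(listing, "/test"));
  }
}
{code}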



> S3Guard: intermittent duplicate item keys failure
> -------------------------------------------------
>
>                 Key: HADOOP-14036
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14036
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: HADOOP-13345
>            Reporter: Aaron Fabbri
>            Assignee: Mingliang Liu
>         Attachments: HADOOP-14036-HADOOP-13345.000.patch
>
>
> I see this occasionally when running integration tests with -Ds3guard -Ddynamo:
> {noformat}
> testRenameToDirWithSamePrefixAllowed(org.apache.hadoop.fs.s3a.ITestS3AFileSystemContract)  Time elapsed: 2.756 sec  <<< ERROR!
> org.apache.hadoop.fs.s3a.AWSServiceIOException: move: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG): Provided list of item keys contains duplicates (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: QSBVQV69279UGOB4AJ4NO9Q86VVV4KQNSO5AEMVJF66Q9ASUAAJG)
>         at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:178)
>         at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.move(DynamoDBMetadataStore.java:408)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:869)
>         at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:662)
>         at org.apache.hadoop.fs.FileSystemContractBaseTest.rename(FileSystemContractBaseTest.java:525)
>         at org.apache.hadoop.fs.FileSystemContractBaseTest.testRenameToDirWithSamePrefixAllowed(FileSystemContractBaseTest.java:669)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcces
> {noformat}


