Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2020/08/24 16:59:00 UTC

[jira] [Commented] (HADOOP-17190) Intermittent ITestTerasortOnS3A.test_120_terasort failure

    [ https://issues.apache.org/jira/browse/HADOOP-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183452#comment-17183452 ] 

Steve Loughran commented on HADOOP-17190:
-----------------------------------------

I've looked further and can see the problem: clock skew between the S3Guard and S3 timestamps causes the container localizer to fail.
{code}
2020-08-24 14:58:31,059 [IPC Server handler 3 on 65048] WARN  localizer.ResourceLocalizationService (ResourceLocalizationService.java:processHeartbeat(1152)) - { s3a://stevel-london/terasort-directory/sortout/_partition.lst, 1598277507027, FILE, null } failed: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
java.io.IOException: Resource s3a://stevel-london/terasort-directory/sortout/_partition.lst changed on src filesystem (expected 1598277507027, was 1598277507000
	at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:248)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:241)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:229)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

It's intermittent because it only happens when the timestamp added to the S3Guard table when the upload completed differs from the one reported by the S3A FS.

This is the localizer being brittle to clock errors; really it needs a tolerance range within which it doesn't treat a small timestamp difference as a changed file.
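
As a rough illustration of what such a range check could look like: in the log above the two timestamps differ by only 27 ms (1598277507027 vs 1598277507000), which suggests the check is an exact match. The sketch below is hypothetical and is not the actual FSDownload code; the class name TimestampTolerance, its methods and the toleranceMillis parameter are all assumptions made up for the example.

```java
import java.io.IOException;

/**
 * Hypothetical helper, not part of Hadoop or YARN: compares a resource's
 * expected modification time with the one reported by the source
 * filesystem, allowing a configurable amount of clock skew.
 */
final class TimestampTolerance {

  private TimestampTolerance() {
  }

  /** True if the two timestamps differ by no more than toleranceMillis. */
  static boolean withinTolerance(long expectedMillis, long actualMillis,
      long toleranceMillis) {
    return Math.abs(expectedMillis - actualMillis) <= toleranceMillis;
  }

  /**
   * Throws an IOException only when the difference exceeds the tolerance.
   * The message mirrors the wording seen in the log above.
   */
  static void verifyUnchanged(String resource, long expectedMillis,
      long actualMillis, long toleranceMillis) throws IOException {
    if (!withinTolerance(expectedMillis, actualMillis, toleranceMillis)) {
      throw new IOException("Resource " + resource
          + " changed on src filesystem (expected " + expectedMillis
          + ", was " + actualMillis + ")");
    }
  }
}
```

With a tolerance of, say, 1000 ms the _partition.lst timestamps from the log would be accepted; with a tolerance of 0 the check degenerates to an exact match and would reject them. What a safe default would be is an open question, since too wide a range could hide a genuine overwrite of the file.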

> Intermittent ITestTerasortOnS3A.test_120_terasort failure
> ---------------------------------------------------------
>
>                 Key: HADOOP-17190
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17190
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.3.0
>            Reporter: Mukund Thakur
>            Priority: Minor
>
> [*INFO*] Running org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] *Tests* *run: 14*, *Failures: 2*, Errors: 0, *Skipped: 2*, Time elapsed: 110.43 s *<<< FAILURE!* - in org.apache.hadoop.fs.s3a.commit.terasort.*ITestTerasortOnS3A*
> [*ERROR*] test_120_terasort[directory](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A)  Time elapsed: 6.261 s  <<< FAILURE!
> java.lang.AssertionError: terasort(s3a://mthakur-data/terasort-directory/sortin, s3a://mthakur-data/terasort-directory/sortout) failed expected:<0> but was:<1>
>  at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
>  at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>  
> [*ERROR*] test_120_terasort[magic](org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A)  Time elapsed: 5.962 s  <<< FAILURE!
> java.lang.AssertionError: terasort(s3a://mthakur-data/terasort-magic/sortin, s3a://mthakur-data/terasort-magic/sortout) failed expected:<0> but was:<1>
>  at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.executeStage(ITestTerasortOnS3A.java:241)
>  at org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A.test_120_terasort(ITestTerasortOnS3A.java:291)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org