Posted to dev@oozie.apache.org by "Robert Kanter (JIRA)" <ji...@apache.org> on 2015/07/16 00:23:05 UTC

[jira] [Updated] (OOZIE-1837) LauncherMainHadoopUtils sensitive to clock skew

     [ https://issues.apache.org/jira/browse/OOZIE-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kanter updated OOZIE-1837:
---------------------------------
    Attachment: OOZIE-1837.001.patch

The patch makes the code detect when this occurs and tries to work around the issue by using twice the difference as an offset (moving the start time earlier and the end time later), on the assumption that this is wide enough to capture the time window that we actually want here.  It also prints a warning message telling the user to fix their clocks.

Writing a unit test for this will be tricky because (a) it's time-specific and (b) it's in hadoop-utils, which gets messy with the different Hadoop versions.  I did, however, test it out on a cluster by hacking the start time to always be 5 seconds after the end time and adding some print statements:
{noformat}
 AAA: startTime = 1436998590089 :: Wed Jul 15 15:16:30 PDT 2015
 AAA: endTime   = 1436998585089 :: Wed Jul 15 15:16:25 PDT 2015
 WARNING: Clock skew between the Oozie server host and this host detected.  Please fix this.  Attempting to work around...
 BBB: startTime = 1436998580089 :: Wed Jul 15 15:16:20 PDT 2015
 BBB: endTime   = 1436998595089 :: Wed Jul 15 15:16:35 PDT 2015
{noformat}
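
For readers without the patch at hand, the arithmetic behind the output above can be sketched roughly as follows. This is only an illustration reconstructed from the printed values, not the code in OOZIE-1837.001.patch; the class name {{ClockSkewWindow}} and the method {{adjustRange}} are hypothetical.

```java
// Illustrative sketch only: names are hypothetical and the logic is inferred
// from the AAA/BBB values printed above, not copied from the actual patch.
class ClockSkewWindow {

    /**
     * Returns {startTime, endTime}. If startTime is after endTime (which
     * suggests clock skew between hosts), the range is widened by twice
     * the difference on each side so that begin <= end holds again.
     */
    static long[] adjustRange(long startTime, long endTime) {
        if (startTime > endTime) {
            System.err.println("WARNING: Clock skew between the Oozie server host"
                    + " and this host detected.  Please fix this."
                    + "  Attempting to work around...");
            long diff = startTime - endTime;
            startTime -= 2 * diff; // move the start earlier
            endTime += 2 * diff;   // move the end later
        }
        return new long[] {startTime, endTime};
    }

    public static void main(String[] args) {
        // Same values as the "AAA" lines above (start is 5 seconds after end).
        long[] range = adjustRange(1436998590089L, 1436998585089L);
        System.out.println("startTime = " + range[0]); // 1436998580089
        System.out.println("endTime   = " + range[1]); // 1436998595089
    }
}
```

With the inputs from the test run, a 5000 ms difference gives a 10000 ms offset on each side, which matches the "BBB" lines. A range with the start time at or before the end time is returned unchanged.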

> LauncherMainHadoopUtils sensitive to clock skew
> -----------------------------------------------
>
>                 Key: OOZIE-1837
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1837
>             Project: Oozie
>          Issue Type: Bug
>         Environment: Oozie 4.0.0 (CDH5)
>            Reporter: Lars Francke
>            Assignee: Robert Kanter
>            Priority: Minor
>         Attachments: OOZIE-1837.001.patch
>
>
> The method {{getChildYarnJobs}} in {{LauncherMainHadoopUtils}} can fail with a message like {{begin > end in range (begin, end): (1399972474014, 1399972473948)}}.
> {code}
> startTime = Long.parseLong((System.getProperty("oozie.job.launch.time")));
> ....
> gar.setStartRange(startTime, System.currentTimeMillis());
> {code}
> I guess this happens when the server on which the launch time was set has a different time than the one this task is running on. In our case there was a skew of about 8 seconds, which caused all of our jobs that hit this server to fail.
> I understand that clock skew is generally not a good idea, but I feel that Oozie could be a bit more resilient here, or maybe print a better warning?


