Posted to common-dev@hadoop.apache.org by Allen Wittenauer <aw...@altiscale.com> on 2015/05/03 22:02:02 UTC

we need a fix: precommit failures correlate to hdfs patches

	
	So, as some may have noticed, I slammed the Jenkins servers over the weekend to get some recent patch test runs in JIRA for the bug bash this week.  I've had a suspicion for a while now that either the long run times of the hadoop-hdfs module unit tests (typically 2+ hours) or the hdfs tests themselves were related to the patch process directory getting removed out from underneath test-patch.

	To test the hypothesis, I submitted all of the non-HDFS patches so that they were first in the queue.  Let them run for a very long time.  Jenkins bounced back and forth between YARN, MR, and HADOOP.  No issues encountered.  Added HDFS patches into the mix. BOOM. The dreaded "The patch artifact directory has been removed!" started to appear here and there.  This seems to provide some evidence that, yes, hdfs unit tests are directly or indirectly related to the failures.

	IMO, we need to take a serious look at:

	* splitting up the hadoop-hdfs module into multiple modules to reduce unit test run times
	* checking to see if the precommit hooks in hdfs are different from the rest (I do know that the YARN bits are different and appear to have some bugs as well)
	* increasing the timeout for Jenkins job runs

	FWIW, I’ve also found some minor things here and there with the rewritten test-patch.sh.  JIRAs have been filed.  One critical, one major and a handful of minor things.    

Re: we need a fix: precommit failures correlate to hdfs patches

Posted by Allen Wittenauer <aw...@altiscale.com>.
FWIW, I’m working on getting the Jenkins race conditions that Sean pointed out fixed in HADOOP-11917.


On May 4, 2015, at 2:23 PM, Chris Nauroth <cn...@hortonworks.com> wrote:

> If we suspect long run times are a potential root cause, then another
> thing we could try is turning on parallel test execution.  To do that,
> we'd add the -Pparallel-tests argument and possibly tune
> -DtestsThreadCount=N.  (The default for N is 4.)
> 
> https://issues.apache.org/jira/browse/HADOOP-9287
> 
> This has given some of us significant speed-ups while running tests in our
> dev environments.  I haven't tried it in a while though, so we might
> surface some test isolation problems, such as if 2 test suites tried to
> work in the same directory for data.  We cleaned up a lot of issues like
> that before committing the parallel-tests patches, but it's possible new
> problems have crept in.
> 
> --Chris Nauroth
> 
> On 5/3/15, 9:02 PM, "Sean Busbey" <bu...@cloudera.com> wrote:
> 
>> The patch artifact directory in the mainline Hadoop Jenkins jobs is
>> outside of the workspace. I'm not sure what, if anything, Jenkins
>> guarantees about files outside of the main workspace.
>> 
>> They all write to ${WORKSPACE}/../patchProcess, which will probably
>> collide if multiple runs happen on the same machine. They also all
>> blindly move that directory at the end of the run.
>> 
>> On Sun, May 3, 2015 at 3:02 PM, Allen Wittenauer <aw...@altiscale.com> wrote:
>> 
>>> 
>>>        So, as some may have noticed, I slammed the Jenkins servers over
>>> the weekend to get some recent patch test runs in JIRA for the bug bash
>>> this week.  I've had a suspicion for a while now that either the long
>>> run times of the hadoop-hdfs module unit tests (typically 2+ hours) or
>>> the hdfs tests themselves were related to the patch process directory
>>> getting removed out from underneath test-patch.
>>> 
>>>        To test the hypothesis, I submitted all of the non-HDFS patches
>>> so that they were first in the queue.  Let them run for a very long
>>> time.  Jenkins bounced back and forth between YARN, MR, and HADOOP.  No
>>> issues encountered.  Added HDFS patches into the mix. BOOM. The dreaded
>>> "The patch artifact directory has been removed!" started to appear here
>>> and there.  This seems to provide some evidence that, yes, hdfs unit
>>> tests are directly or indirectly related to the failures.
>>> 
>>>        IMO, we need to take a serious look at:
>>> 
>>>        * splitting up the hadoop-hdfs module into multiple modules to
>>> reduce unit test run times
>>>        * checking to see if the precommit hooks in hdfs are different
>>> from the rest (I do know that the YARN bits are different and appear to
>>> have some bugs as well)
>>>        * increasing the timeout for Jenkins job runs
>>> 
>>>        FWIW, I’ve also found some minor things here and there with the
>>> rewritten test-patch.sh.  JIRAs have been filed.  One critical, one
>>> major and a handful of minor things.
>> 
>> 
>> 
>> 
>> -- 
>> Sean
> 


Re: we need a fix: precommit failures correlate to hdfs patches

Posted by Chris Nauroth <cn...@hortonworks.com>.
If we suspect long run times are a potential root cause, then another
thing we could try is turning on parallel test execution.  To do that,
we'd add the -Pparallel-tests argument and possibly tune
-DtestsThreadCount=N.  (The default for N is 4.)

https://issues.apache.org/jira/browse/HADOOP-9287

This has given some of us significant speed-ups while running tests in our
dev environments.  I haven't tried it in a while though, so we might
surface some test isolation problems, such as if 2 test suites tried to
work in the same directory for data.  We cleaned up a lot of issues like
that before committing the parallel-tests patches, but it's possible new
problems have crept in.
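
For anyone who wants to try this locally, here is a minimal sketch of the invocation described above.  The -Pparallel-tests profile and the -DtestsThreadCount property are the ones from HADOOP-9287; the helper function name and the choice of the "test" goal are assumptions made for illustration, not the actual precommit job configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of test-patch.sh): print the Maven command
# line for a parallel unit test run.
parallel_test_cmd() {
  # Fall back to 4 threads, the default N mentioned above, when no
  # thread count is passed in.
  echo "mvn test -Pparallel-tests -DtestsThreadCount=${1:-4}"
}

# Print the command for an 8-thread run.  Run the printed line from the
# module you want to test, e.g. hadoop-hdfs.
parallel_test_cmd 8
```

Printing the command instead of running it keeps the sketch safe to paste anywhere; whether a higher thread count actually helps depends on the build machine and on how well the test suites are isolated from each other.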

--Chris Nauroth




On 5/3/15, 9:02 PM, "Sean Busbey" <bu...@cloudera.com> wrote:

>The patch artifact directory in the mainline Hadoop Jenkins jobs is
>outside of the workspace. I'm not sure what, if anything, Jenkins
>guarantees about files outside of the main workspace.
>
>They all write to ${WORKSPACE}/../patchProcess, which will probably
>collide if multiple runs happen on the same machine. They also all
>blindly move that directory at the end of the run.
>
>On Sun, May 3, 2015 at 3:02 PM, Allen Wittenauer <aw...@altiscale.com> wrote:
>
>>
>>         So, as some may have noticed, I slammed the Jenkins servers over
>> the weekend to get some recent patch test runs in JIRA for the bug bash
>> this week.  I've had a suspicion for a while now that either the long
>> run times of the hadoop-hdfs module unit tests (typically 2+ hours) or
>> the hdfs tests themselves were related to the patch process directory
>> getting removed out from underneath test-patch.
>>
>>         To test the hypothesis, I submitted all of the non-HDFS patches
>> so that they were first in the queue.  Let them run for a very long
>> time.  Jenkins bounced back and forth between YARN, MR, and HADOOP.  No
>> issues encountered.  Added HDFS patches into the mix. BOOM. The dreaded
>> "The patch artifact directory has been removed!" started to appear here
>> and there.  This seems to provide some evidence that, yes, hdfs unit
>> tests are directly or indirectly related to the failures.
>>
>>         IMO, we need to take a serious look at:
>>
>>         * splitting up the hadoop-hdfs module into multiple modules to
>> reduce unit test run times
>>         * checking to see if the precommit hooks in hdfs are different
>> from the rest (I do know that the YARN bits are different and appear to
>> have some bugs as well)
>>         * increasing the timeout for Jenkins job runs
>>
>>         FWIW, I’ve also found some minor things here and there with the
>> rewritten test-patch.sh.  JIRAs have been filed.  One critical, one
>> major and a handful of minor things.
>
>
>
>
>-- 
>Sean


Re: we need a fix: precommit failures correlate to hdfs patches

Posted by Sean Busbey <bu...@cloudera.com>.
The patch artifact directory in the mainline Hadoop Jenkins jobs is
outside of the workspace. I'm not sure what, if anything, Jenkins
guarantees about files outside of the main workspace.

They all write to ${WORKSPACE}/../patchProcess, which will probably collide
if multiple runs happen on the same machine. They also all blindly move
that directory at the end of the run.
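
One way to avoid that collision, sketched under assumptions: give every build its own patch directory inside the workspace.  WORKSPACE and BUILD_TAG are standard Jenkins environment variables; the function name and the patchProcess-<tag> naming are hypothetical, and this is not necessarily the fix the project ends up adopting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a patch directory that is unique per build and
# lives inside the job's workspace instead of ${WORKSPACE}/../patchProcess.
patch_dir_for_build() {
  # WORKSPACE and BUILD_TAG are set by Jenkins; the fallbacks are only here
  # so the sketch also runs outside of Jenkins.
  echo "${WORKSPACE:-$PWD}/patchProcess-${BUILD_TAG:-local-$$}"
}

# Show where this build would keep its patch artifacts.
echo "patch artifacts would go to $(patch_dir_for_build)"
```

Because the directory name includes the build tag, two runs on the same machine no longer share a path, so one run moving or deleting its directory at the end cannot pull it out from under another.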

On Sun, May 3, 2015 at 3:02 PM, Allen Wittenauer <aw...@altiscale.com> wrote:

>
>         So, as some may have noticed, I slammed the Jenkins servers over
> the weekend to get some recent patch test runs in JIRA for the bug bash
> this week.  I've had a suspicion for a while now that either the long run
> times of the hadoop-hdfs module unit tests (typically 2+ hours) or the hdfs
> tests themselves were related to the patch process directory getting
> removed out from underneath test-patch.
>
>         To test the hypothesis, I submitted all of the non-HDFS patches so
> that they were first in the queue.  Let them run for a very long time.
> Jenkins bounced back and forth between YARN, MR, and HADOOP.  No issues
> encountered.  Added HDFS patches into the mix. BOOM. The dreaded "The patch
> artifact directory has been removed!" started to appear here and there.
> This seems to provide some evidence that, yes, hdfs unit tests are
> directly or indirectly related to the failures.
>
>         IMO, we need to take a serious look at:
>
>         * splitting up the hadoop-hdfs module into multiple modules to
> reduce unit test run times
>         * checking to see if the precommit hooks in hdfs are different
> from the rest (I do know that the YARN bits are different and appear to
> have some bugs as well)
>         * increasing the timeout for Jenkins job runs
>
>         FWIW, I’ve also found some minor things here and there with the
> rewritten test-patch.sh.  JIRAs have been filed.  One critical, one major
> and a handful of minor things.




-- 
Sean
