Posted to dev@spark.apache.org by Sean Owen <sr...@gmail.com> on 2019/04/11 11:26:23 UTC

Raise Jenkins test timeout? with alternatives

I have a big PR that keeps failing because it hits the 300-minute build timeout:

https://github.com/apache/spark/pull/24314
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4703/console

It's because it touches so much code that all tests run, including
things like Kinesis. It looks like 300 minutes isn't enough. We can raise
it to an eye-watering 360 minutes if that's just how long all tests
take.

I can also try splitting up the change to move out changes to a few
optional modules into separate PRs.

(Because this one makes it all the way through Python and Java tests
and almost all R tests several times, and doesn't touch Python or R
and shouldn't have any functional changes, I'm tempted to just merge
it, too, as a solution)

Thoughts?

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Raise Jenkins test timeout? with alternatives

Posted by Xin Lu <xl...@salesforce.com.INVALID>.

Yes, it is worth trying to run just one Jenkins job on each machine, but I
remember that at Databricks we actually did run just one job per machine and
the Spark tests still took hours. We used quite large EC2 instances, too. Now,
two years later, the number of tests has probably increased.

Xin

On Thu, Apr 11, 2019 at 11:15 AM Sean Owen <sr...@gmail.com> wrote:

> If the machines are bottlenecked on I/O or are swapping, doing less work
> concurrently would improve throughput, and parallelizing wouldn't. I don't
> know that it's the case, but am wondering out loud as the runtimes seem to
> vary by 20-30% sometimes. Naturally, having the option to parallelize is
> good as well, if those bottlenecks aren't actually a problem or are
> resolved otherwise.

Re: Raise Jenkins test timeout? with alternatives

Posted by Sean Owen <sr...@gmail.com>.

If the machines are bottlenecked on I/O or are swapping, doing less work
concurrently would improve throughput, and parallelizing wouldn't. I don't
know that it's the case, but am wondering out loud as the runtimes seem to
vary by 20-30% sometimes. Naturally, having the option to parallelize is
good as well, if those bottlenecks aren't actually a problem or are
resolved otherwise.

On Thu, Apr 11, 2019 at 1:10 PM Xin Lu <xl...@salesforce.com> wrote:

> I worked on parallelizing the tests two years ago. It does require an
> update to the amplab Jenkins, which is very old (1.651.3, released
> 2016-7-1). The current version of CloudBees Jenkins has stages, and it is
> not difficult to put tests in parallel stages and aggregate the test
> results. Reducing concurrent builds per machine would not, by itself,
> resolve the sheer length of tests running serially or the number of PRs.
>
> Xin

Re: Raise Jenkins test timeout? with alternatives

Posted by Xin Lu <xl...@salesforce.com.INVALID>.

I worked on parallelizing the tests two years ago. It does require an
update to the amplab Jenkins, which is very old (1.651.3, released
2016-7-1). The current version of CloudBees Jenkins has stages, and it is
not difficult to put tests in parallel stages and aggregate the test
results. Reducing concurrent builds per machine would not, by itself,
resolve the sheer length of tests running serially or the number of PRs.
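
As a rough illustration of the "parallel stages, then aggregate" idea, here is a
minimal Python sketch. It is not a Jenkins pipeline and not the Spark build
scripts: the stage names and commands in STAGES are hypothetical placeholders,
and a real setup would call the project's own test runner for one group of
modules per stage.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess
import sys

# Hypothetical stages: each maps a made-up name to a command line.
# The commands here are trivial stand-ins so the sketch can run anywhere.
STAGES = {
    "stage-a": [sys.executable, "-c", "print('stage a ok')"],
    "stage-b": [sys.executable, "-c", "print('stage b ok')"],
    "stage-c": [sys.executable, "-c", "import sys; sys.exit(1)"],
}


def run_stage(name, command):
    """Run one stage as a subprocess and return (name, exit code)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return name, completed.returncode


def run_all(stages, max_workers=3):
    """Run every stage concurrently, then aggregate the exit codes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_stage, n, c) for n, c in stages.items()]
        results = dict(future.result() for future in futures)
    # A build would be marked failed if any stage returned non-zero.
    failed = sorted(name for name, code in results.items() if code != 0)
    return results, failed
```

Calling run_all(STAGES) returns the per-stage exit codes plus the list of
failed stages. Threads are enough here because the work happens in child
processes; wall-clock time is then bounded by the slowest stage rather than by
the sum of all stages.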

Xin

On Thu, Apr 11, 2019 at 10:53 AM Sean Owen <sr...@gmail.com> wrote:

> Agree, and I can make a few of the ML regression tests faster pretty
> easily. Here the issue is more about what happens when you run every single
> test, and man, that does take a long time. It may be rare enough not to
> justify upping the build timeout. (The PR just barely passed this time
> anyway.)
>
> Q for Shane: we have a ton of build slots, but it seems like worker
> performance does slow down when there are multiple builds in progress. Is
> there any value in reducing the number of concurrent builds per machine,
> especially if we're not really using all of them? It might help load-balance
> more or something. I was also trying to figure out whether the workers were
> swapping but couldn't find an easy way to tell.

Re: Raise Jenkins test timeout? with alternatives

Posted by Sean Owen <sr...@gmail.com>.

Agree, and I can make a few of the ML regression tests faster pretty
easily. Here the issue is more about what happens when you run every single
test, and man, that does take a long time. It may be rare enough not to
justify upping the build timeout. (The PR just barely passed this time
anyway.)

Q for Shane: we have a ton of build slots, but it seems like worker
performance does slow down when there are multiple builds in progress. Is
there any value in reducing the number of concurrent builds per machine,
especially if we're not really using all of them? It might help load-balance
more or something. I was also trying to figure out whether the workers were
swapping but couldn't find an easy way to tell.
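
One possible way to check for swapping, assuming the workers run Linux (the
thread does not say), is to sample the cumulative pswpin and pswpout counters
in /proc/vmstat twice and compare them. The helper names below are made up for
this sketch.

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before, after):
    """Return pages swapped in and out between two parsed samples."""
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in ("pswpin", "pswpout")
    }


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the counters; only works on a Linux host."""
    with open(path) as handle:
        return parse_vmstat(handle.read())
```

On a worker, one would call read_vmstat(), wait a while during a build, call it
again, and pass both results to swap_activity(). Deltas that stay at zero
suggest the machine is not swapping; steadily growing deltas suggest it is.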

On Thu, Apr 11, 2019 at 11:00 AM Xiao Li <li...@databricks.com> wrote:

> Hi, Sean
>
> Your issue actually shows our existing test framework needs a change
> ASAP. We need to go over the tests listed in
> https://spark-tests.appspot.com/slow-tests and see whether we can reduce
> the time or run these test suites in parallel.
>
> Thanks,
>
> Xiao

Re: Raise Jenkins test timeout? with alternatives

Posted by Xiao Li <li...@databricks.com>.

Hi, Sean

Your issue actually shows our existing test framework needs a change ASAP.
We need to go over the tests listed in
https://spark-tests.appspot.com/slow-tests and see whether we can reduce
the time or run these test suites in parallel.
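
To make the "run these test suites in parallel" option concrete, here is a
small sketch of one way to split suites into balanced groups once their
durations are known: assign the longest suites first, each to the group with
the least work so far. The suite names and durations in the usage note are
invented for illustration and are not taken from the slow-tests page.

```python
import heapq


def partition_suites(durations, groups):
    """Greedy longest-first assignment of suites to parallel groups.

    durations maps suite name -> minutes; returns one (total, members)
    pair per group, in group order.
    """
    # Heap entries are (total minutes so far, group index, member list).
    heap = [(0.0, index, []) for index in range(groups)]
    heapq.heapify(heap)
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, minutes in ordered:
        total, index, members = heapq.heappop(heap)
        members.append(name)
        heapq.heappush(heap, (total + minutes, index, members))
    by_group = sorted(heap, key=lambda entry: entry[1])
    return [(total, members) for total, _index, members in by_group]
```

For example, partition_suites({"a": 30, "b": 20, "c": 10, "d": 10}, 2) places
"a" and "d" in one group (40 minutes) and "b" and "c" in the other (30
minutes), so the parallel run would take about 40 minutes instead of 70 when
run serially. This greedy rule is a heuristic and does not always find the
best possible split.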

Thanks,

Xiao



On Thu, Apr 11, 2019 at 4:26 AM Sean Owen <sr...@gmail.com> wrote:

> I have a big PR that keeps failing because it hits the 300-minute build
> timeout:
>
> https://github.com/apache/spark/pull/24314
>
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4703/console
>
> It's because it touches so much code that all tests run, including
> things like Kinesis. It looks like 300 minutes isn't enough. We can raise
> it to an eye-watering 360 minutes if that's just how long all tests
> take.
>
> I can also try splitting up the change to move out changes to a few
> optional modules into separate PRs.
>
> (Because this one makes it all the way through Python and Java tests
> and almost all R tests several times, and doesn't touch Python or R
> and shouldn't have any functional changes, I'm tempted to just merge
> it, too, as a solution)
>
> Thoughts?