Posted to dev@hive.apache.org by Jesus Camacho Rodriguez <jc...@apache.org> on 2018/05/11 20:27:42 UTC

[DISCUSS] Unsustainable situation with ptests

I believe we have reached a state (maybe we did reach it a while ago) that is not sustainable anymore, as there are so many tests failing / timing out that it is not possible to verify whether a patch is breaking some critical parts of the system or not. It also seems to me that due to the timeouts (maybe due to infra, maybe not), ptest runs are taking even longer than usual, which in turn creates an even longer queue of patches.

There is an ongoing effort to improve ptests usability (https://issues.apache.org/jira/browse/HIVE-19425), but apart from that, we need to make an effort to stabilize existing tests and bring that failure count to zero.

Hence, I am suggesting *we stop committing any patch before we get a green run*. If someone thinks this proposal is too radical, please come up with an alternative, because I do not think it is OK to have the ptest runs in their current state. Other projects of a certain size (e.g., Hadoop, Spark) are always green; we should be able to do the same.

Finally, once we get to zero failures, I suggest we be less tolerant of committing without getting a clean ptests run. If there is a failure, we need to fix it or revert the patch that caused it, and then continue developing.

Please, let’s all work together as a community to fix this issue; that is the only way to get to zero quickly.

Thanks,
Jesús

PS. I assume the flaky tests will come into the discussion. Let’s see first how many of those we have; then we can work to find a fix.



Re: [DISCUSS] Unsustainable situation with ptests

Posted by Siddharth Seth <ss...@apache.org>.
Very nice. There was an effort to get fast and green builds back in 2016.
There wasn't any strict "must be a green build" before commit at the time
though. Instead, jiras were filed and the expectation was that they'd be
cited / new ones created pre commit (looking at the jiras now - this was
likely followed for a while, many fixes, and eventually got annoying?).
Think the enforcement step is absolutely required to get to, and maintain a
green build. May want to consider performance characteristics of tests as
well - must complete within X seconds.
Jiras for reference (including test infra improvements which were not done
at the time): HIVE-13503, HIVE-15058, HIVE-14547
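
A minimal sketch of such a per-test time budget, assuming JUnit 4 (the class name and the 30-second limit are placeholders, not an agreed policy):

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.Timeout;

    public class TestWithTimeBudget {
        // Fail any test method in this class that exceeds the budget,
        // instead of letting it hang an entire ptest batch.
        @Rule
        public Timeout globalTimeout = Timeout.seconds(30);

        @Test
        public void shouldFinishQuickly() throws Exception {
            // ... test body ...
        }
    }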

This will be painful initially, but eventually it'll be great to be able to
commit without having to scan through a bunch of 'known failures', analyze,
document etc.

On Tue, May 15, 2018 at 5:30 PM, Prasanth Jayachandran <
pjayachandran@hortonworks.com> wrote:

> Wow! Awesome. This is the 3rd time I remember seeing green run in >4yrs. :)
>
> Thanks
> Prasanth
>
> > On May 15, 2018, at 5:28 PM, Jesus Camacho Rodriguez <
> jcamacho@apache.org> wrote:
> >
> > We have just had the first clean run in a while:
> > https://builds.apache.org/job/PreCommit-HIVE-Build/10971/testReport/
> >
> > I will continue monitoring follow-up runs.
> >
> > Thanks,
> > -Jesús
> >
> >
> > On 5/14/18, 11:28 PM, "Prasanth Jayachandran" <
> pjayachandran@hortonworks.com> wrote:
> >
> >    Wondering if we can add a state transition from “Patch Available” to
> “Ready To Commit” which can only be triggered by ptest bot on green test
> run.
> >
> >    Thanks
> >    Prasanth
> >
> >
> >
> >    On Mon, May 14, 2018 at 10:44 PM -0700, "Jesus Camacho Rodriguez" <
> jcamacho@apache.org> wrote:
> >
> >
> >    I have been working on fixing this situation while commits were still
> coming in.
> >
> >    All the tests that have been disabled are in:
> >    https://issues.apache.org/jira/browse/HIVE-19509
> >    I have created new issues to reenable each of them, they are linked
> to that issue.
> >    Maybe I was slightly aggressive disabling some of the tests, however
> that seemed to be the only way to bring the tests failures with age count >
> 1 to zero.
> >
> >    Instead of starting a vote to freeze the commits in another thread, I
> will start a vote to be stricter wrt committing to master, i.e., only
> commit if we get a clean QA run.
> >
> >    We can discuss more about this issue over there.
> >
> >    Thanks,
> >    Jesús
> >
> >
> >
> >    On 5/14/18, 4:11 PM, "Sergey Shelukhin"  wrote:
> >
> >        Can we please make this freeze conditional, i.e. we unfreeze
> automatically
> >        after ptest is clean (as evidenced by the clean HiveQA run on a
> given
> >        JIRA).
> >
> >        On 18/5/14, 15:16, "Alan Gates"  wrote:
> >
> >> We should do it in a separate thread so that people can see it with the
> >> [VOTE] subject.  Some people use that as a filter in their email to know
> >> when to pay attention to things.
> >>
> >> Alan.
> >>
> >> On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
> >> pjayachandran@hortonworks.com> wrote:
> >>
> >>> Will there be a separate voting thread? Or the voting on this thread is
> >>> sufficient for lock down?
> >>>
> >>> Thanks
> >>> Prasanth
> >>>
> >>>> On May 14, 2018, at 2:34 PM, Alan Gates  wrote:
> >>>>
> >>>> ​I see there's support for this, but people are still pouring in
> >>> commits.
> >>>> I proposed we have a quick vote on this to lock down the commits
> >>> until we
> >>>> get to green.  That way everyone knows we have drawn the line at a
> >>> specific
> >>>> point.  Any commits after that point would be reverted.  There isn't a
> >>>> category in the bylaws that fits this kind of vote but I suggest lazy
> >>>> majority as the most appropriate one (at least 3 votes, more +1s than
> >>>> -1s).
> >>>>
> >>>> Alan.​
> >>>>
> >>>> On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
> >>> vihang@cloudera.com>
> >>>> wrote:
> >>>>
> >>>>> I worked on a few quick-fix optimizations in Ptest infrastructure
> >>> over
> >>> the
> >>>>> weekend which reduced the execution run from ~90 min to ~70 min per
> >>> run. I
> >>>>> had to restart Ptest multiple times. I was resubmitting the patches
> >>> which
> >>>>> were in the queue manually, but I may have missed a few. In case you
> >>> have a
> >>>>> patch which is pending pre-commit and you don't see it in the queue,
> >>> please
> >>>>> submit it manually or let me know if you don't have access to the
> >>> jenkins
> >>>>> job. I will continue to work on the sub-tasks in HIVE-19425 and will
> >>> do
> >>>>> some maintenance next weekend as well.
> >>>>>
> >>>>> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
> >>>>> jcamacho@apache.org> wrote:
> >>>>>
> >>>>>> Vineet has already been working on disabling those tests that were
> >>> timing
> >>>>>> out. I am working on disabling those that are generating different q
> >>>>> files
> >>>>>> consistently for the last n ptest runs. I am keeping track of all these
> >>>>> tests
> >>>>>> in https://issues.apache.org/jira/browse/HIVE-19509.
> >>>>>>
> >>>>>> -Jesús
> >>>>>>
> >>>>>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
> >>>>>> pjayachandran@hortonworks.com> wrote:
> >>>>>>
> >>>>>>   +1 on freezing commits until we get repetitive green tests. We
> >>> should
> >>>>>> probably disable (and remember in a jira to reenable them at a later
> >>> point)
> >>>>>> tests that are flaky to get repetitive green test runs.
> >>>>>>
> >>>>>>   Thanks
> >>>>>>   Prasanth
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>   On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
> >>>>> lirui.fudan@gmail.com
> >>>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>   +1 to freezing commits until we stabilize
> >>>>>>
> >>>>>>   On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
> >>>>>>   wrote:
> >>>>>>
> >>>>>>> In order to understand the end-to-end precommit flow I would like
> >>>>> to
> >>>>>> get
> >>>>>>> access to the PreCommit-HIVE-Build jenkins script. Does anyone
> >>>>>> know how
> >>>>>>> I can get that?
> >>>>>>>
> >>>>>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
> >>>>>>> jcamacho@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Bq. For the short term green runs, I think we should @Ignore the
> >>>>>> tests
> >>>>>>>> which
> >>>>>>>> are known to be failing since many runs. They are anyways not
> >>>>> being
> >>>>>>>> addressed as such. If people think they are important to be run
> >>>>> we
> >>>>>> should
> >>>>>>>> fix them and only then re-enable them.
> >>>>>>>>
> >>>>>>>> I think that is a good idea, as we would minimize the time that
> >>>>> we
> >>>>>> halt
> >>>>>>>> development. We can create a JIRA where we list all tests that
> >>>>> were
> >>>>>>>> failing, and we have disabled to get the clean run. From that
> >>>>>> moment, we
> >>>>>>>> will have zero tolerance towards committing with failing tests.
> >>>>>> And we
> >>>>>>> need
> >>>>>>>> to pick up those tests that should not be ignored and bring them
> >>>>>> up again
> >>>>>>>> but passing. If there is no disagreement, I can start working on
> >>>>>> that.
> >>>>>>>>
> >>>>>>>> Once I am done, I can try to help with infra tickets too.
> >>>>>>>>
> >>>>>>>> -Jesús
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
> >>>>>>>>
> >>>>>>>>   +1. I strongly vote for freezing commits and getting our
> >>>>>> testing
> >>>>>>>> coverage in acceptable state.  We have been struggling to
> >>>>> stabilize
> >>>>>>>> branch-3 due to test failures and releasing Hive 3.0 in current
> >>>>>> state
> >>>>>>> would
> >>>>>>>> be unacceptable.
> >>>>>>>>
> >>>>>>>>   Currently there are quite a few test suites which are not
> >>>>> even
> >>>>>>> running
> >>>>>>>> and are being timed out. We have been committing patches (to both
> >>>>>>> branch-3
> >>>>>>>> and master) without test coverage for these tests.
> >>>>>>>>   We should immediately figure out what’s going on before we
> >>>>>> proceed
> >>>>>>>> with commits.
> >>>>>>>>
> >>>>>>>>   For reference following test suites are timing out on
> >>>>> master: (
> >>>>>>>> https://issues.apache.org/jira/browse/HIVE-19506)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>   TestDbNotificationListener - did not produce a TEST-*.xml
> >>>>> file
> >>>>>>> (likely
> >>>>>>>> timed out)
> >>>>>>>>
> >>>>>>>>   TestHCatHiveCompatibility - did not produce a TEST-*.xml file
> >>>>>> (likely
> >>>>>>>> timed out)
> >>>>>>>>
> >>>>>>>>   TestNegativeCliDriver - did not produce a TEST-*.xml file
> >>>>>> (likely
> >>>>>>>> timed out)
> >>>>>>>>
> >>>>>>>>   TestNonCatCallsWithCatalog - did not produce a TEST-*.xml
> >>>>> file
> >>>>>>> (likely
> >>>>>>>> timed out)
> >>>>>>>>
> >>>>>>>>   TestSequenceFileReadWrite - did not produce a TEST-*.xml file
> >>>>>> (likely
> >>>>>>>> timed out)
> >>>>>>>>
> >>>>>>>>   TestTxnExIm - did not produce a TEST-*.xml file (likely timed
> >>>>>> out)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>   Vineet
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>   On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
> >>>>>>> vihang@cloudera.com
> >>>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>   +1 There are many problems with the test infrastructure and
> >>>>> in
> >>>>>> my
> >>>>>>>> opinion
> >>>>>>>>   it has now become the number one bottleneck for the project. I
> >>>>> was
> >>>>>>> looking
> >>>>>>>> at
> >>>>>>>>   the infrastructure yesterday and I think the current
> >>>>>> infrastructure
> >>>>>>>> (even
> >>>>>>>>   with its own set of problems) is still under-utilized. I am
> >>>>>> planning to
> >>>>>>>> increase
> >>>>>>>>   the number of threads to process the parallel test batches to
> >>>>>> start
> >>>>>>>> with.
> >>>>>>>>   It needs a restart on the server side. I can do it now, if
> >>>>>> folks are
> >>>>>>>> okay
> >>>>>>>>   with it. Else I can do it over the weekend when the queue is
> >>>>> small.
> >>>>>>>>
> >>>>>>>>   I listed the improvements which I thought would be useful
> >>>>> under
> >>>>>>>>   https://issues.apache.org/jira/browse/HIVE-19425 but frankly
> >>>>>>> speaking
> >>>>>>>> I am
> >>>>>>>>   not able to devote as much time as I would like to on it. I
> >>>>>> would
> >>>>>>>>   appreciate it if folks who have some more time can help
> >>>>>> out.
> >>>>>>>>
> >>>>>>>>   I think to start with https://issues.apache.org/
> >>>>>>> jira/browse/HIVE-19429
> >>>>>>>> will
> >>>>>>>>   help a lot. We need to pack more test runs in parallel and
> >>>>>> containers
> >>>>>>>>   provide good isolation.
> >>>>>>>>
> >>>>>>>>   For the short term green runs, I think we should @Ignore the
> >>>>>> tests
> >>>>>>>> which
> >>>>>>>>   are known to be failing since many runs. They are anyways not
> >>>>>> being
> >>>>>>>>   addressed as such. If people think they are important to be
> >>>>>> run we
> >>>>>>>> should
> >>>>>>>>   fix them and only then re-enable them.
> >>>>>>>>
> >>>>>>>>   Also, I feel we need a light-weight test run which we can run
> >>>>>> locally
> >>>>>>>> before
> >>>>>>>>   submitting it for the full-suite. That way minor issues with
> >>>>>> the
> >>>>>>> patch
> >>>>>>>> can
> >>>>>>>>   be handled locally. Maybe create a profile which runs a
> >>>>>> subset of
> >>>>>>>>   important tests which are consistent. We can apply some label
> >>>>>> that
> >>>>>>>>   pre-checkin-local test runs are successful and only then we
> >>>>>> submit
> >>>>>>>> for the
> >>>>>>>>   full-suite.
> >>>>>>>>
> >>>>>>>>   More thoughts are welcome. Thanks for starting this
> >>>>>> conversation.
> >>>>>>>>
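
The quoted suggestion above of a quick local pre-check could be backed by JUnit 4 categories, roughly as sketched below (the marker interface and test names are illustrative, not an existing Hive convention):

    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class TestCoreParser {
        // Marker for the small, stable subset worth running locally
        // before submitting a patch to the full precommit suite.
        public interface PreCheckin {}

        @Category(PreCheckin.class)
        @Test
        public void testBasicSelectParses() throws Exception {
            // ... fast, deterministic check ...
        }

        @Test
        public void testLongRunningEdgeCase() throws Exception {
            // ... left to the full suite ...
        }
    }

Surefire's groups parameter, wired into a dedicated Maven profile, could then select only the PreCheckin category for the local run.
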
> >>>>>>>>   On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
> >>>>>>>>   jcamacho@apache.org> wrote:
> >>>>>>>>
> >>>>>>>>   I believe we have reached a state (maybe we did reach it a
> >>>>>> while ago)
> >>>>>>>> that
> >>>>>>>>   is not sustainable anymore, as there are so many tests
> >>>>> failing
> >>>>>> /
> >>>>>>>> timing out
> >>>>>>>>   that it is not possible to verify whether a patch is breaking
> >>>>>> some
> >>>>>>>> critical
> >>>>>>>>   parts of the system or not. It also seems to me that due to
> >>>>> the
> >>>>>>>> timeouts
> >>>>>>>>   (maybe due to infra, maybe not), ptest runs are taking even
> >>>>>> longer
> >>>>>>> than
> >>>>>>>>   usual, which in turn creates even longer queue of patches.
> >>>>>>>>
> >>>>>>>>   There is an ongoing effort to improve ptests usability (
> >>>>>>>>   https://issues.apache.org/jira/browse/HIVE-19425), but apart
> >>>>>> from
> >>>>>>>> that,
> >>>>>>>>   we need to make an effort to stabilize existing tests and
> >>>>>> bring that
> >>>>>>>>   failure count to zero.
> >>>>>>>>
> >>>>>>>>   Hence, I am suggesting *we stop committing any patch before
> >>>>> we
> >>>>>> get a
> >>>>>>>> green
> >>>>>>>>   run*. If someone thinks this proposal is too radical, please
> >>>>>> come up
> >>>>>>>> with
> >>>>>>>>   an alternative, because I do not think it is OK to have the
> >>>>>> ptest
> >>>>>>> runs
> >>>>>>>> in
> >>>>>>>>   their current state. Other projects of certain size (e.g.,
> >>>>>> Hadoop,
> >>>>>>>> Spark)
> >>>>>>>>   are always green, we should be able to do the same.
> >>>>>>>>
> >>>>>>>>   Finally, once we get to zero failures, I suggest we are less
> >>>>>> tolerant
> >>>>>>>> with
> >>>>>>>>   committing without getting a clean ptests run. If there is a
> >>>>>> failure,
> >>>>>>>> we
> >>>>>>>>   need to fix it or revert the patch that caused it, then we
> >>>>>> continue
> >>>>>>>>   developing.
> >>>>>>>>
> >>>>>>>>   Please, let’s all work together as a community to fix this
> >>>>>> issue,
> >>>>>>> that
> >>>>>>>> is
> >>>>>>>>   the only way to get to zero quickly.
> >>>>>>>>
> >>>>>>>>   Thanks,
> >>>>>>>>   Jesús
> >>>>>>>>
> >>>>>>>>   PS. I assume the flaky tests will come into the discussion.
> >>>>>> Let´s see
> >>>>>>>>   first how many of those we have, then we can work to find a
> >>>>>> fix.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>   --
> >>>>>>   Best regards!
> >>>>>>   Rui Li
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >
> >
> >
> >
> >
> >
> >
> >
>
>

Re: [DISCUSS] Unsustainable situation with ptests

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Wow! Awesome. This is the 3rd time I remember seeing green run in >4yrs. :)

Thanks
Prasanth


Re: [DISCUSS] Unsustainable situation with ptests

Posted by Jesus Camacho Rodriguez <jc...@apache.org>.
We have just had the first clean run in a while:
https://builds.apache.org/job/PreCommit-HIVE-Build/10971/testReport/

I will continue monitoring follow-up runs.

Thanks,
-Jesús




Re: [DISCUSS] Unsustainable situation with ptests

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Wondering if we can add a state transition from “Patch Available” to “Ready To Commit” which can only be triggered by the ptest bot on a green test run.

Thanks
Prasanth
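
One rough sketch of how a precommit bot could drive such a transition through JIRA's generic REST transition endpoint, assuming a "Ready To Commit" status had first been added to the workflow (the issue key, transition id, and credentials below are placeholders):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class MarkReadyToCommit {
        public static void main(String[] args) throws Exception {
            String issueKey = args.length > 0 ? args[0] : "HIVE-NNNNN"; // placeholder key
            String transitionId = "731"; // id of the hypothetical "Ready To Commit" transition
            String auth = Base64.getEncoder()
                    .encodeToString("bot-user:bot-password".getBytes(StandardCharsets.UTF_8));

            // JIRA's standard "transition issue" endpoint; the bot would call this
            // only after a run that reports zero test failures.
            URL url = new URL("https://issues.apache.org/jira/rest/api/2/issue/"
                    + issueKey + "/transitions");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            String body = "{\"transition\": {\"id\": \"" + transitionId + "\"}}";
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("JIRA responded with HTTP " + conn.getResponseCode());
        }
    }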




Re: [DISCUSS] Unsustainable situation with ptests

Posted by Jesus Camacho Rodriguez <jc...@apache.org>.
I have been working on fixing this situation while commits were still coming in.

All the tests that have been disabled are in:
https://issues.apache.org/jira/browse/HIVE-19509
I have created new issues to re-enable each of them; they are linked to that issue.
Maybe I was slightly aggressive in disabling some of the tests; however, that seemed to be the only way to bring the count of test failures with age > 1 to zero.

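For anyone following along, the disabling itself is mechanical. Below is a minimal, hypothetical sketch of what it looks like for a plain JUnit 4 test (the class and test names are made up; the real list of disabled tests lives under HIVE-19509), with the @Ignore message pointing back at the tracking JIRA so the re-enable work is easy to find:

    // Hypothetical illustration only -- not one of the actual disabled tests.
    import org.junit.Ignore;
    import org.junit.Test;

    public class TestFlakyExample {

      // JUnit 4 keeps the method compiled but skips it; removing @Ignore restores the test.
      @Ignore("Flaky; disabled to get a green ptest run, re-enable tracked under HIVE-19509")
      @Test
      public void testSomethingFlaky() throws Exception {
        // original assertions stay in place
      }
    }
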
Instead of starting a vote in another thread to freeze the commits, I will start a vote to be stricter about committing to master, i.e., we only commit if we get a clean QA run.

We can discuss more about this issue over there.

Thanks,
Jesús



On 5/14/18, 4:11 PM, "Sergey Shelukhin" <se...@hortonworks.com> wrote:

    Can we please make this freeze conditional, i.e. we unfreeze automatically
    after ptest is clean (as evidenced by the clean HiveQA run on a given
    JIRA).
    
    On 18/5/14, 15:16, "Alan Gates" <al...@gmail.com> wrote:
    
    >We should do it in a separate thread so that people can see it with the
    >[VOTE] subject.  Some people use that as a filter in their email to know
    >when to pay attention to things.
    >
    >Alan.
    >
    >On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
    >pjayachandran@hortonworks.com> wrote:
    >
    >> Will there be a separate voting thread? Or the voting on this thread is
    >> sufficient for lock down?
    >>
    >> Thanks
    >> Prasanth
    >>
    >> > On May 14, 2018, at 2:34 PM, Alan Gates <al...@gmail.com> wrote:
    >> >
    >> > ​I see there's support for this, but people are still pouring in
    >>commits.
    >> > I proposed we have a quick vote on this to lock down the commits
    >>until we
    >> > get to green.  That way everyone knows we have drawn the line at a
    >> specific
    >> > point.  Any commits after that point would be reverted.  There isn't a
    >> > category in the bylaws that fits this kind of vote but I suggest lazy
    >> > majority as the most appropriate one (at least 3 votes, more +1s than
    >> > -1s).
    >> >
    >> > Alan.​
    >> >
    >> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
    >> vihang@cloudera.com>
    >> > wrote:
    >> >
    >> >> I worked on a few quick-fix optimizations in Ptest infrastructure
    >>over
    >> the
    >> >> weekend which reduced the execution run from ~90 min to ~70 min per
    >> run. I
    >> >> had to restart Ptest multiple times. I was resubmitting the patches
    >> which
    >> >> were in the queue manually, but I may have missed a few. In case you
    >> have a
    >> >> patch which is pending pre-commit and you don't see it in the queue,
    >> please
    >> >> submit it manually or let me know if you don't have access to the
    >> jenkins
    >> >> job. I will continue to work on the sub-tasks in HIVE-19425 and will
    >>do
    >> >> some maintenance next weekend as well.
    >> >>
    >> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
    >> >> jcamacho@apache.org> wrote:
    >> >>
    >> >>> Vineet has already been working on disabling those tests that were
    >> timing
    >> >>> out. I am working on disabling those that are generating different q
    >> >> files
    >> >>> consistently for last ptests n runs. I am keeping track of all these
    >> >> tests
    >> >>> in https://issues.apache.org/jira/browse/HIVE-19509.
    >> >>>
    >> >>> -Jesús
    >> >>>
    >> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
    >> >>> pjayachandran@hortonworks.com> wrote:
    >> >>>
    >> >>>    +1 on freezing commits until we get repetitive green tests. We
    >> should
    >> >>> probably disable (and remember in a jira to reenable then at later
    >> point)
    >> >>> tests that are flaky to get repetitive green test runs.
    >> >>>
    >> >>>    Thanks
    >> >>>    Prasanth
    >> >>>
    >> >>>
    >> >>>
    >> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
    >> >> lirui.fudan@gmail.com
    >> >>> <ma...@gmail.com>> wrote:
    >> >>>
    >> >>>
    >> >>>    +1 to freezing commits until we stabilize
    >> >>>
    >> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
    >> >>>    wrote:
    >> >>>
    >> >>>> In order to understand the end-to-end precommit flow I would like
    >> >> to
    >> >>> get
    >> >>>> access to the PreCommit-HIVE-Build jenkins script. Does anyone one
    >> >>> know how
    >> >>>> can I get that?
    >> >>>>
    >> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
    >> >>>> jcamacho@apache.org> wrote:
    >> >>>>
    >> >>>>> Bq. For the short term green runs, I think we should @Ignore the
    >> >>> tests
    >> >>>>> which
    >> >>>>> are known to be failing since many runs. They are anyways not
    >> >> being
    >> >>>>> addressed as such. If people think they are important to be run
    >> >> we
    >> >>> should
    >> >>>>> fix them and only then re-enable them.
    >> >>>>>
    >> >>>>> I think that is a good idea, as we would minimize the time that
    >> >> we
    >> >>> halt
    >> >>>>> development. We can create a JIRA where we list all tests that
    >> >> were
    >> >>>>> failing, and we have disabled to get the clean run. From that
    >> >>> moment, we
    >> >>>>> will have zero tolerance towards committing with failing tests.
    >> >>> And we
    >> >>>> need
    >> >>>>> to pick up those tests that should not be ignored and bring them
    >> >>> up again
    >> >>>>> but passing. If there is no disagreement, I can start working on
    >> >>> that.
    >> >>>>>
    >> >>>>> Once I am done, I can try to help with infra tickets too.
    >> >>>>>
    >> >>>>> -Jesús
    >> >>>>>
    >> >>>>>
    >> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
    >> >>>>>
    >> >>>>>    +1. I strongly vote for freezing commits and getting our
    >> >>> testing
    >> >>>>> coverage in acceptable state.  We have been struggling to
    >> >> stabilize
    >> >>>>> branch-3 due to test failures and releasing Hive 3.0 in current
    >> >>> state
    >> >>>> would
    >> >>>>> be unacceptable.
    >> >>>>>
    >> >>>>>    Currently there are quite a few test suites which are not
    >> >> even
    >> >>>> running
    >> >>>>> and are being timed out. We have been committing patches (to both
    >> >>>> branch-3
    >> >>>>> and master) without test coverage for these tests.
    >> >>>>>    We should immediately figure out what’s going on before we
    >> >>> proceed
    >> >>>>> with commits.
    >> >>>>>
    >> >>>>>    For reference following test suites are timing out on
    >> >> master: (
    >> >>>>> https://issues.apache.org/jira/browse/HIVE-19506)
    >> >>>>>
    >> >>>>>
    >> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml
    >> >> file
    >> >>>> (likely
    >> >>>>> timed out)
    >> >>>>>
    >> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file
    >> >>> (likely
    >> >>>>> timed out)
    >> >>>>>
    >> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file
    >> >>> (likely
    >> >>>>> timed out)
    >> >>>>>
    >> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml
    >> >> file
    >> >>>> (likely
    >> >>>>> timed out)
    >> >>>>>
    >> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file
    >> >>> (likely
    >> >>>>> timed out)
    >> >>>>>
    >> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed
    >> >>> out)
    >> >>>>>
    >> >>>>>
    >> >>>>>    Vineet
    >> >>>>>
    >> >>>>>
    >> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
    >> >>>> vihang@cloudera.com
    >> >>>>>> wrote:
    >> >>>>>
    >> >>>>>    +1 There are many problems with the test infrastructure and
    >> >> in
    >> >>> my
    >> >>>>> opinion
    >> >>>>>    it has not become number one bottleneck for the project. I
    >> >> was
    >> >>>> looking
    >> >>>>> at
    >> >>>>>    the infrastructure yesterday and I think the current
    >> >>> infrastructure
    >> >>>>> (even
    >> >>>>>    its own set of problems) is still under-utilized. I am
    >> >>> planning to
    >> >>>>> increase
    >> >>>>>    the number of threads to process the parallel test batches to
    >> >>> start
    >> >>>>> with.
    >> >>>>>    It needs a restart on the server side. I can do it now, it
    >> >>> folks are
    >> >>>>> okay
    >> >>>>>    with it. Else I can do it over weekend when the queue is
    >> >> small.
    >> >>>>>
    >> >>>>>    I listed the improvements which I thought would be useful
    >> >> under
    >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425 but frankly
    >> >>>> speaking
    >> >>>>> I am
    >> >>>>>    not able to devote as much time as I would like to on it. I
    >> >>> would
    >> >>>>>    appreciate if folks who have some more time if they can help
    >> >>> out.
    >> >>>>>
    >> >>>>>    I think to start with https://issues.apache.org/
    >> >>>> jira/browse/HIVE-19429
    >> >>>>> will
    >> >>>>>    help a lot. We need to pack more test runs in parallel and
    >> >>> containers
    >> >>>>>    provide good isolation.
    >> >>>>>
    >> >>>>>    For the short term green runs, I think we should @Ignore the
    >> >>> tests
    >> >>>>> which
    >> >>>>>    are known to be failing since many runs. They are anyways not
    >> >>> being
    >> >>>>>    addressed as such. If people think they are important to be
    >> >>> run we
    >> >>>>> should
    >> >>>>>    fix them and only then re-enable them.
    >> >>>>>
    >> >>>>>    Also, I feel we need light-weight test run which we can run
    >> >>> locally
    >> >>>>> before
    >> >>>>>    submitting it for the full-suite. That way minor issues with
    >> >>> the
    >> >>>> patch
    >> >>>>> can
    >> >>>>>    be handled locally. May be create a profile which runs a
    >> >>> subset of
    >> >>>>>    important tests which are consistent. We can apply some label
    >> >>> that
    >> >>>>>    pre-checkin-local tests are runs successful and only then we
    >> >>> submit
    >> >>>>> for the
    >> >>>>>    full-suite.
    >> >>>>>
    >> >>>>>    More thoughts are welcome. Thanks for starting this
    >> >>> conversation.
    >> >>>>>
    >> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    >> >>>>>    jcamacho@apache.org> wrote:
    >> >>>>>
    >> >>>>>    I believe we have reached a state (maybe we did reach it a
    >> >>> while ago)
    >> >>>>> that
    >> >>>>>    is not sustainable anymore, as there are so many tests
    >> >> failing
    >> >>> /
    >> >>>>> timing out
    >> >>>>>    that it is not possible to verify whether a patch is breaking
    >> >>> some
    >> >>>>> critical
    >> >>>>>    parts of the system or not. It also seems to me that due to
    >> >> the
    >> >>>>> timeouts
    >> >>>>>    (maybe due to infra, maybe not), ptest runs are taking even
    >> >>> longer
    >> >>>> than
    >> >>>>>    usual, which in turn creates even longer queue of patches.
    >> >>>>>
    >> >>>>>    There is an ongoing effort to improve ptests usability (
    >> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425), but apart
    >> >>> from
    >> >>>>> that,
    >> >>>>>    we need to make an effort to stabilize existing tests and
    >> >>> bring that
    >> >>>>>    failure count to zero.
    >> >>>>>
    >> >>>>>    Hence, I am suggesting *we stop committing any patch before
    >> >> we
    >> >>> get a
    >> >>>>> green
    >> >>>>>    run*. If someone thinks this proposal is too radical, please
    >> >>> come up
    >> >>>>> with
    >> >>>>>    an alternative, because I do not think it is OK to have the
    >> >>> ptest
    >> >>>> runs
    >> >>>>> in
    >> >>>>>    their current state. Other projects of certain size (e.g.,
    >> >>> Hadoop,
    >> >>>>> Spark)
    >> >>>>>    are always green, we should be able to do the same.
    >> >>>>>
    >> >>>>>    Finally, once we get to zero failures, I suggest we are less
    >> >>> tolerant
    >> >>>>> with
    >> >>>>>    committing without getting a clean ptests run. If there is a
    >> >>> failure,
    >> >>>>> we
    >> >>>>>    need to fix it or revert the patch that caused it, then we
    >> >>> continue
    >> >>>>>    developing.
    >> >>>>>
    >> >>>>>    Please, let’s all work together as a community to fix this
    >> >>> issue,
    >> >>>> that
    >> >>>>> is
    >> >>>>>    the only way to get to zero quickly.
    >> >>>>>
    >> >>>>>    Thanks,
    >> >>>>>    Jesús
    >> >>>>>
    >> >>>>>    PS. I assume the flaky tests will come into the discussion.
    >> >>> Let´s see
    >> >>>>>    first how many of those we have, then we can work to find a
    >> >>> fix.
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>>
    >> >>>>
    >> >>>
    >> >>>
    >> >>>
    >> >>>    --
    >> >>>    Best regards!
    >> >>>    Rui Li
    >> >>>
    >> >>>
    >> >>>
    >> >>>
    >> >>>
    >> >>
    >>
    >>
    
    



Re: [DISCUSS] Unsustainable situation with ptests

Posted by Sergey Shelukhin <se...@hortonworks.com>.
Can we please make this freeze conditional, i.e., we unfreeze automatically
once ptest is clean (as evidenced by a clean HiveQA run on a given JIRA)?

On 18/5/14, 15:16, "Alan Gates" <al...@gmail.com> wrote:

>We should do it in a separate thread so that people can see it with the
>[VOTE] subject.  Some people use that as a filter in their email to know
>when to pay attention to things.
>
>Alan.
>
>On Mon, May 14, 2018 at 2:36 PM, Prasanth Jayachandran <
>pjayachandran@hortonworks.com> wrote:
>
>> Will there be a separate voting thread? Or the voting on this thread is
>> sufficient for lock down?
>>
>> Thanks
>> Prasanth
>>
>> > On May 14, 2018, at 2:34 PM, Alan Gates <al...@gmail.com> wrote:
>> >
>> > ​I see there's support for this, but people are still pouring in
>>commits.
>> > I proposed we have a quick vote on this to lock down the commits
>>until we
>> > get to green.  That way everyone knows we have drawn the line at a
>> specific
>> > point.  Any commits after that point would be reverted.  There isn't a
>> > category in the bylaws that fits this kind of vote but I suggest lazy
>> > majority as the most appropriate one (at least 3 votes, more +1s than
>> > -1s).
>> >
>> > Alan.​
>> >
>> > On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <
>> vihang@cloudera.com>
>> > wrote:
>> >
>> >> I worked on a few quick-fix optimizations in Ptest infrastructure
>>over
>> the
>> >> weekend which reduced the execution run from ~90 min to ~70 min per
>> run. I
>> >> had to restart Ptest multiple times. I was resubmitting the patches
>> which
>> >> were in the queue manually, but I may have missed a few. In case you
>> have a
>> >> patch which is pending pre-commit and you don't see it in the queue,
>> please
>> >> submit it manually or let me know if you don't have access to the
>> jenkins
>> >> job. I will continue to work on the sub-tasks in HIVE-19425 and will
>>do
>> >> some maintenance next weekend as well.
>> >>
>> >> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
>> >> jcamacho@apache.org> wrote:
>> >>
>> >>> Vineet has already been working on disabling those tests that were
>> timing
>> >>> out. I am working on disabling those that are generating different q
>> >> files
>> >>> consistently for last ptests n runs. I am keeping track of all these
>> >> tests
>> >>> in https://issues.apache.org/jira/browse/HIVE-19509.
>> >>>
>> >>> -Jesús
>> >>>
>> >>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
>> >>> pjayachandran@hortonworks.com> wrote:
>> >>>
>> >>>    +1 on freezing commits until we get repetitive green tests. We
>> should
>> >>> probably disable (and remember in a jira to reenable then at later
>> point)
>> >>> tests that are flaky to get repetitive green test runs.
>> >>>
>> >>>    Thanks
>> >>>    Prasanth
>> >>>
>> >>>
>> >>>
>> >>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <
>> >> lirui.fudan@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>
>> >>>
>> >>>    +1 to freezing commits until we stabilize
>> >>>
>> >>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
>> >>>    wrote:
>> >>>
>> >>>> In order to understand the end-to-end precommit flow I would like
>> >> to
>> >>> get
>> >>>> access to the PreCommit-HIVE-Build jenkins script. Does anyone one
>> >>> know how
>> >>>> can I get that?
>> >>>>
>> >>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
>> >>>> jcamacho@apache.org> wrote:
>> >>>>
>> >>>>> Bq. For the short term green runs, I think we should @Ignore the
>> >>> tests
>> >>>>> which
>> >>>>> are known to be failing since many runs. They are anyways not
>> >> being
>> >>>>> addressed as such. If people think they are important to be run
>> >> we
>> >>> should
>> >>>>> fix them and only then re-enable them.
>> >>>>>
>> >>>>> I think that is a good idea, as we would minimize the time that
>> >> we
>> >>> halt
>> >>>>> development. We can create a JIRA where we list all tests that
>> >> were
>> >>>>> failing, and we have disabled to get the clean run. From that
>> >>> moment, we
>> >>>>> will have zero tolerance towards committing with failing tests.
>> >>> And we
>> >>>> need
>> >>>>> to pick up those tests that should not be ignored and bring them
>> >>> up again
>> >>>>> but passing. If there is no disagreement, I can start working on
>> >>> that.
>> >>>>>
>> >>>>> Once I am done, I can try to help with infra tickets too.
>> >>>>>
>> >>>>> -Jesús
>> >>>>>
>> >>>>>
>> >>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
>> >>>>>
>> >>>>>    +1. I strongly vote for freezing commits and getting our
>> >>> testing
>> >>>>> coverage in acceptable state.  We have been struggling to
>> >> stabilize
>> >>>>> branch-3 due to test failures and releasing Hive 3.0 in current
>> >>> state
>> >>>> would
>> >>>>> be unacceptable.
>> >>>>>
>> >>>>>    Currently there are quite a few test suites which are not
>> >> even
>> >>>> running
>> >>>>> and are being timed out. We have been committing patches (to both
>> >>>> branch-3
>> >>>>> and master) without test coverage for these tests.
>> >>>>>    We should immediately figure out what’s going on before we
>> >>> proceed
>> >>>>> with commits.
>> >>>>>
>> >>>>>    For reference following test suites are timing out on
>> >> master: (
>> >>>>> https://issues.apache.org/jira/browse/HIVE-19506)
>> >>>>>
>> >>>>>
>> >>>>>    TestDbNotificationListener - did not produce a TEST-*.xml
>> >> file
>> >>>> (likely
>> >>>>> timed out)
>> >>>>>
>> >>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file
>> >>> (likely
>> >>>>> timed out)
>> >>>>>
>> >>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file
>> >>> (likely
>> >>>>> timed out)
>> >>>>>
>> >>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml
>> >> file
>> >>>> (likely
>> >>>>> timed out)
>> >>>>>
>> >>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file
>> >>> (likely
>> >>>>> timed out)
>> >>>>>
>> >>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed
>> >>> out)
>> >>>>>
>> >>>>>
>> >>>>>    Vineet
>> >>>>>
>> >>>>>
>> >>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
>> >>>> vihang@cloudera.com
>> >>>>>> wrote:
>> >>>>>
>> >>>>>    +1 There are many problems with the test infrastructure and
>> >> in
>> >>> my
>> >>>>> opinion
>> >>>>>    it has not become number one bottleneck for the project. I
>> >> was
>> >>>> looking
>> >>>>> at
>> >>>>>    the infrastructure yesterday and I think the current
>> >>> infrastructure
>> >>>>> (even
>> >>>>>    its own set of problems) is still under-utilized. I am
>> >>> planning to
>> >>>>> increase
>> >>>>>    the number of threads to process the parallel test batches to
>> >>> start
>> >>>>> with.
>> >>>>>    It needs a restart on the server side. I can do it now, it
>> >>> folks are
>> >>>>> okay
>> >>>>>    with it. Else I can do it over weekend when the queue is
>> >> small.
>> >>>>>
>> >>>>>    I listed the improvements which I thought would be useful
>> >> under
>> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425 but frankly
>> >>>> speaking
>> >>>>> I am
>> >>>>>    not able to devote as much time as I would like to on it. I
>> >>> would
>> >>>>>    appreciate if folks who have some more time if they can help
>> >>> out.
>> >>>>>
>> >>>>>    I think to start with https://issues.apache.org/
>> >>>> jira/browse/HIVE-19429
>> >>>>> will
>> >>>>>    help a lot. We need to pack more test runs in parallel and
>> >>> containers
>> >>>>>    provide good isolation.
>> >>>>>
>> >>>>>    For the short term green runs, I think we should @Ignore the
>> >>> tests
>> >>>>> which
>> >>>>>    are known to be failing since many runs. They are anyways not
>> >>> being
>> >>>>>    addressed as such. If people think they are important to be
>> >>> run we
>> >>>>> should
>> >>>>>    fix them and only then re-enable them.
>> >>>>>
>> >>>>>    Also, I feel we need light-weight test run which we can run
>> >>> locally
>> >>>>> before
>> >>>>>    submitting it for the full-suite. That way minor issues with
>> >>> the
>> >>>> patch
>> >>>>> can
>> >>>>>    be handled locally. May be create a profile which runs a
>> >>> subset of
>> >>>>>    important tests which are consistent. We can apply some label
>> >>> that
>> >>>>>    pre-checkin-local tests are runs successful and only then we
>> >>> submit
>> >>>>> for the
>> >>>>>    full-suite.
>> >>>>>
>> >>>>>    More thoughts are welcome. Thanks for starting this
>> >>> conversation.
>> >>>>>
>> >>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>> >>>>>    jcamacho@apache.org> wrote:
>> >>>>>
>> >>>>>    I believe we have reached a state (maybe we did reach it a
>> >>> while ago)
>> >>>>> that
>> >>>>>    is not sustainable anymore, as there are so many tests
>> >> failing
>> >>> /
>> >>>>> timing out
>> >>>>>    that it is not possible to verify whether a patch is breaking
>> >>> some
>> >>>>> critical
>> >>>>>    parts of the system or not. It also seems to me that due to
>> >> the
>> >>>>> timeouts
>> >>>>>    (maybe due to infra, maybe not), ptest runs are taking even
>> >>> longer
>> >>>> than
>> >>>>>    usual, which in turn creates even longer queue of patches.
>> >>>>>
>> >>>>>    There is an ongoing effort to improve ptests usability (
>> >>>>>    https://issues.apache.org/jira/browse/HIVE-19425), but apart
>> >>> from
>> >>>>> that,
>> >>>>>    we need to make an effort to stabilize existing tests and
>> >>> bring that
>> >>>>>    failure count to zero.
>> >>>>>
>> >>>>>    Hence, I am suggesting *we stop committing any patch before
>> >> we
>> >>> get a
>> >>>>> green
>> >>>>>    run*. If someone thinks this proposal is too radical, please
>> >>> come up
>> >>>>> with
>> >>>>>    an alternative, because I do not think it is OK to have the
>> >>> ptest
>> >>>> runs
>> >>>>> in
>> >>>>>    their current state. Other projects of certain size (e.g.,
>> >>> Hadoop,
>> >>>>> Spark)
>> >>>>>    are always green, we should be able to do the same.
>> >>>>>
>> >>>>>    Finally, once we get to zero failures, I suggest we are less
>> >>> tolerant
>> >>>>> with
>> >>>>>    committing without getting a clean ptests run. If there is a
>> >>> failure,
>> >>>>> we
>> >>>>>    need to fix it or revert the patch that caused it, then we
>> >>> continue
>> >>>>>    developing.
>> >>>>>
>> >>>>>    Please, let’s all work together as a community to fix this
>> >>> issue,
>> >>>> that
>> >>>>> is
>> >>>>>    the only way to get to zero quickly.
>> >>>>>
>> >>>>>    Thanks,
>> >>>>>    Jesús
>> >>>>>
>> >>>>>    PS. I assume the flaky tests will come into the discussion.
>> >>> Let´s see
>> >>>>>    first how many of those we have, then we can work to find a
>> >>> fix.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>    --
>> >>>    Best regards!
>> >>>    Rui Li
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>>
>>


Re: [DISCUSS] Unsustainable situation with ptests

Posted by Alan Gates <al...@gmail.com>.
We should do it in a separate thread so that people can see it with the
[VOTE] subject.  Some people use that as a filter in their email to know
when to pay attention to things.

Alan.


Re: [DISCUSS] Unsustainable situation with ptests

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Will there be a separate voting thread? Or is the voting on this thread sufficient for lockdown?

Thanks
Prasanth



Re: [DISCUSS] Unsustainable situation with ptests

Posted by Alan Gates <al...@gmail.com>.
I see there's support for this, but people are still pouring in commits.
I propose we have a quick vote on this to lock down the commits until we
get to green.  That way everyone knows we have drawn the line at a specific
point.  Any commits after that point would be reverted.  There isn't a
category in the bylaws that fits this kind of vote, but I suggest lazy
majority as the most appropriate one (at least 3 votes, more +1s than
-1s).

Alan.


Re: [DISCUSS] Unsustainable situation with ptests

Posted by Vihang Karajgaonkar <vi...@cloudera.com>.
I worked on a few quick-fix optimizations in the Ptest infrastructure over the
weekend, which reduced the execution time from ~90 min to ~70 min per run. I
had to restart Ptest multiple times. I was resubmitting the patches which
were in the queue manually, but I may have missed a few. In case you have a
patch which is pending pre-commit and you don't see it in the queue, please
submit it manually or let me know if you don't have access to the Jenkins
job. I will continue to work on the sub-tasks in HIVE-19425 and will do
some maintenance next weekend as well.
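
If it helps while the queue is being cleaned up, here is a rough sketch of resubmitting a patch by hand through Jenkins' buildWithParameters REST endpoint. The ISSUE_NUM parameter name and the JIRA id are placeholders (check the job configuration for the real parameter), and it assumes you have a Jenkins user and API token with permission to trigger the job:

    // Hedged sketch only; the ISSUE_NUM parameter name is an assumption, not the job's documented API.
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class ResubmitPrecommit {
      public static void main(String[] args) throws Exception {
        String user = args[0], token = args[1], issue = args[2];   // e.g. HIVE-19509
        URL url = new URL("https://builds.apache.org/job/PreCommit-HIVE-Build/buildWithParameters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
            .encodeToString((user + ":" + token).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setDoOutput(true);
        conn.getOutputStream().write(("ISSUE_NUM=" + issue).getBytes(StandardCharsets.UTF_8));
        // Jenkins answers 201 when the build has been queued.
        System.out.println("Jenkins responded with HTTP " + conn.getResponseCode());
      }
    }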

On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
jcamacho@apache.org> wrote:

> Vineet has already been working on disabling those tests that were timing
> out. I am working on disabling those that are generating different q files
> consistently for last ptests n runs. I am keeping track of all these tests
> in https://issues.apache.org/jira/browse/HIVE-19509.
>
> -Jesús
>
> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
> pjayachandran@hortonworks.com> wrote:
>
>     +1 on freezing commits until we get repetitive green tests. We should
> probably disable (and remember in a jira to reenable then at later point)
> tests that are flaky to get repetitive green test runs.
>
>     Thanks
>     Prasanth
>
>
>
>     On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fudan@gmail.com
> <ma...@gmail.com>> wrote:
>
>
>     +1 to freezing commits until we stabilize
>
>     On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
>     wrote:
>
>     > In order to understand the end-to-end precommit flow I would like to
> get
>     > access to the PreCommit-HIVE-Build jenkins script. Does anyone one
> know how
>     > can I get that?
>     >
>     > On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
>     > jcamacho@apache.org> wrote:
>     >
>     > > Bq. For the short term green runs, I think we should @Ignore the
> tests
>     > > which
>     > > are known to be failing since many runs. They are anyways not being
>     > > addressed as such. If people think they are important to be run we
> should
>     > > fix them and only then re-enable them.
>     > >
>     > > I think that is a good idea, as we would minimize the time that we
> halt
>     > > development. We can create a JIRA where we list all tests that were
>     > > failing, and we have disabled to get the clean run. From that
> moment, we
>     > > will have zero tolerance towards committing with failing tests.
> And we
>     > need
>     > > to pick up those tests that should not be ignored and bring them
> up again
>     > > but passing. If there is no disagreement, I can start working on
> that.
>     > >
>     > > Once I am done, I can try to help with infra tickets too.
>     > >
>     > > -Jesús
>     > >
>     > >
>     > > On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
>     > >
>     > >     +1. I strongly vote for freezing commits and getting our
> testing
>     > > coverage in acceptable state.  We have been struggling to stabilize
>     > > branch-3 due to test failures and releasing Hive 3.0 in current
> state
>     > would
>     > > be unacceptable.
>     > >
>     > >     Currently there are quite a few test suites which are not even
>     > running
>     > > and are being timed out. We have been committing patches (to both
>     > branch-3
>     > > and master) without test coverage for these tests.
>     > >     We should immediately figure out what’s going on before we
> proceed
>     > > with commits.
>     > >
>     > >     For reference following test suites are timing out on master: (
>     > > https://issues.apache.org/jira/browse/HIVE-19506)
>     > >
>     > >
>     > >     TestDbNotificationListener - did not produce a TEST-*.xml file
>     > (likely
>     > > timed out)
>     > >
>     > >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file
> (likely
>     > > timed out)
>     > >
>     > >     TestNegativeCliDriver - did not produce a TEST-*.xml file
> (likely
>     > > timed out)
>     > >
>     > >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file
>     > (likely
>     > > timed out)
>     > >
>     > >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file
> (likely
>     > > timed out)
>     > >
>     > >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed
> out)
>     > >
>     > >
>     > >     Vineet
>     > >
>     > >
>     > >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
>     > vihang@cloudera.com
>     > > > wrote:
>     > >
>     > >     +1 There are many problems with the test infrastructure and in
> my
>     > > opinion
>     > >     it has not become number one bottleneck for the project. I was
>     > looking
>     > > at
>     > >     the infrastructure yesterday and I think the current
> infrastructure
>     > > (even
>     > >     its own set of problems) is still under-utilized. I am
> planning to
>     > > increase
>     > >     the number of threads to process the parallel test batches to
> start
>     > > with.
>     > >     It needs a restart on the server side. I can do it now, it
> folks are
>     > > okay
>     > >     with it. Else I can do it over weekend when the queue is small.
>     > >
>     > >     I listed the improvements which I thought would be useful under
>     > >     https://issues.apache.org/jira/browse/HIVE-19425 but frankly
>     > speaking
>     > > I am
>     > >     not able to devote as much time as I would like to on it. I
> would
>     > >     appreciate if folks who have some more time if they can help
> out.
>     > >
>     > >     I think to start with https://issues.apache.org/
>     > jira/browse/HIVE-19429
>     > > will
>     > >     help a lot. We need to pack more test runs in parallel and
> containers
>     > >     provide good isolation.
>     > >
>     > >     For the short term green runs, I think we should @Ignore the
> tests
>     > > which
>     > >     are known to be failing since many runs. They are anyways not
> being
>     > >     addressed as such. If people think they are important to be
> run we
>     > > should
>     > >     fix them and only then re-enable them.
>     > >
>     > >     Also, I feel we need light-weight test run which we can run
> locally
>     > > before
>     > >     submitting it for the full-suite. That way minor issues with
> the
>     > patch
>     > > can
>     > >     be handled locally. May be create a profile which runs a
> subset of
>     > >     important tests which are consistent. We can apply some label
> that
>     > >     pre-checkin-local tests are runs successful and only then we
> submit
>     > > for the
>     > >     full-suite.
>     > >
>     > >     More thoughts are welcome. Thanks for starting this
> conversation.
>     > >
>     > >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>     > >     jcamacho@apache.org> wrote:
>     > >
>     > >     I believe we have reached a state (maybe we did reach it a
> while ago)
>     > > that
>     > >     is not sustainable anymore, as there are so many tests failing
> /
>     > > timing out
>     > >     that it is not possible to verify whether a patch is breaking
> some
>     > > critical
>     > >     parts of the system or not. It also seems to me that due to the
>     > > timeouts
>     > >     (maybe due to infra, maybe not), ptest runs are taking even
> longer
>     > than
>     > >     usual, which in turn creates even longer queue of patches.
>     > >
>     > >     There is an ongoing effort to improve ptests usability (
>     > >     https://issues.apache.org/jira/browse/HIVE-19425), but apart
> from
>     > > that,
>     > >     we need to make an effort to stabilize existing tests and
> bring that
>     > >     failure count to zero.
>     > >
>     > >     Hence, I am suggesting *we stop committing any patch before we
> get a
>     > > green
>     > >     run*. If someone thinks this proposal is too radical, please
> come up
>     > > with
>     > >     an alternative, because I do not think it is OK to have the
> ptest
>     > runs
>     > > in
>     > >     their current state. Other projects of certain size (e.g.,
> Hadoop,
>     > > Spark)
>     > >     are always green, we should be able to do the same.
>     > >
>     > >     Finally, once we get to zero failures, I suggest we are less
> tolerant
>     > > with
>     > >     committing without getting a clean ptests run. If there is a
> failure,
>     > > we
>     > >     need to fix it or revert the patch that caused it, then we
> continue
>     > >     developing.
>     > >
>     > >     Please, let’s all work together as a community to fix this
> issue,
>     > that
>     > > is
>     > >     the only way to get to zero quickly.
>     > >
>     > >     Thanks,
>     > >     Jesús
>     > >
>     > >     PS. I assume the flaky tests will come into the discussion.
> Let´s see
>     > >     first how many of those we have, then we can work to find a
> fix.
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >
>     > >
>     >
>
>
>
>     --
>     Best regards!
>     Rui Li
>
>
>
>
>

Re: [DISCUSS] Unsustainable situation with ptests

Posted by Jesus Camacho Rodriguez <jc...@apache.org>.
Vineet has already been working on disabling those tests that were timing out. I am working on disabling those that have been producing different q file results consistently over the last n ptest runs. I am keeping track of all these tests in https://issues.apache.org/jira/browse/HIVE-19509.

-Jesús

On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <pj...@hortonworks.com> wrote:

    +1 on freezing commits until we get repetitive green tests. We should probably disable (and remember in a jira to reenable then at later point) tests that are flaky to get repetitive green test runs.
    
    Thanks
    Prasanth
    
    
    
    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <li...@gmail.com>> wrote:
    
    
    +1 to freezing commits until we stabilize
    
    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
    wrote:
    
    > In order to understand the end-to-end precommit flow I would like to get
    > access to the PreCommit-HIVE-Build jenkins script. Does anyone one know how
    > can I get that?
    >
    > On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
    > jcamacho@apache.org> wrote:
    >
    > > Bq. For the short term green runs, I think we should @Ignore the tests
    > > which
    > > are known to be failing since many runs. They are anyways not being
    > > addressed as such. If people think they are important to be run we should
    > > fix them and only then re-enable them.
    > >
    > > I think that is a good idea, as we would minimize the time that we halt
    > > development. We can create a JIRA where we list all tests that were
    > > failing, and we have disabled to get the clean run. From that moment, we
    > > will have zero tolerance towards committing with failing tests. And we
    > need
    > > to pick up those tests that should not be ignored and bring them up again
    > > but passing. If there is no disagreement, I can start working on that.
    > >
    > > Once I am done, I can try to help with infra tickets too.
    > >
    > > -Jesús
    > >
    > >
    > > On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
    > >
    > >     +1. I strongly vote for freezing commits and getting our testing
    > > coverage in acceptable state.  We have been struggling to stabilize
    > > branch-3 due to test failures and releasing Hive 3.0 in current state
    > would
    > > be unacceptable.
    > >
    > >     Currently there are quite a few test suites which are not even
    > running
    > > and are being timed out. We have been committing patches (to both
    > branch-3
    > > and master) without test coverage for these tests.
    > >     We should immediately figure out what’s going on before we proceed
    > > with commits.
    > >
    > >     For reference following test suites are timing out on master: (
    > > https://issues.apache.org/jira/browse/HIVE-19506)
    > >
    > >
    > >     TestDbNotificationListener - did not produce a TEST-*.xml file
    > (likely
    > > timed out)
    > >
    > >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely
    > > timed out)
    > >
    > >     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely
    > > timed out)
    > >
    > >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file
    > (likely
    > > timed out)
    > >
    > >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely
    > > timed out)
    > >
    > >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >
    > >     Vineet
    > >
    > >
    > >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
    > vihang@cloudera.com
    > > > wrote:
    > >
    > >     +1 There are many problems with the test infrastructure and in my
    > > opinion
    > >     it has not become number one bottleneck for the project. I was
    > looking
    > > at
    > >     the infrastructure yesterday and I think the current infrastructure
    > > (even
    > >     its own set of problems) is still under-utilized. I am planning to
    > > increase
    > >     the number of threads to process the parallel test batches to start
    > > with.
    > >     It needs a restart on the server side. I can do it now, it folks are
    > > okay
    > >     with it. Else I can do it over weekend when the queue is small.
    > >
    > >     I listed the improvements which I thought would be useful under
    > >     https://issues.apache.org/jira/browse/HIVE-19425 but frankly
    > speaking
    > > I am
    > >     not able to devote as much time as I would like to on it. I would
    > >     appreciate if folks who have some more time if they can help out.
    > >
    > >     I think to start with https://issues.apache.org/
    > jira/browse/HIVE-19429
    > > will
    > >     help a lot. We need to pack more test runs in parallel and containers
    > >     provide good isolation.
    > >
    > >     For the short term green runs, I think we should @Ignore the tests
    > > which
    > >     are known to be failing since many runs. They are anyways not being
    > >     addressed as such. If people think they are important to be run we
    > > should
    > >     fix them and only then re-enable them.
    > >
    > >     Also, I feel we need light-weight test run which we can run locally
    > > before
    > >     submitting it for the full-suite. That way minor issues with the
    > patch
    > > can
    > >     be handled locally. May be create a profile which runs a subset of
    > >     important tests which are consistent. We can apply some label that
    > >     pre-checkin-local tests are runs successful and only then we submit
    > > for the
    > >     full-suite.
    > >
    > >     More thoughts are welcome. Thanks for starting this conversation.
    > >
    > >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    > >     jcamacho@apache.org> wrote:
    > >
    > >     I believe we have reached a state (maybe we did reach it a while ago)
    > > that
    > >     is not sustainable anymore, as there are so many tests failing /
    > > timing out
    > >     that it is not possible to verify whether a patch is breaking some
    > > critical
    > >     parts of the system or not. It also seems to me that due to the
    > > timeouts
    > >     (maybe due to infra, maybe not), ptest runs are taking even longer
    > than
    > >     usual, which in turn creates even longer queue of patches.
    > >
    > >     There is an ongoing effort to improve ptests usability (
    > >     https://issues.apache.org/jira/browse/HIVE-19425), but apart from
    > > that,
    > >     we need to make an effort to stabilize existing tests and bring that
    > >     failure count to zero.
    > >
    > >     Hence, I am suggesting *we stop committing any patch before we get a
    > > green
    > >     run*. If someone thinks this proposal is too radical, please come up
    > > with
    > >     an alternative, because I do not think it is OK to have the ptest
    > runs
    > > in
    > >     their current state. Other projects of certain size (e.g., Hadoop,
    > > Spark)
    > >     are always green, we should be able to do the same.
    > >
    > >     Finally, once we get to zero failures, I suggest we are less tolerant
    > > with
    > >     committing without getting a clean ptests run. If there is a failure,
    > > we
    > >     need to fix it or revert the patch that caused it, then we continue
    > >     developing.
    > >
    > >     Please, let’s all work together as a community to fix this issue,
    > that
    > > is
    > >     the only way to get to zero quickly.
    > >
    > >     Thanks,
    > >     Jesús
    > >
    > >     PS. I assume the flaky tests will come into the discussion. Let´s see
    > >     first how many of those we have, then we can work to find a fix.
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    >
    
    
    
    --
    Best regards!
    Rui Li
    
    



Re: [DISCUSS] Unsustainable situation with ptests

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
+1 on freezing commits until we get repeated green test runs. We should probably disable flaky tests (and file a jira so we remember to reenable them at a later point) in order to get repeated green test runs.

Thanks
Prasanth



On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <li...@gmail.com>> wrote:


+1 to freezing commits until we stabilize

On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
wrote:

> In order to understand the end-to-end precommit flow I would like to get
> access to the PreCommit-HIVE-Build jenkins script. Does anyone one know how
> can I get that?
>
> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
> jcamacho@apache.org> wrote:
>
> > Bq. For the short term green runs, I think we should @Ignore the tests
> > which
> > are known to be failing since many runs. They are anyways not being
> > addressed as such. If people think they are important to be run we should
> > fix them and only then re-enable them.
> >
> > I think that is a good idea, as we would minimize the time that we halt
> > development. We can create a JIRA where we list all tests that were
> > failing, and we have disabled to get the clean run. From that moment, we
> > will have zero tolerance towards committing with failing tests. And we
> need
> > to pick up those tests that should not be ignored and bring them up again
> > but passing. If there is no disagreement, I can start working on that.
> >
> > Once I am done, I can try to help with infra tickets too.
> >
> > -Jesús
> >
> >
> > On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
> >
> >     +1. I strongly vote for freezing commits and getting our testing
> > coverage in acceptable state.  We have been struggling to stabilize
> > branch-3 due to test failures and releasing Hive 3.0 in current state
> would
> > be unacceptable.
> >
> >     Currently there are quite a few test suites which are not even
> running
> > and are being timed out. We have been committing patches (to both
> branch-3
> > and master) without test coverage for these tests.
> >     We should immediately figure out what’s going on before we proceed
> > with commits.
> >
> >     For reference following test suites are timing out on master: (
> > https://issues.apache.org/jira/browse/HIVE-19506)
> >
> >
> >     TestDbNotificationListener - did not produce a TEST-*.xml file
> (likely
> > timed out)
> >
> >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file
> (likely
> > timed out)
> >
> >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
> >
> >
> >     Vineet
> >
> >
> >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
> vihang@cloudera.com
> > > wrote:
> >
> >     +1 There are many problems with the test infrastructure and in my
> > opinion
> >     it has not become number one bottleneck for the project. I was
> looking
> > at
> >     the infrastructure yesterday and I think the current infrastructure
> > (even
> >     its own set of problems) is still under-utilized. I am planning to
> > increase
> >     the number of threads to process the parallel test batches to start
> > with.
> >     It needs a restart on the server side. I can do it now, it folks are
> > okay
> >     with it. Else I can do it over weekend when the queue is small.
> >
> >     I listed the improvements which I thought would be useful under
> >     https://issues.apache.org/jira/browse/HIVE-19425 but frankly
> speaking
> > I am
> >     not able to devote as much time as I would like to on it. I would
> >     appreciate if folks who have some more time if they can help out.
> >
> >     I think to start with https://issues.apache.org/
> jira/browse/HIVE-19429
> > will
> >     help a lot. We need to pack more test runs in parallel and containers
> >     provide good isolation.
> >
> >     For the short term green runs, I think we should @Ignore the tests
> > which
> >     are known to be failing since many runs. They are anyways not being
> >     addressed as such. If people think they are important to be run we
> > should
> >     fix them and only then re-enable them.
> >
> >     Also, I feel we need light-weight test run which we can run locally
> > before
> >     submitting it for the full-suite. That way minor issues with the
> patch
> > can
> >     be handled locally. May be create a profile which runs a subset of
> >     important tests which are consistent. We can apply some label that
> >     pre-checkin-local tests are runs successful and only then we submit
> > for the
> >     full-suite.
> >
> >     More thoughts are welcome. Thanks for starting this conversation.
> >
> >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
> >     jcamacho@apache.org> wrote:
> >
> >     I believe we have reached a state (maybe we did reach it a while ago)
> > that
> >     is not sustainable anymore, as there are so many tests failing /
> > timing out
> >     that it is not possible to verify whether a patch is breaking some
> > critical
> >     parts of the system or not. It also seems to me that due to the
> > timeouts
> >     (maybe due to infra, maybe not), ptest runs are taking even longer
> than
> >     usual, which in turn creates even longer queue of patches.
> >
> >     There is an ongoing effort to improve ptests usability (
> >     https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> > that,
> >     we need to make an effort to stabilize existing tests and bring that
> >     failure count to zero.
> >
> >     Hence, I am suggesting *we stop committing any patch before we get a
> > green
> >     run*. If someone thinks this proposal is too radical, please come up
> > with
> >     an alternative, because I do not think it is OK to have the ptest
> runs
> > in
> >     their current state. Other projects of certain size (e.g., Hadoop,
> > Spark)
> >     are always green, we should be able to do the same.
> >
> >     Finally, once we get to zero failures, I suggest we are less tolerant
> > with
> >     committing without getting a clean ptests run. If there is a failure,
> > we
> >     need to fix it or revert the patch that caused it, then we continue
> >     developing.
> >
> >     Please, let’s all work together as a community to fix this issue,
> that
> > is
> >     the only way to get to zero quickly.
> >
> >     Thanks,
> >     Jesús
> >
> >     PS. I assume the flaky tests will come into the discussion. Let´s see
> >     first how many of those we have, then we can work to find a fix.
> >
> >
> >
> >
> >
> >
> >
> >
>



--
Best regards!
Rui Li


Re: [DISCUSS] Unsustainable situation with ptests

Posted by Rui Li <li...@gmail.com>.
+1 to freezing commits until we stabilize

On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar <vi...@cloudera.com>
wrote:

> In order to understand the end-to-end precommit flow I would like to get
> access to the PreCommit-HIVE-Build jenkins script. Does anyone one know how
> can I get that?
>
> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
> jcamacho@apache.org> wrote:
>
> > Bq. For the short term green runs, I think we should @Ignore the tests
> > which
> > are known to be failing since many runs. They are anyways not being
> > addressed as such. If people think they are important to be run we should
> > fix them and only then re-enable them.
> >
> > I think that is a good idea, as we would minimize the time that we halt
> > development. We can create a JIRA where we list all tests that were
> > failing, and we have disabled to get the clean run. From that moment, we
> > will have zero tolerance towards committing with failing tests. And we
> need
> > to pick up those tests that should not be ignored and bring them up again
> > but passing. If there is no disagreement, I can start working on that.
> >
> > Once I am done, I can try to help with infra tickets too.
> >
> > -Jesús
> >
> >
> > On 5/11/18, 1:57 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:
> >
> >     +1. I strongly vote for freezing commits and getting our testing
> > coverage in acceptable state.  We have been struggling to stabilize
> > branch-3 due to test failures and releasing Hive 3.0 in current state
> would
> > be unacceptable.
> >
> >     Currently there are quite a few test suites which are not even
> running
> > and are being timed out. We have been committing patches (to both
> branch-3
> > and master) without test coverage for these tests.
> >     We should immediately figure out what’s going on before we proceed
> > with commits.
> >
> >     For reference following test suites are timing out on master: (
> > https://issues.apache.org/jira/browse/HIVE-19506)
> >
> >
> >     TestDbNotificationListener - did not produce a TEST-*.xml file
> (likely
> > timed out)
> >
> >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file
> (likely
> > timed out)
> >
> >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely
> > timed out)
> >
> >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
> >
> >
> >     Vineet
> >
> >
> >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <
> vihang@cloudera.com
> > <ma...@cloudera.com>> wrote:
> >
> >     +1 There are many problems with the test infrastructure and in my
> > opinion
> >     it has not become number one bottleneck for the project. I was
> looking
> > at
> >     the infrastructure yesterday and I think the current infrastructure
> > (even
> >     its own set of problems) is still under-utilized. I am planning to
> > increase
> >     the number of threads to process the parallel test batches to start
> > with.
> >     It needs a restart on the server side. I can do it now, it folks are
> > okay
> >     with it. Else I can do it over weekend when the queue is small.
> >
> >     I listed the improvements which I thought would be useful under
> >     https://issues.apache.org/jira/browse/HIVE-19425 but frankly
> speaking
> > I am
> >     not able to devote as much time as I would like to on it. I would
> >     appreciate if folks who have some more time if they can help out.
> >
> >     I think to start with https://issues.apache.org/
> jira/browse/HIVE-19429
> > will
> >     help a lot. We need to pack more test runs in parallel and containers
> >     provide good isolation.
> >
> >     For the short term green runs, I think we should @Ignore the tests
> > which
> >     are known to be failing since many runs. They are anyways not being
> >     addressed as such. If people think they are important to be run we
> > should
> >     fix them and only then re-enable them.
> >
> >     Also, I feel we need light-weight test run which we can run locally
> > before
> >     submitting it for the full-suite. That way minor issues with the
> patch
> > can
> >     be handled locally. May be create a profile which runs a subset of
> >     important tests which are consistent. We can apply some label that
> >     pre-checkin-local tests are runs successful and only then we submit
> > for the
> >     full-suite.
> >
> >     More thoughts are welcome. Thanks for starting this conversation.
> >
> >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
> >     jcamacho@apache.org<ma...@apache.org>> wrote:
> >
> >     I believe we have reached a state (maybe we did reach it a while ago)
> > that
> >     is not sustainable anymore, as there are so many tests failing /
> > timing out
> >     that it is not possible to verify whether a patch is breaking some
> > critical
> >     parts of the system or not. It also seems to me that due to the
> > timeouts
> >     (maybe due to infra, maybe not), ptest runs are taking even longer
> than
> >     usual, which in turn creates even longer queue of patches.
> >
> >     There is an ongoing effort to improve ptests usability (
> >     https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> > that,
> >     we need to make an effort to stabilize existing tests and bring that
> >     failure count to zero.
> >
> >     Hence, I am suggesting *we stop committing any patch before we get a
> > green
> >     run*. If someone thinks this proposal is too radical, please come up
> > with
> >     an alternative, because I do not think it is OK to have the ptest
> runs
> > in
> >     their current state. Other projects of certain size (e.g., Hadoop,
> > Spark)
> >     are always green, we should be able to do the same.
> >
> >     Finally, once we get to zero failures, I suggest we are less tolerant
> > with
> >     committing without getting a clean ptests run. If there is a failure,
> > we
> >     need to fix it or revert the patch that caused it, then we continue
> >     developing.
> >
> >     Please, let’s all work together as a community to fix this issue,
> that
> > is
> >     the only way to get to zero quickly.
> >
> >     Thanks,
> >     Jesús
> >
> >     PS. I assume the flaky tests will come into the discussion. Let´s see
> >     first how many of those we have, then we can work to find a fix.
> >
> >
> >
> >
> >
> >
> >
> >
>



-- 
Best regards!
Rui Li

Re: [DISCUSS] Unsustainable situation with ptests

Posted by Vihang Karajgaonkar <vi...@cloudera.com>.
In order to understand the end-to-end precommit flow I would like to get
access to the PreCommit-HIVE-Build jenkins script. Does anyone know how
I can get that?

On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
jcamacho@apache.org> wrote:

> Bq. For the short term green runs, I think we should @Ignore the tests
> which
> are known to be failing since many runs. They are anyways not being
> addressed as such. If people think they are important to be run we should
> fix them and only then re-enable them.
>
> I think that is a good idea, as we would minimize the time that we halt
> development. We can create a JIRA where we list all tests that were
> failing, and we have disabled to get the clean run. From that moment, we
> will have zero tolerance towards committing with failing tests. And we need
> to pick up those tests that should not be ignored and bring them up again
> but passing. If there is no disagreement, I can start working on that.
>
> Once I am done, I can try to help with infra tickets too.
>
> -Jesús
>
>
> On 5/11/18, 1:57 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:
>
>     +1. I strongly vote for freezing commits and getting our testing
> coverage in acceptable state.  We have been struggling to stabilize
> branch-3 due to test failures and releasing Hive 3.0 in current state would
> be unacceptable.
>
>     Currently there are quite a few test suites which are not even running
> and are being timed out. We have been committing patches (to both branch-3
> and master) without test coverage for these tests.
>     We should immediately figure out what’s going on before we proceed
> with commits.
>
>     For reference following test suites are timing out on master: (
> https://issues.apache.org/jira/browse/HIVE-19506)
>
>
>     TestDbNotificationListener - did not produce a TEST-*.xml file (likely
> timed out)
>
>     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely
> timed out)
>
>     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely
> timed out)
>
>     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely
> timed out)
>
>     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely
> timed out)
>
>     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
>
>
>     Vineet
>
>
>     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vihang@cloudera.com
> <ma...@cloudera.com>> wrote:
>
>     +1 There are many problems with the test infrastructure and in my
> opinion
>     it has not become number one bottleneck for the project. I was looking
> at
>     the infrastructure yesterday and I think the current infrastructure
> (even
>     its own set of problems) is still under-utilized. I am planning to
> increase
>     the number of threads to process the parallel test batches to start
> with.
>     It needs a restart on the server side. I can do it now, it folks are
> okay
>     with it. Else I can do it over weekend when the queue is small.
>
>     I listed the improvements which I thought would be useful under
>     https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking
> I am
>     not able to devote as much time as I would like to on it. I would
>     appreciate if folks who have some more time if they can help out.
>
>     I think to start with https://issues.apache.org/jira/browse/HIVE-19429
> will
>     help a lot. We need to pack more test runs in parallel and containers
>     provide good isolation.
>
>     For the short term green runs, I think we should @Ignore the tests
> which
>     are known to be failing since many runs. They are anyways not being
>     addressed as such. If people think they are important to be run we
> should
>     fix them and only then re-enable them.
>
>     Also, I feel we need light-weight test run which we can run locally
> before
>     submitting it for the full-suite. That way minor issues with the patch
> can
>     be handled locally. May be create a profile which runs a subset of
>     important tests which are consistent. We can apply some label that
>     pre-checkin-local tests are runs successful and only then we submit
> for the
>     full-suite.
>
>     More thoughts are welcome. Thanks for starting this conversation.
>
>     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>     jcamacho@apache.org<ma...@apache.org>> wrote:
>
>     I believe we have reached a state (maybe we did reach it a while ago)
> that
>     is not sustainable anymore, as there are so many tests failing /
> timing out
>     that it is not possible to verify whether a patch is breaking some
> critical
>     parts of the system or not. It also seems to me that due to the
> timeouts
>     (maybe due to infra, maybe not), ptest runs are taking even longer than
>     usual, which in turn creates even longer queue of patches.
>
>     There is an ongoing effort to improve ptests usability (
>     https://issues.apache.org/jira/browse/HIVE-19425), but apart from
> that,
>     we need to make an effort to stabilize existing tests and bring that
>     failure count to zero.
>
>     Hence, I am suggesting *we stop committing any patch before we get a
> green
>     run*. If someone thinks this proposal is too radical, please come up
> with
>     an alternative, because I do not think it is OK to have the ptest runs
> in
>     their current state. Other projects of certain size (e.g., Hadoop,
> Spark)
>     are always green, we should be able to do the same.
>
>     Finally, once we get to zero failures, I suggest we are less tolerant
> with
>     committing without getting a clean ptests run. If there is a failure,
> we
>     need to fix it or revert the patch that caused it, then we continue
>     developing.
>
>     Please, let’s all work together as a community to fix this issue, that
> is
>     the only way to get to zero quickly.
>
>     Thanks,
>     Jesús
>
>     PS. I assume the flaky tests will come into the discussion. Let´s see
>     first how many of those we have, then we can work to find a fix.
>
>
>
>
>
>
>
>

Re: [DISCUSS] Unsustainable situation with ptests

Posted by Jesus Camacho Rodriguez <jc...@apache.org>.
Bq. For the short term green runs, I think we should @Ignore the tests which
are known to be failing since many runs. They are anyways not being
addressed as such. If people think they are important to be run we should
fix them and only then re-enable them.

I think that is a good idea, as it would minimize the time during which we halt development. We can create a JIRA listing all the tests that were failing and that we disabled to get the clean run. From that moment on, we will have zero tolerance towards committing with failing tests. And we need to pick up those tests that should not stay ignored and bring them back in a passing state. If there is no disagreement, I can start working on that.

Once I am done, I can try to help with infra tickets too.

-Jesús


On 5/11/18, 1:57 PM, "Vineet Garg" <vg...@hortonworks.com> wrote:

    +1. I strongly vote for freezing commits and getting our testing coverage in acceptable state.  We have been struggling to stabilize branch-3 due to test failures and releasing Hive 3.0 in current state would be unacceptable.
    
    Currently there are quite a few test suites which are not even running and are being timed out. We have been committing patches (to both branch-3 and master) without test coverage for these tests.
    We should immediately figure out what’s going on before we proceed with commits.
    
    For reference following test suites are timing out on master: (https://issues.apache.org/jira/browse/HIVE-19506)
    
    
    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
    
    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
    
    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
    
    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
    
    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
    
    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    
    
    Vineet
    
    
    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vi...@cloudera.com>> wrote:
    
    +1 There are many problems with the test infrastructure and in my opinion
    it has not become number one bottleneck for the project. I was looking at
    the infrastructure yesterday and I think the current infrastructure (even
    its own set of problems) is still under-utilized. I am planning to increase
    the number of threads to process the parallel test batches to start with.
    It needs a restart on the server side. I can do it now, it folks are okay
    with it. Else I can do it over weekend when the queue is small.
    
    I listed the improvements which I thought would be useful under
    https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am
    not able to devote as much time as I would like to on it. I would
    appreciate if folks who have some more time if they can help out.
    
    I think to start with https://issues.apache.org/jira/browse/HIVE-19429 will
    help a lot. We need to pack more test runs in parallel and containers
    provide good isolation.
    
    For the short term green runs, I think we should @Ignore the tests which
    are known to be failing since many runs. They are anyways not being
    addressed as such. If people think they are important to be run we should
    fix them and only then re-enable them.
    
    Also, I feel we need light-weight test run which we can run locally before
    submitting it for the full-suite. That way minor issues with the patch can
    be handled locally. May be create a profile which runs a subset of
    important tests which are consistent. We can apply some label that
    pre-checkin-local tests are runs successful and only then we submit for the
    full-suite.
    
    More thoughts are welcome. Thanks for starting this conversation.
    
    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    jcamacho@apache.org<ma...@apache.org>> wrote:
    
    I believe we have reached a state (maybe we did reach it a while ago) that
    is not sustainable anymore, as there are so many tests failing / timing out
    that it is not possible to verify whether a patch is breaking some critical
    parts of the system or not. It also seems to me that due to the timeouts
    (maybe due to infra, maybe not), ptest runs are taking even longer than
    usual, which in turn creates even longer queue of patches.
    
    There is an ongoing effort to improve ptests usability (
    https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
    we need to make an effort to stabilize existing tests and bring that
    failure count to zero.
    
    Hence, I am suggesting *we stop committing any patch before we get a green
    run*. If someone thinks this proposal is too radical, please come up with
    an alternative, because I do not think it is OK to have the ptest runs in
    their current state. Other projects of certain size (e.g., Hadoop, Spark)
    are always green, we should be able to do the same.
    
    Finally, once we get to zero failures, I suggest we are less tolerant with
    committing without getting a clean ptests run. If there is a failure, we
    need to fix it or revert the patch that caused it, then we continue
    developing.
    
    Please, let’s all work together as a community to fix this issue, that is
    the only way to get to zero quickly.
    
    Thanks,
    Jesús
    
    PS. I assume the flaky tests will come into the discussion. Let´s see
    first how many of those we have, then we can work to find a fix.
    
    
    
    
    



Re: [DISCUSS] Unsustainable situation with ptests

Posted by Vineet Garg <vg...@hortonworks.com>.
+1. I strongly vote for freezing commits and getting our test coverage into an acceptable state. We have been struggling to stabilize branch-3 due to test failures, and releasing Hive 3.0 in its current state would be unacceptable.

Currently there are quite a few test suites which are not even running because they are timing out. We have been committing patches (to both branch-3 and master) without test coverage from these suites.
We should immediately figure out what’s going on before we proceed with commits.

For reference, the following test suites are timing out on master (https://issues.apache.org/jira/browse/HIVE-19506):


TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)

TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)

TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)

TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)

TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)

TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
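
As a rough sketch of one way to keep a hanging suite from swallowing a whole batch, JUnit 4's Timeout rule can put a per-method time budget on a test class so it fails fast and still produces a TEST-*.xml report (the class below is one of the suites listed above, but the method name and the 10-minute budget are only illustrative):

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.Timeout;

    public class TestTxnExIm {

      // Illustrative per-test budget: a method that exceeds it fails with a
      // timeout instead of hanging until the ptest batch itself is killed.
      @Rule
      public Timeout perTestTimeout = Timeout.seconds(600);

      @Test
      public void testExportImport() throws Exception {
        // original test body unchanged
      }
    }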


Vineet


On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vi...@cloudera.com>> wrote:

+1 There are many problems with the test infrastructure and in my opinion
it has not become number one bottleneck for the project. I was looking at
the infrastructure yesterday and I think the current infrastructure (even
its own set of problems) is still under-utilized. I am planning to increase
the number of threads to process the parallel test batches to start with.
It needs a restart on the server side. I can do it now, it folks are okay
with it. Else I can do it over weekend when the queue is small.

I listed the improvements which I thought would be useful under
https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am
not able to devote as much time as I would like to on it. I would
appreciate if folks who have some more time if they can help out.

I think to start with https://issues.apache.org/jira/browse/HIVE-19429 will
help a lot. We need to pack more test runs in parallel and containers
provide good isolation.

For the short term green runs, I think we should @Ignore the tests which
are known to be failing since many runs. They are anyways not being
addressed as such. If people think they are important to be run we should
fix them and only then re-enable them.

Also, I feel we need light-weight test run which we can run locally before
submitting it for the full-suite. That way minor issues with the patch can
be handled locally. May be create a profile which runs a subset of
important tests which are consistent. We can apply some label that
pre-checkin-local tests are runs successful and only then we submit for the
full-suite.

More thoughts are welcome. Thanks for starting this conversation.

On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
jcamacho@apache.org<ma...@apache.org>> wrote:

I believe we have reached a state (maybe we did reach it a while ago) that
is not sustainable anymore, as there are so many tests failing / timing out
that it is not possible to verify whether a patch is breaking some critical
parts of the system or not. It also seems to me that due to the timeouts
(maybe due to infra, maybe not), ptest runs are taking even longer than
usual, which in turn creates even longer queue of patches.

There is an ongoing effort to improve ptests usability (
https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
we need to make an effort to stabilize existing tests and bring that
failure count to zero.

Hence, I am suggesting *we stop committing any patch before we get a green
run*. If someone thinks this proposal is too radical, please come up with
an alternative, because I do not think it is OK to have the ptest runs in
their current state. Other projects of certain size (e.g., Hadoop, Spark)
are always green, we should be able to do the same.

Finally, once we get to zero failures, I suggest we are less tolerant with
committing without getting a clean ptests run. If there is a failure, we
need to fix it or revert the patch that caused it, then we continue
developing.

Please, let’s all work together as a community to fix this issue, that is
the only way to get to zero quickly.

Thanks,
Jesús

PS. I assume the flaky tests will come into the discussion. Let´s see
first how many of those we have, then we can work to find a fix.





Re: [DISCUSS] Unsustainable situation with ptests

Posted by Vihang Karajgaonkar <vi...@cloudera.com>.
Correction to my email below: I meant "in my opinion it *has now* become
number one bottleneck for the project" (worst place for a typo, I guess).

On Fri, May 11, 2018 at 1:46 PM, Vihang Karajgaonkar <vi...@cloudera.com>
wrote:

> +1 There are many problems with the test infrastructure and in my opinion
> it has not become number one bottleneck for the project. I was looking at
> the infrastructure yesterday and I think the current infrastructure (even
> its own set of problems) is still under-utilized. I am planning to increase
> the number of threads to process the parallel test batches to start with.
> It needs a restart on the server side. I can do it now, it folks are okay
> with it. Else I can do it over weekend when the queue is small.
>
> I listed the improvements which I thought would be useful under
> https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I
> am not able to devote as much time as I would like to on it. I would
> appreciate if folks who have some more time if they can help out.
>
> I think to start with https://issues.apache.org/jira/browse/HIVE-19429
> will help a lot. We need to pack more test runs in parallel and containers
> provide good isolation.
>
> For the short term green runs, I think we should @Ignore the tests which
> are known to be failing since many runs. They are anyways not being
> addressed as such. If people think they are important to be run we should
> fix them and only then re-enable them.
>
> Also, I feel we need light-weight test run which we can run locally before
> submitting it for the full-suite. That way minor issues with the patch can
> be handled locally. May be create a profile which runs a subset of
> important tests which are consistent. We can apply some label that
> pre-checkin-local tests are runs successful and only then we submit for the
> full-suite.
>
> More thoughts are welcome. Thanks for starting this conversation.
>
> On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
> jcamacho@apache.org> wrote:
>
>> I believe we have reached a state (maybe we did reach it a while ago)
>> that is not sustainable anymore, as there are so many tests failing /
>> timing out that it is not possible to verify whether a patch is breaking
>> some critical parts of the system or not. It also seems to me that due to
>> the timeouts (maybe due to infra, maybe not), ptest runs are taking even
>> longer than usual, which in turn creates even longer queue of patches.
>>
>> There is an ongoing effort to improve ptests usability (
>> https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
>> we need to make an effort to stabilize existing tests and bring that
>> failure count to zero.
>>
>> Hence, I am suggesting *we stop committing any patch before we get a
>> green run*. If someone thinks this proposal is too radical, please come up
>> with an alternative, because I do not think it is OK to have the ptest runs
>> in their current state. Other projects of certain size (e.g., Hadoop,
>> Spark) are always green, we should be able to do the same.
>>
>> Finally, once we get to zero failures, I suggest we are less tolerant
>> with committing without getting a clean ptests run. If there is a failure,
>> we need to fix it or revert the patch that caused it, then we continue
>> developing.
>>
>> Please, let’s all work together as a community to fix this issue, that is
>> the only way to get to zero quickly.
>>
>> Thanks,
>> Jesús
>>
>> PS. I assume the flaky tests will come into the discussion. Let´s see
>> first how many of those we have, then we can work to find a fix.
>>
>>
>>
>

Re: [DISCUSS] Unsustainable situation with ptests

Posted by Vihang Karajgaonkar <vi...@cloudera.com>.
+1 There are many problems with the test infrastructure and in my opinion
it has not become number one bottleneck for the project. I was looking at
the infrastructure yesterday and I think the current infrastructure (even
with its own set of problems) is still under-utilized. I am planning to increase
the number of threads to process the parallel test batches to start with.
It needs a restart on the server side. I can do it now, if folks are okay
with it. Else I can do it over the weekend when the queue is small.

I listed the improvements which I thought would be useful under
https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am
not able to devote as much time to it as I would like. I would
appreciate it if folks who have some more time could help out.

I think https://issues.apache.org/jira/browse/HIVE-19429 will
help a lot to start with. We need to pack more test runs in parallel, and
containers provide good isolation.

For the short-term green runs, I think we should @Ignore the tests which
are known to have been failing for many runs. They are not being
addressed anyway. If people think they are important to run, we should
fix them and only then re-enable them.
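
A minimal sketch of what that could look like with JUnit 4's @Ignore, pointing at the JIRA that tracks the disabled tests so they are not forgotten (the test class and method names below are placeholders, not a specific change being proposed):

    import org.junit.Ignore;
    import org.junit.Test;

    public class TestHCatHiveCompatibility {

      // Known-flaky / timing out; disabled temporarily to get to a green run.
      @Ignore("Tracked in HIVE-19509; re-enable once the underlying issue is fixed")
      @Test
      public void testUnpartedReadWrite() throws Exception {
        // original test body unchanged
      }
    }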

Also, I feel we need a light-weight test run which we can run locally before
submitting a patch for the full suite. That way minor issues with the patch can
be handled locally. Maybe create a profile which runs a subset of
important tests which are known to be consistent. We could apply some label
indicating that the local pre-checkin tests ran successfully, and only then
submit for the full suite.
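
One possible shape for that, assuming we went with JUnit categories (the marker interface, test class, and method names below are made up for illustration and do not exist in Hive today): tag a small set of fast, deterministic tests with a marker category, and have a pre-checkin profile run only that group through surefire's category filtering.

    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    /** Marker for fast, deterministic tests worth running locally before submitting a patch. */
    interface PreCheckinTest {}

    public class TestExampleParser {

      @Category(PreCheckinTest.class)
      @Test
      public void testSimpleSelectParses() throws Exception {
        // a quick, reliable check suitable for a local smoke run
      }
    }

Something like mvn test -Dgroups=<fully qualified PreCheckinTest name>, or a profile that sets surefire's <groups> configuration, would then run only the tagged subset locally before submitting the patch for the full suite.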

More thoughts are welcome. Thanks for starting this conversation.

On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
jcamacho@apache.org> wrote:

> I believe we have reached a state (maybe we did reach it a while ago) that
> is not sustainable anymore, as there are so many tests failing / timing out
> that it is not possible to verify whether a patch is breaking some critical
> parts of the system or not. It also seems to me that due to the timeouts
> (maybe due to infra, maybe not), ptest runs are taking even longer than
> usual, which in turn creates even longer queue of patches.
>
> There is an ongoing effort to improve ptests usability (
> https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
> we need to make an effort to stabilize existing tests and bring that
> failure count to zero.
>
> Hence, I am suggesting *we stop committing any patch before we get a green
> run*. If someone thinks this proposal is too radical, please come up with
> an alternative, because I do not think it is OK to have the ptest runs in
> their current state. Other projects of certain size (e.g., Hadoop, Spark)
> are always green, we should be able to do the same.
>
> Finally, once we get to zero failures, I suggest we are less tolerant with
> committing without getting a clean ptests run. If there is a failure, we
> need to fix it or revert the patch that caused it, then we continue
> developing.
>
> Please, let’s all work together as a community to fix this issue, that is
> the only way to get to zero quickly.
>
> Thanks,
> Jesús
>
> PS. I assume the flaky tests will come into the discussion. Let´s see
> first how many of those we have, then we can work to find a fix.
>
>
>