Posted to dev@reef.apache.org by Byung-Gon Chun <bg...@gmail.com> on 2017/03/10 00:04:54 UTC

Re: 0.16?

Taegeon, thanks for following up on the issues.
Do you have any update on the Java side?
I've seen message exchanges about the .NET CI issues.

Thanks.
-Gon

On Tue, Feb 21, 2017 at 12:09 PM, Tae-Geon Um <ta...@gmail.com> wrote:

> Thanks Mariia for the explanation.
>
> For transient failures, we should handle the two issues on Java side
> - REEF-1668
> - REEF-1729
>
> and four issues on .NET side
> - REEF-1462
> - REEF-1473
> - REEF-1622
> - REEF-1723
>
> Sergiy is assigned to the two issues on Java side (and I think Saikat is
> willing to help), but no one is assigned to the .NET side issues yet.
>
> Is there anyone who can handle the .NET side issues?
>
> Thanks,
> Taegeon
>
> > On Feb 18, 2017, at 2:18 AM, Mariia Mykhailova <ma...@microsoft.com.INVALID> wrote:
> >
> > For the high availability feature, the fixes that allow running the
> > DriverRestart example on our YARN test clusters are in master. However,
> > to the best of my knowledge, nobody has tried to use HA in production
> > code/real-life scenarios yet.
> >
> >
> > For transient test failures in CI, there are two issues on the Java side
> > (REEF-1668 and REEF-1729) and a whole bunch of issues on the .NET side
> > (umbrella REEF-1462). The ones on the .NET side can't be reproduced
> > locally, so you have to set up an instance of AppVeyor for your fork of
> > the REEF repository, as described in
> > https://github.com/apache/reef/blob/master/lang/cs/BUILD.md
> >
> >
> > -Mariia
> >
> > ________________________________
> > From: Saikat Kanjilal <sxk1969@gmail.com>
> > Sent: Thursday, February 16, 2017 8:07:40 PM
> > To: dev@reef.apache.org
> > Subject: Re: 0.16?
> >
> > Sergiy,
> > I definitely have more experience with Java than .NET. Maybe this is a
> > JIRA that I can also add to my collection to help you. It might be a good
> > case for pair coding as well. Let me know how you want to move forward.
> > Thanks
> >
> > Sent from my iPad
> >
> >> On Feb 16, 2017, at 6:23 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
> >>
> >> Hi Saikat,
> >>
> >> The cleanup work is purely Java, so if you are working on the .NET side
> >> of things, I don't see much sense in switching environments just for
> >> these issues. Still, it would be nice to get some help - maybe there are
> >> volunteers willing to debug some race conditions in Java and on YARN?
> >>
> >> Thank you,
> >> Sergiy.
> >>
> >>> On Thu, Feb 16, 2017 at 6:11 PM, Saikat Kanjilal <sx...@gmail.com> wrote:
> >>>
> >>> Me and my big mouth :))))), just kidding. I am already working on the
> >>> .NET Core 2.0 conversion JIRAs; what sort of dev/test help can I provide?
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On Feb 16, 2017, at 5:41 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
> >>>>
> >>>> Hi Saikat,
> >>>>
> >>>> The failures are sporadic and most likely are due to some race
> >>>> conditions during the cleanup process. You don't need CI to replicate
> >>>> them, but we need to debug the issues not only in local mode, but also
> >>>> on YARN (and, ideally, for all other runtimes that we provide). A good
> >>>> indicator of successful cleanup would be JIRA issue
> >>>> https://issues.apache.org/jira/browse/REEF-1715 - when all threads are
> >>>> closed properly, we would no longer need System.exit() call at the end
> >>>> of the Driver or Evaluator processes (regardless of the runtime). Would
> >>>> you be interested in helping me with that part?
> >>>>
> >>>> Thank you,
> >>>> Sergiy.
> >>>>
> >>>>
> >>>>> On Thu, Feb 16, 2017 at 5:29 PM, Saikat Kanjilal <sx...@gmail.com> wrote:
> >>>>>
> >>>>> Out of curiosity, have we been able to replicate these failures
> >>>>> locally? I am wondering whether there's a need to have a local
> >>>>> version of a Travis CI setup.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On Feb 16, 2017, at 5:22 PM, Boris Shulman <sh...@gmail.com> wrote:
> >>>>>>
> >>>>>> Is AM HA part of 0.16?
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On Feb 16, 2017, at 12:22 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Markus,
> >>>>>>>
> >>>>>>> I think we can safely announce that Unmanaged AM and REEF-on-REEF
> >>>>>>> will be part of 0.16, but the bugs that Mariia mentions prevent us
> >>>>>>> from calling this release REEF-as-a-Library. Even for the unmanaged
> >>>>>>> AM, I need some time (likely till the end of this sprint) to make
> >>>>>>> sure that Unmanaged AM works properly on Hadoop 2.7.3 and above.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Sergiy.
> >>>>>>>
> >>>>>>> On Thu, Feb 16, 2017 at 9:43 AM, Mariia Mykhailova <mamykhai@microsoft.com.invalid> wrote:
> >>>>>>>
> >>>>>>>> There are several transient test failures in both Java and .NET
> >>>>>>>> tests, and a Travis CI job timeout (which indicates hidden problems
> >>>>>>>> in terminating Java REEF jobs), which we've introduced since 0.15.
> >>>>>>>> I don't think we should do a release with these issues
> >>>>>>>> uninvestigated, especially the Travis timeout. For now I've marked
> >>>>>>>> them as blocking REEF-1444.
> >>>>>>>>
> >>>>>>>> -Mariia
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ________________________________
> >>>>>>>> From: Markus Weimer <ma...@weimo.de>
> >>>>>>>> Sent: Thursday, February 16, 2017 9:31:35 AM
> >>>>>>>> To: REEF Developers Mailinglist
> >>>>>>>> Subject: 0.16?
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> where are we in the process of releasing 0.16? In other words: If
> >>>>>>>> we called the release today, what amazing feature that is on the
> >>>>>>>> cusp of getting in would we lose?
> >>>>>>>>
> >>>>>>>> I'm not suggesting to literally do it today, but a release around
> >>>>>>>> the VS 2017 availability would be convenient for us to switch to
> >>>>>>>> the new build system and all early in the works towards 0.17.
> >>>>>>>>
> >>>>>>>> Markus
>
>


-- 
Byung-Gon Chun

Re: 0.16?

Posted by Tae-Geon Um <ta...@gmail.com>.
Yes, Sergiy gave details about the Java side issues (https://issues.apache.org/jira/browse/REEF-1729), and I’m going to take a look at them.

Thanks.
Taegeon  

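
Note on the cleanup point raised in this thread (REEF-1715: once all threads are closed properly, the System.exit() call at the end of the Driver or Evaluator processes should no longer be needed). This follows from general JVM behavior: the JVM exits on its own once the last non-daemon thread has finished. Below is a minimal Java sketch of that idea, assuming a plain ExecutorService. The class and method names are hypothetical and are not taken from the REEF code base; the actual REEF shutdown path may look quite different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from REEF.
public class CleanShutdownSketch {

  /**
   * Runs the given number of trivial tasks on a thread pool, then shuts the
   * pool down. Returns the number of completed tasks, or -1 if the pool's
   * threads did not stop within the timeout (or the caller was interrupted).
   */
  static int runAndCleanUp(final int tasks) {
    // The default thread factory creates non-daemon threads, which keep
    // the JVM alive until they terminate.
    final ExecutorService pool = Executors.newFixedThreadPool(2);
    final AtomicInteger completed = new AtomicInteger();
    for (int i = 0; i < tasks; i++) {
      pool.execute(() -> {
        completed.incrementAndGet(); // placeholder for real work
      });
    }
    pool.shutdown(); // accept no new tasks; workers exit once the queue drains
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        return -1; // some thread is still running: cleanup is incomplete
      }
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return -1;
    }
    return completed.get();
  }

  public static void main(final String[] args) {
    System.out.println("completed tasks: " + runAndCleanUp(4));
    // No System.exit() here: with the pool terminated, no non-daemon thread
    // is left, so the JVM exits when main() returns.
  }
}
```

If the pool.shutdown() call were omitted, the idle worker threads would keep the process alive after main() returns, which is the kind of hang that a trailing System.exit() papers over.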