Posted to dev@gossip.apache.org by Edward Capriolo <ed...@gmail.com> on 2017/01/07 22:39:35 UTC

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

I have rebased GOSSIP-22.

@Chia-Hung, comparing the implementations, the only difference I see is the use
of ExponentialDistributionImpl vs NormalDistributionImpl. I experimented
with both and did not find one better than the other. I think it is a
simple follow-up to make that pluggable via configuration.
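
For anyone following along, here is a minimal sketch of what the
exponential variant boils down to. It is not the GOSSIP-22 code: the class
and method names are hypothetical, and to stay self-contained it uses the
closed form of the exponential CDF instead of calling commons-math. It
assumes phi is defined as -log10 of the probability that a heartbeat
arrives later than the time already elapsed, with the mean taken over a
sliding window of observed intervals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an accrual-style detector; not the GOSSIP-22 code. */
class ExponentialPhiSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMs = -1;
    private long intervalSum = 0;

    ExponentialPhiSketch(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        this.windowSize = windowSize;
    }

    /** Record a heartbeat and keep only the most recent windowSize intervals. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            long interval = nowMs - lastHeartbeatMs;
            intervals.addLast(interval);
            intervalSum += interval;
            if (intervals.size() > windowSize) {
                intervalSum -= intervals.removeFirst();
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /**
     * Suspicion level under an exponential model of inter-arrival times:
     * P(interval > elapsed) = exp(-elapsed / mean), so
     * phi = -log10(P) = elapsed / (mean * ln 10).
     */
    double phi(long nowMs) {
        if (intervals.isEmpty()) {
            return 0.0; // no interval observed yet, so nothing to suspect
        }
        double mean = (double) intervalSum / intervals.size();
        if (mean <= 0) {
            return 0.0; // degenerate window (identical timestamps); sketch only
        }
        double elapsed = nowMs - lastHeartbeatMs;
        return elapsed / (mean * Math.log(10.0));
    }

    public static void main(String[] args) {
        ExponentialPhiSketch detector = new ExponentialPhiSketch(100);
        for (long t = 0; t <= 3000; t += 1000) {
            detector.heartbeat(t);
        }
        System.out.println("phi one second after last heartbeat: " + detector.phi(4000));
        System.out.println("phi ten seconds after last heartbeat: " + detector.phi(13000));
    }
}
```

A caller would compare phi against a configured threshold to decide when a
member is down. Swapping in a normal distribution would only change the
tail-probability line, which is why making it configurable looks like a
small follow-up.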

I made sure all the integration tests use different ports so that one
failure does not cascade to other tests.

I have also upped the wait time for the exit condition of the
problematic ShutdownDeadtimeTest. It is longer than I would like.

However, this implementation is really nice in that it uses far fewer
threads! It would be great if anyone can pore over this implementation and
pick it apart. We can always fine-tune from here.



On Fri, Dec 2, 2016 at 3:15 AM, Chia-Hung Lin <cl...@googlemail.com> wrote:

> Shameless plug for code written a long time ago. Didn't find a chance to
> modularize it. But feel free to use it, as it's licensed under Apache 2.
>
> [1] https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd
>
> On Friday, 2 December 2016, P. Taylor Goetz <pt...@gmail.com> wrote:
>
> > There's not a lot of code there. Could it be reimplemented in gossip
> > without infringing on any copyrights?
> >
> > -Taylor
> >
> > > On Dec 1, 2016, at 6:21 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> > >
> > > > I reached out to the initial author of the failure library to see if
> > > > they would consider contributing it. I have not heard back.
> > >
> > > > The library itself consists of two functions, with no unit testing,
> > > > and those functions lean heavily on commons-math. I think the
> > > > signatures and the return types are not set up in a way that is
> > > > natural for us to leverage. I think it is best we simply write the
> > > > code to execute the failure detector logic ourselves. We can give it
> > > > the method signature we want and provide our own direct testing.
> > >
> > > > If anyone sees an alternative library let me know. Remember the
> > > > algorithm itself is essentially a one-liner on top of commons-math
> > > > parts.
> > >
> > > Thanks,
> > > Edward
> > >
> > > > On Thu, Nov 17, 2016 at 1:49 PM, chandresh pancholi <
> > > > chandreshpancholi007@gmail.com> wrote:
> > >
> > > >> https://github.com/apache/incubator-gossip/compare/master...edwardcapriolo:GOSSIP-22?expand=1
> > >> Try the whole URL.
> > >>
> > >> Thanks
> > >>
> > > >> On Thu, Nov 17, 2016 at 11:15 PM, Sandeep More <moresandeep@gmail.com>
> > > >> wrote:
> > >>
> > >>> Hello Edward,
> > >>>
> > > >>> Sorry for jumping in late, I tried to look at the URL you gave, it
> > > >>> says "There isn’t anything to compare."
> > >>>
> > >>> BTW https://github.com/arosien/failure looks great !
> > >>>
> > >>> Best,
> > >>> Sandeep
> > >>>
> > >>>
> > > >>> On Thu, Nov 17, 2016 at 11:52 AM, Edward Capriolo <edlinuxguru@gmail.com>
> > > >>> wrote:
> > >>>
> > > >>>> If someone gets a chance please review. It turned out to be a little
> > > >>>> easier than I thought:
> > > >>>>
> > > >>>> https://github.com/apache/incubator-gossip/compare/master...edwardcapriolo:GOSSIP-22?expand=1
> > >>>>
> > >>>> Leveraging the code here:
> > >>>>
> > >>>> https://github.com/arosien/failure
> > >>>>
> > > >>>> I attempted to contact the author of failure (ASF V2) to see if he
> > > >>>> wants to contribute the code. (not in maven) We have other options
> > > >>>> like fork and package etc.
> > > >>>>
> > > >>>> Let's hold off the merge of this until after the release.
> > >>>>
> > >>>> Thanks,
> > >>>> Edward
> > >>>>
> > > >>>> On Tue, Nov 15, 2016 at 10:42 PM, chandresh pancholi <
> > > >>>> chandreshpancholi007@gmail.com> wrote:
> > >>>>
> > >>>>> I will also look into it.
> > >>>>>
> > > >>>>> On Wed, Nov 16, 2016 at 5:53 AM, Edward Capriolo <edlinuxguru@gmail.com>
> > > >>>>> wrote:
> > >>>>>
> > > >>>>>> This seems interesting and has a low bar to entry:
> > >>>>>>
> > >>>>>> https://github.com/arosien/failure
> > >>>>>>
> > > >>>>>> On Tue, Nov 15, 2016 at 4:01 PM, Edward Capriolo <edlinuxguru@gmail.com>
> > > >>>>>> wrote:
> > >>>>>>
> > > >>>>>>> I was doing some load testing and I found that the current gating
> > > >>>>>>> factor for max instances running in the same JVM is the JMX-based
> > > >>>>>>> notification system the failure detector uses.
> > > >>>>>>>
> > > >>>>>>> Currently a cluster of N requires N * (N-1) JMX notification
> > > >>>>>>> threads. I started attempting to remove this limit without going
> > > >>>>>>> into building the accrual failure detector (22) but there were some
> > > >>>>>>> nuanced bugs and I backed off because it did not seem worth the
> > > >>>>>>> change.
> > > >>>>>>>
> > > >>>>>>> If anyone has any literature to contribute about building a
> > > >>>>>>> consensus-based failure detector please discuss. Once we cut this
> > > >>>>>>> release that is likely where I will spend my attention.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Edward
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Chandresh Pancholi
> > >>>>> Senior Software Engineer
> > >>>>> Flipkart.com
> > > >>>>> Email-id:chandresh.pancholi@flipkart.com
> > >>>>> Contact:08951803660
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Chandresh Pancholi
> > >> Senior Software Engineer
> > >> Flipkart.com
> > > >> Email-id:chandresh.pancholi@flipkart.com
> > >> Contact:08951803660
> > >>
> >
>