Posted to dev@gossip.apache.org by Edward Capriolo <ed...@gmail.com> on 2016/12/01 23:21:59 UTC

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

I reached out to the initial author of the failure library to see if they
would consider contributing it. I have not heard back.

The library itself consists of two functions, with no unit tests, and
those functions lean heavily on commons-math. I think the signatures and
the return types are not set up in a way that is natural for us to leverage.
I think it is best that we simply write the failure detector logic
ourselves. We can give it the method signature we want and provide
our own direct testing.

If anyone sees an alternative library, let me know. Remember, the algorithm
itself is essentially a one-liner on top of commons-math parts.
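
To make the "one-liner" remark concrete, here is a minimal sketch of an accrual
failure detector for a single peer. It is not the GOSSIP-22 code and not the
failure library's API: the class name PhiAccrualSketch, its methods, and the
window size are all hypothetical. It also assumes an exponential model of
heartbeat gaps computed by hand rather than through commons-math, so that the
example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of an accrual failure detector for one peer. */
class PhiAccrualSketch {
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMs = -1;

    PhiAccrualSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Record a heartbeat arrival and remember the gap since the previous one. */
    void heartbeat(long nowMs) {
        if (lastHeartbeatMs >= 0) {
            intervalsMs.addLast(nowMs - lastHeartbeatMs);
            if (intervalsMs.size() > windowSize) {
                intervalsMs.removeFirst(); // keep only the most recent gaps
            }
        }
        lastHeartbeatMs = nowMs;
    }

    /** Suspicion level: phi = -log10(P(gap > elapsed)) under an exponential model. */
    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) {
            return 0.0; // not enough samples to suspect anything yet
        }
        double mean = intervalsMs.stream().mapToLong(Long::longValue).average().orElse(1.0);
        mean = Math.max(mean, 1.0); // guard against a zero mean
        double elapsed = nowMs - lastHeartbeatMs;
        // Exponential survival is exp(-elapsed / mean), so
        // -log10(exp(-elapsed / mean)) simplifies to the line below.
        return elapsed / (mean * Math.log(10.0));
    }
}
```

A caller would compare phi against a threshold of its choosing and treat the
peer as suspect once the value crosses it. Because phi is computed on demand
from stored gaps, nothing here needs a thread per peer.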

Thanks,
Edward

On Thu, Nov 17, 2016 at 1:49 PM, chandresh pancholi <
chandreshpancholi007@gmail.com> wrote:

> https://github.com/apache/incubator-gossip/compare/master...edwardcapriolo:GOSSIP-22?expand=1
> Try the whole URL.
>
> Thanks
>
> On Thu, Nov 17, 2016 at 11:15 PM, Sandeep More <mo...@gmail.com>
> wrote:
>
> > Hello Edward,
> >
> > > Sorry for jumping in late. I tried to look at the URL you gave, but it says
> > > "There isn’t anything to compare."
> > >
> > > BTW https://github.com/arosien/failure looks great!
> >
> > Best,
> > Sandeep
> >
> >
> > > On Thu, Nov 17, 2016 at 11:52 AM, Edward Capriolo <edlinuxguru@gmail.com>
> > > wrote:
> > >
> > > > If someone gets a chance, please review. It turned out to be a little
> > > > easier than I thought:
> > > >
> > > > https://github.com/apache/incubator-gossip/compare/master...edwardcapriolo:GOSSIP-22?expand=1
> > >
> > > Leveraging the code here:
> > >
> > > https://github.com/arosien/failure
> > >
> > > > I attempted to contact the author of failure (ASF V2) to see if he wants
> > > > to contribute the code (it is not in Maven). We have other options, such
> > > > as forking and packaging it.
> > > >
> > > > Let's hold off on merging this until after the release.
> > >
> > > Thanks,
> > > Edward
> > >
> > > On Tue, Nov 15, 2016 at 10:42 PM, chandresh pancholi <
> > > chandreshpancholi007@gmail.com> wrote:
> > >
> > > > I will also look into it.
> > > >
> > > > > On Wed, Nov 16, 2016 at 5:53 AM, Edward Capriolo <edlinuxguru@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > This seems interesting and has a low barrier to entry:
> > > > >
> > > > > https://github.com/arosien/failure
> > > > >
> > > > > > On Tue, Nov 15, 2016 at 4:01 PM, Edward Capriolo <edlinuxguru@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I was doing some load testing and found that the current gating
> > > > > > > factor for the maximum number of instances running in the same JVM
> > > > > > > is the JMX-based notification system the failure detector uses.
> > > > > > >
> > > > > > > Currently a cluster of N requires N * (N-1) JMX notification
> > > > > > > threads. I started attempting to remove this limit without building
> > > > > > > the accrual failure detector (GOSSIP-22), but there were some
> > > > > > > nuanced bugs and I backed off because it did not seem worth the
> > > > > > > change.
> > > > > > >
> > > > > > > If anyone has any literature to contribute about building a
> > > > > > > consensus-based failure detector, please discuss. Once we cut this
> > > > > > > release, that is likely where I will spend my attention.
> > > > > >
> > > > > > Thanks,
> > > > > > Edward
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Chandresh Pancholi
> > > > Senior Software Engineer
> > > > Flipkart.com
> > > > Email-id:chandresh.pancholi@flipkart.com
> > > > Contact:08951803660
> > > >
> > >
> >
>
>
>
> --
> Chandresh Pancholi
> Senior Software Engineer
> Flipkart.com
> Email-id:chandresh.pancholi@flipkart.com
> Contact:08951803660
>

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

Posted by Edward Capriolo <ed...@gmail.com>.
I have rebased GOSSIP-22.

@Chia-Hung, comparing implementations, the only difference I see is the use
of ExponentialDistributionImpl vs NormalDistributionImpl. I experimented
with both and did not find one better than the other. I think it is a
simple follow-up to make that pluggable via configuration.
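
One possible shape for that follow-up is sketched below, with the caveat that
every name in it (ArrivalModel, ExponentialModel, NormalModel, ArrivalModels,
and the configuration values "exponential" and "normal") is hypothetical and
not taken from the GOSSIP-22 branch. To stay self-contained it does not call
commons-math; the normal case uses a logistic approximation of the normal CDF
instead of NormalDistributionImpl.

```java
import java.util.Locale;

/** Hypothetical strategy: probability that a heartbeat gap exceeds elapsedMs. */
interface ArrivalModel {
    double survival(double elapsedMs, double meanMs, double stdDevMs);
}

class ExponentialModel implements ArrivalModel {
    public double survival(double elapsedMs, double meanMs, double stdDevMs) {
        // 1 - CDF of an exponential distribution with the given mean.
        return Math.exp(-elapsedMs / meanMs);
    }
}

class NormalModel implements ArrivalModel {
    public double survival(double elapsedMs, double meanMs, double stdDevMs) {
        double y = (elapsedMs - meanMs) / stdDevMs;
        // Logistic approximation of the normal CDF (an assumption made here
        // to avoid needing an erf implementation).
        double cdf = 1.0 / (1.0 + Math.exp(-y * (1.5976 + 0.070566 * y * y)));
        return 1.0 - cdf;
    }
}

class ArrivalModels {
    /** Map a configuration value (assumed: "exponential" or "normal") to a model. */
    static ArrivalModel fromConfig(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "exponential":
                return new ExponentialModel();
            case "normal":
                return new NormalModel();
            default:
                throw new IllegalArgumentException("unknown distribution: " + name);
        }
    }

    /** phi = -log10(probability that the peer is merely late). */
    static double phi(ArrivalModel model, double elapsedMs, double meanMs, double stdDevMs) {
        return -Math.log10(model.survival(elapsedMs, meanMs, stdDevMs));
    }
}
```

With this split the detector only depends on ArrivalModel, so comparing the two
distributions becomes a matter of changing one configuration value.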

I made sure all the integration tests use different ports so that one
failure does not cascade to other tests.

I have also increased the wait time for the exit condition of the
problematic ShutdownDeadtimeTest. It is longer than I would like.

However, this implementation is really nice in that it uses far fewer
threads! It would be great if anyone could pore over this implementation and
pick it apart. We can always fine-tune from here.



On Fri, Dec 2, 2016 at 3:15 AM, Chia-Hung Lin <cl...@googlemail.com> wrote:

> A shameless plug for code I wrote a long time ago. I never found a chance to
> modularize it, but feel free to use it, as it is licensed under Apache 2.
>
> [1].
> https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd
>
> On Friday, 2 December 2016, P. Taylor Goetz <pt...@gmail.com> wrote:
>
> > There's not a lot of code there. Could it be reimplemented in gossip
> > without infringing on any copyrights?
> >
> > -Taylor
> >
> > > On Dec 1, 2016, at 6:21 PM, Edward Capriolo <edlinuxguru@gmail.com
> > <javascript:;>> wrote:
> > >
> > > I reached out to the initial author of the failure library to see if
> they
> > > would consider contributing it and I. I have not heard back.
> > >
> > > The library itself is comprised of two functions, with no unit testing,
> > and
> > > those functions lean heavily on commons-math. I think the signatures
> and
> > > the return types are not setup in a way that is natural for us to
> > leverage.
> > > I think it is best we simply write the code to execute the failure
> > detector
> > > logic ourselves.  We can make with a method signature we want and
> provide
> > > our own direct testing.
> > >
> > > If anyone sees an alternative library let me know. Remember the
> algorithm
> > > itself is essentially a one-liner on top of common-math parts.
> > >
> > > Thanks,
> > > Edward
> > >
> > > On Thu, Nov 17, 2016 at 1:49 PM, chandresh pancholi <
> > > chandreshpancholi007@gmail.com <javascript:;>> wrote:
> > >
> > >> https://github.com/apache/incubator-gossip/compare/
> > >> master...edwardcapriolo:GOSSIP-22?expand=1
> > >> Try the whole URL.
> > >>
> > >> Thanks
> > >>
> > >> On Thu, Nov 17, 2016 at 11:15 PM, Sandeep More <moresandeep@gmail.com
> > <javascript:;>>
> > >> wrote:
> > >>
> > >>> Hello Edward,
> > >>>
> > >>> Sorry for jumping in late, I tried to look at the URL you gave, it
> says
> > >>> "There isn’t anything to compare."
> > >>>
> > >>> BTW https://github.com/arosien/failure looks great !
> > >>>
> > >>> Best,
> > >>> Sandeep
> > >>>
> > >>>
> > >>> On Thu, Nov 17, 2016 at 11:52 AM, Edward Capriolo <
> > edlinuxguru@gmail.com <javascript:;>
> > >>>
> > >>> wrote:
> > >>>
> > >>>> If someone gets a chance please review. It turned out to be a little
> > >>> easier
> > >>>> then i thought:
> > >>>>
> > >>>> https://github.com/apache/incubator-gossip/compare/
> > >>> master...edwardcapriolo
> > >>>> :
> > >>>> GOSSIP-22?expand=1
> > >>>>
> > >>>> Leveraging the code here:
> > >>>>
> > >>>> https://github.com/arosien/failure
> > >>>>
> > >>>> I attempted to contact the author of failure (ASF V2) to see if he
> > >> wants
> > >>> to
> > >>>> contribute the code. (not in maven) We have other options like fork
> > and
> > >>>> package etc.
> > >>>>
> > >>>> Lets hold off the merge of this until after the release.
> > >>>>
> > >>>> Thanks,
> > >>>> Edward
> > >>>>
> > >>>> On Tue, Nov 15, 2016 at 10:42 PM, chandresh pancholi <
> > >>>> chandreshpancholi007@gmail.com <javascript:;>> wrote:
> > >>>>
> > >>>>> I will also look into it.
> > >>>>>
> > >>>>> On Wed, Nov 16, 2016 at 5:53 AM, Edward Capriolo <
> > >>> edlinuxguru@gmail.com <javascript:;>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> This seems interesting and low bar to entry:
> > >>>>>>
> > >>>>>> https://github.com/arosien/failure
> > >>>>>>
> > >>>>>> On Tue, Nov 15, 2016 at 4:01 PM, Edward Capriolo <
> > >>>> edlinuxguru@gmail.com <javascript:;>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> I was doing some load testing and I found the the current gating
> > >>>> factor
> > >>>>>>> for max instances running in the same JVM is limited by the JMX
> > >>> based
> > >>>>>>> notification system the failure detector uses.
> > >>>>>>>
> > >>>>>>> Currently a cluster of N requires N * (N-1) JMX notification
> > >>>> threads. I
> > >>>>>>> started attempting to remove this limit without going into
> > >> building
> > >>>> the
> > >>>>>>> accrual failure detector (22) but there were some nuanced bugs
> > >> and
> > >>> I
> > >>>>>> backed
> > >>>>>>> off because it did not seem worth the change.
> > >>>>>>>
> > >>>>>>> If anyone has an literature to contribute about building a
> > >>> consensus
> > >>>>>> based
> > >>>>>>> failure detector please discuss. Once we cut this release that is
> > >>>>> likely
> > >>>>>>> were I will spent my attention.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Edward
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Chandresh Pancholi
> > >>>>> Senior Software Engineer
> > >>>>> Flipkart.com
> > >>>>> Email-id:chandresh.pancholi@flipkart.com <javascript:;>
> > >>>>> Contact:08951803660
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Chandresh Pancholi
> > >> Senior Software Engineer
> > >> Flipkart.com
> > >> Email-id:chandresh.pancholi@flipkart.com <javascript:;>
> > >> Contact:08951803660
> > >>
> >
>

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

Posted by Chia-Hung Lin <cl...@googlemail.com>.
I'm happy to contribute, but I may not be able to soon, as I am tied up with
other tasks; so feel free to rework it, or I'll refactor it when I
have free time. And I am happy to see how it evolves.

On Friday, 2 December 2016, Edward Capriolo <ed...@gmail.com> wrote:

> On Fri, Dec 2, 2016 at 3:15 AM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
>
> > A shameless plug for code I wrote a long time ago. I never found a chance to
> > modularize it, but feel free to use it, as it is licensed under Apache 2.
> >
> > [1].
> > https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd
> >
> > On Friday, 2 December 2016, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> >
> > > There's not a lot of code there. Could it be reimplemented in gossip
> > > without infringing on any copyrights?
> > >
> > > -Taylor
>
> > There's not a lot of code there. Could it be reimplemented in gossip
> > without infringing on any copyrights?
>
> Yes. The paper details the algorithm (it is basically a one-liner).
>
> >>
> https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd
>
> This is interesting. The "math" parts are similar in both projects.
>
> Hama seems like a solid implementation. Some things I see as challenges:
> The FD code is coupled to the network code, and for our purposes we only want
> the logic.
> In the future we probably want to track some kind of removed state: UP,
> DOWN, REMOVED.
>
> It is really nice that it is done using concurrent collections instead
> of synchronized blocks.
>
> @Chia-Hung looking this over I see some interesting bits:
>
> I like how you can choose to be notified only about specific hosts, and how
> the notification is done with a callback.
> https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/monitor/fd/NodeEventListener.java
>
> This is more feature-rich than our current notifications, where you can only
> register a single listener and cannot pick which hosts to listen for.
>
> Obviously Hama's implementation is stable, but once we have a solid
> release or two under our belt, maybe we can see if Hama users are comfortable
> with leveraging what we are building.
>
> Good stuff!
>

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Dec 2, 2016 at 3:15 AM, Chia-Hung Lin <cl...@googlemail.com> wrote:

> A shameless plug for code I wrote a long time ago. I never found a chance to
> modularize it, but feel free to use it, as it is licensed under Apache 2.
>
> [1].
> https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd
>
> On Friday, 2 December 2016, P. Taylor Goetz <pt...@gmail.com> wrote:
>
> > There's not a lot of code there. Could it be reimplemented in gossip
> > without infringing on any copyrights?
> >
> > -Taylor
> >
> > > On Dec 1, 2016, at 6:21 PM, Edward Capriolo <edlinuxguru@gmail.com
> > <javascript:;>> wrote:
> > >
> > > I reached out to the initial author of the failure library to see if
> they
> > > would consider contributing it and I. I have not heard back.
> > >
> > > The library itself is comprised of two functions, with no unit testing,
> > and
> > > those functions lean heavily on commons-math. I think the signatures
> and
> > > the return types are not setup in a way that is natural for us to
> > leverage.
> > > I think it is best we simply write the code to execute the failure
> > detector
> > > logic ourselves.  We can make with a method signature we want and
> provide
> > > our own direct testing.
> > >
> > > If anyone sees an alternative library let me know. Remember the
> algorithm
> > > itself is essentially a one-liner on top of common-math parts.
> > >
> > > Thanks,
> > > Edward
> > >
> > > On Thu, Nov 17, 2016 at 1:49 PM, chandresh pancholi <
> > > chandreshpancholi007@gmail.com <javascript:;>> wrote:
> > >
> > >> https://github.com/apache/incubator-gossip/compare/
> > >> master...edwardcapriolo:GOSSIP-22?expand=1
> > >> Try the whole URL.
> > >>
> > >> Thanks
> > >>
> > >> On Thu, Nov 17, 2016 at 11:15 PM, Sandeep More <moresandeep@gmail.com
> > <javascript:;>>
> > >> wrote:
> > >>
> > >>> Hello Edward,
> > >>>
> > >>> Sorry for jumping in late, I tried to look at the URL you gave, it
> says
> > >>> "There isn’t anything to compare."
> > >>>
> > >>> BTW https://github.com/arosien/failure looks great !
> > >>>
> > >>> Best,
> > >>> Sandeep
> > >>>
> > >>>
> > >>> On Thu, Nov 17, 2016 at 11:52 AM, Edward Capriolo <
> > edlinuxguru@gmail.com <javascript:;>
> > >>>
> > >>> wrote:
> > >>>
> > >>>> If someone gets a chance please review. It turned out to be a little
> > >>> easier
> > >>>> then i thought:
> > >>>>
> > >>>> https://github.com/apache/incubator-gossip/compare/
> > >>> master...edwardcapriolo
> > >>>> :
> > >>>> GOSSIP-22?expand=1
> > >>>>
> > >>>> Leveraging the code here:
> > >>>>
> > >>>> https://github.com/arosien/failure
> > >>>>
> > >>>> I attempted to contact the author of failure (ASF V2) to see if he
> > >> wants
> > >>> to
> > >>>> contribute the code. (not in maven) We have other options like fork
> > and
> > >>>> package etc.
> > >>>>
> > >>>> Lets hold off the merge of this until after the release.
> > >>>>
> > >>>> Thanks,
> > >>>> Edward
> > >>>>
> > >>>> On Tue, Nov 15, 2016 at 10:42 PM, chandresh pancholi <
> > >>>> chandreshpancholi007@gmail.com <javascript:;>> wrote:
> > >>>>
> > >>>>> I will also look into it.
> > >>>>>
> > >>>>> On Wed, Nov 16, 2016 at 5:53 AM, Edward Capriolo <
> > >>> edlinuxguru@gmail.com <javascript:;>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> This seems interesting and low bar to entry:
> > >>>>>>
> > >>>>>> https://github.com/arosien/failure
> > >>>>>>
> > >>>>>> On Tue, Nov 15, 2016 at 4:01 PM, Edward Capriolo <
> > >>>> edlinuxguru@gmail.com <javascript:;>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> I was doing some load testing and I found the the current gating
> > >>>> factor
> > >>>>>>> for max instances running in the same JVM is limited by the JMX
> > >>> based
> > >>>>>>> notification system the failure detector uses.
> > >>>>>>>
> > >>>>>>> Currently a cluster of N requires N * (N-1) JMX notification
> > >>>> threads. I
> > >>>>>>> started attempting to remove this limit without going into
> > >> building
> > >>>> the
> > >>>>>>> accrual failure detector (22) but there were some nuanced bugs
> > >> and
> > >>> I
> > >>>>>> backed
> > >>>>>>> off because it did not seem worth the change.
> > >>>>>>>
> > >>>>>>> If anyone has an literature to contribute about building a
> > >>> consensus
> > >>>>>> based
> > >>>>>>> failure detector please discuss. Once we cut this release that is
> > >>>>> likely
> > >>>>>>> were I will spent my attention.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Edward
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Chandresh Pancholi
> > >>>>> Senior Software Engineer
> > >>>>> Flipkart.com
> > >>>>> Email-id:chandresh.pancholi@flipkart.com <javascript:;>
> > >>>>> Contact:08951803660
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Chandresh Pancholi
> > >> Senior Software Engineer
> > >> Flipkart.com
> > >> Email-id:chandresh.pancholi@flipkart.com <javascript:;>
> > >> Contact:08951803660
> > >>
> >
>

> There's not a lot of code there. Could it be reimplemented in gossip
> without infringing on any copyrights?

Yes. The paper details the algorithm (it is basically a one-liner).

>>
https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd

This is interesting. The "math" parts are similar in both projects.

Hama seems like a solid implementation. Some things I see as challenges:
The FD code is coupled to the network code, and for our purposes we only want
the logic.
In the future we probably want to track some kind of removed state: UP,
DOWN, REMOVED.

It is really nice that it is done using concurrent collections instead
of synchronized blocks.

@Chia-Hung looking this over I see some interesting bits:

I like how you can choose to be notified only about specific hosts, and how
the notification is done with a callback.
https://github.com/apache/hama/blob/master/core/src/main/java/org/apache/hama/monitor/fd/NodeEventListener.java

This is more feature-rich than our current notifications, where you can only
register a single listener and cannot pick which hosts to listen for.
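
As a rough illustration of what per-host callbacks could look like on our side,
here is a minimal sketch. It is neither Hama's NodeEventListener nor our current
notification code: NodeState, NodeStateListener, and ListenerRegistry are
hypothetical names, and the three states are simply the UP, DOWN, REMOVED idea
mentioned above. It uses concurrent collections rather than synchronized blocks.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical node states, following the UP, DOWN, REMOVED idea. */
enum NodeState { UP, DOWN, REMOVED }

/** Hypothetical callback invoked when a node changes state. */
interface NodeStateListener {
    void onChange(String host, NodeState state);
}

/** Hypothetical registry allowing many listeners, each optionally scoped to one host. */
class ListenerRegistry {
    private static final String ALL_HOSTS = "*";
    private final Map<String, List<NodeStateListener>> listeners = new ConcurrentHashMap<>();

    /** Listen for changes on every host. */
    void register(NodeStateListener listener) {
        register(ALL_HOSTS, listener);
    }

    /** Listen for changes on one specific host only. */
    void register(String host, NodeStateListener listener) {
        listeners.computeIfAbsent(host, h -> new CopyOnWriteArrayList<>()).add(listener);
    }

    /** Notify host-specific listeners first, then the catch-all listeners. */
    void fire(String host, NodeState state) {
        for (String key : new String[] { host, ALL_HOSTS }) {
            for (NodeStateListener l : listeners.getOrDefault(key, Collections.emptyList())) {
                l.onChange(host, state);
            }
        }
    }
}
```

Callbacks here run on whichever thread calls fire, so no notification thread is
needed per listener; whether that is acceptable depends on how slow listeners
are allowed to be.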

Obviously Hama's implementation is stable, but once we have a solid
release or two under our belt, maybe we can see if Hama users are comfortable
with leveraging what we are building.

Good stuff!

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

Posted by Chia-Hung Lin <cl...@googlemail.com>.
A shameless plug for code I wrote a long time ago. I never found a chance to
modularize it, but feel free to use it, as it is licensed under Apache 2.

[1].
https://github.com/apache/hama/tree/master/core/src/main/java/org/apache/hama/monitor/fd

On Friday, 2 December 2016, P. Taylor Goetz <pt...@gmail.com> wrote:

> There's not a lot of code there. Could it be reimplemented in gossip
> without infringing on any copyrights?
>
> -Taylor
>

Re: https://issues.apache.org/jira/browse/GOSSIP-22 and failure detector

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
There's not a lot of code there. Could it be reimplemented in gossip without infringing on any copyrights?

-Taylor
