Posted to user@zookeeper.apache.org by powell molleti <po...@yahoo.com.INVALID> on 2016/01/01 11:29:20 UTC

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Hi Janmejay,
Regarding question 1: if a node takes a lock and the lock has timed out from the system's perspective, that means other nodes are free to take the lock and work on the resource. Hence the history could be well into the future by the time the previous node discovers the time-out. Whether rollback is needed in this specific context depends on the implementation details: is the lock holder updating some common area? If so, there could be corruption, since other nodes are free to write in parallel with the first one. In the usual sense, a time-out of a held lock means the node which held the lock is dead. It is up to the implementation to ensure this, and, using this primitive, a timeout means the other nodes can be sure that no one else is working on the resource and hence can move forward.
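As a rough illustration of that check (a minimal sketch only; it assumes the lock is an ephemeral znode at a known path, and the path and class name are made up for the example), the holder can verify, before committing its result, that the lock znode still exists and still belongs to its own session:

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LockCheck {
    // True only if the ephemeral lock znode still exists and is owned by
    // this client's session, i.e. the lock has not been lost or re-acquired.
    static boolean stillHoldsLock(ZooKeeper zk, String lockPath)
            throws KeeperException, InterruptedException {
        Stat stat = zk.exists(lockPath, false);
        return stat != null && stat.getEphemeralOwner() == zk.getSessionId();
    }
}

Note that this only narrows the window: the session can still expire between the check and the commit, which is why idempotent or conditional updates come up later in this thread.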
Question 2 seems to assume that the leader has significant work to do and that leader change is quite common, which seems contrary to the common implementation pattern. If the work can be broken down into smaller chunks that need to be serialized separately, then each chunk/work type can have a different leader.
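For instance, here is a minimal sketch of one-leader-per-work-type election (assuming an election parent znode already exists per work type, e.g. /election/compaction; the paths and the candidate- prefix are illustrative, and a real recipe would watch its predecessor node rather than re-checking the children):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PerWorkTypeElection {
    // Create an ephemeral sequential candidate under the work type's election
    // znode; the candidate with the lowest sequence number leads that work type.
    static boolean tryBecomeLeader(ZooKeeper zk, String electionPath)
            throws KeeperException, InterruptedException {
        String me = zk.create(electionPath + "/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren(electionPath, false);
        Collections.sort(children);
        return me.endsWith(children.get(0));
    }
}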
For question 3, ZK does support auth and encryption for client connections, but not for inter-ZK-node channels. Do you have a requirement to secure the inter-ZK-node traffic? If so, please let us know what your requirements are so we can implement a solution that fits everyone's needs.
For question 4, the official client implementation is in C; people tend to wrap it with C++, and there are projects that use ZK this way. You can look them up and see whether you can separate the wrapper out and reuse it.
Hope this helps.
Powell.

 

    On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <ed...@huffingtonpost.com> wrote:
 

Q: What is the best way of handling distributed-lock expiry? The owner
of the lock managed to acquire it and may be in the middle of some
computation when the session expires or the lock expires.

If you are using Java, one way I can see of doing this is with
ExecutorCompletionService
(https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html).
It allows you to keep your workers in a group; you can poll the group and
provide cancel semantics as needed.
An example of that service is here:
https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
where I am issuing multiple reads and I want to abandon the process if they
do not complete within a certain time. Many async/promise frameworks do this by
launching two tasks, a ComputationTask and a TimeoutTask that returns in 10
seconds. Then they ask the completion service to poll. If the service
returns the TimeoutTask after the timeout, that means the computation did not
finish in time.
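A minimal self-contained sketch of that two-task pattern (the task bodies and the timeout values are placeholders, not the real Nibiru code):

import java.util.concurrent.*;

public class TimedComputation {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<String> completion = new ExecutorCompletionService<>(pool);

        // The real work; here it just pretends to compute for 2 seconds.
        Future<String> work = completion.submit(() -> {
            Thread.sleep(2_000);
            return "computed";
        });

        // Sentinel task that simply returns once the deadline has passed.
        completion.submit(() -> {
            Thread.sleep(10_000);
            return "timeout";
        });

        // take() hands back whichever task finished first; if it is the
        // sentinel, the computation missed its deadline and gets cancelled.
        if ("timeout".equals(completion.take().get())) {
            work.cancel(true);
        }
        pool.shutdownNow();
    }
}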

Do people generally take action in the middle of the computation (abort it and
do it in a clever way such that the effect appears atomic, so the abort is not really
visible; if so, what are some of those clever ways)?

The base issue is that Java's synchronized/AtomicReference do not have a
rollback.

There are a few ways I know of to work around this. Clojure has STM (Software
Transactional Memory) such that if an exception is thrown inside a dosync,
all of the changes inside the critical block never happened. This assumes
you are using all Clojure structures, which you probably are not.
A way co-workers have done this is as follows: move your entire
transactional state into a SINGLE big object that you can
copy/mutate/compare-and-swap. You never need to roll back each piece because
you are changing the clone up until the point you commit it.
Writing reversal code is possible depending on the problem. There are
questions to ask, like "what if the reversal somehow fails?"
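A minimal sketch of that single-snapshot compare-and-swap approach (the State class and its contents are purely illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotSwap {
    // The whole "transactional" state lives in one immutable snapshot object.
    static final class State {
        final Map<String, String> entries;
        State(Map<String, String> entries) { this.entries = Map.copyOf(entries); }
    }

    private final AtomicReference<State> current = new AtomicReference<>(new State(Map.of()));

    // Clone, mutate the clone, then try to swap it in; retry if another
    // thread committed first. Nothing needs rolling back because the shared
    // state only ever changes at the single compareAndSet point.
    public void put(String key, String value) {
        while (true) {
            State old = current.get();
            Map<String, String> copy = new HashMap<>(old.entries);
            copy.put(key, value);
            if (current.compareAndSet(old, new State(copy))) {
                return;
            }
        }
    }
}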




On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <si...@gmail.com>
wrote:

> Hi,
>
> Was wondering if there are any reference designs, patterns on handling
> common operations involving distributed coordination.
>
> I have a few questions and I guess they must have been asked before, I
> am unsure what to search for to surface the right answers. It'll be
> really valuable if someone can provide links to relevant
> "best-practices guide" or "suggestions" per question or share some
> wisdom or ideas on patterns to do this in the best way.
>
> 1. What is the best way of handling distributed-lock expiry? The owner
> of the lock managed to acquire it and may be in middle of some
> computation when the session expires or lock expires. When it finishes
> that computation, it can tell that the lock expired, but do people
> generally take action in middle of the computation (abort it and do it
> in a clever way such that effect appears atomic, so abort is not
> really visible, if so what are some of those clever ways)? Or is the
> right thing to do, is to write reversal-code, such that operations can
> be cleanly undone in case the verification at the end of computation
> shows that lock expired? The latter obviously is a lot harder to
> achieve.
>
> 2. Same as above for leader-election scenarios. Leader generally
> administers operations on data-systems that take significant time to
> complete and have significant resource overhead and RPC to administer
> such operations synchronously from leader to data-node can't be atomic
> and can't be made latency-resilient to such a degree that issuing
> operation across a large set of nodes on a cluster can be guaranteed
> to finish without leader-change. What do people generally do in such
> situations? How are timeouts for operations issued when operations are
> issued using sequential-znode as a per-datanode dedicated queue? How
> well does it scale, and what are some things to watch-out for
> (operation-size, encoding, clustering into one znode for atomicity
> etc)? Or how are atomic operations that need to be issued across
> multiple data-nodes managed (do they have to be clobbered into one
> znode)?
>
> 3. How do people secure zookeeper based services? Is
> client-certificate-verification the recommended way? How well does
> this work with C client? Is inter-zk-node communication done with
> X509-auth too?
>
> 4. What other projects, reference-implementations or libraries should
> I look at for working with C client?
>
> Most of what I have asked revolves around leader or lock-owner having
> a false-failure (where it doesn't know that coordinator thinks it has
> failed).
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
>


  

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by "singh.janmejay" <si...@gmail.com>.
*Bump*

On Tue, Jan 5, 2016 at 2:30 PM, singh.janmejay <si...@gmail.com> wrote:
> Thanks for the replies everyone, most of it was very useful.
>
> @Alexander: The section of the Chubby paper you pointed me to tries to
> address exactly what I was looking for. That clearly is one good way
> of doing it. I'm also thinking of an alternative way and could use a
> review or some feedback on that.
>
> @Powell: About x509 auth on intra-cluster communication, I don't have a
> blocking need for it, as it can be achieved by setting up firewall
> rules to accept only from desired hosts. It may be a good-to-have
> thing though, because in cloud-based scenarios where IP addresses are
> re-used, a recycled IP can still talk to a secure zk-cluster unless
> config is changed to remove the old peer IP and replace it with the
> new one. It's clearly a corner case, though.
>
> Here is the approach I'm thinking of:
> - Implement all operations (at least the master-triggered ones) on
> operand machines idempotently
> - Have master journal these operations to ZK before issuing RPC
> - In case master fails with some of these operations in flight, the
> newly elected master will need to read all issued (but not retired
> yet) operations and issue them again.
> - The existing master (before or after a failure) can retry and
> retire operations according to whatever the retry policy and
> success-criterion is.
>
> Why am I thinking of this as opposed to going with Chubby-style sequencer passing:
> - I need to implement idempotency regardless, because recovery-path
> involving master-death after successful execution of operation but
> before writing ack to coordination service requires it. So idempotent
> implementation complexity is here to stay.
> - It would increase the surface area of the architecture that is exposed
> to the coordination service (for sequencer validation), which may put a
> verification RPC in the data plane in some cases.
> - The sequencer may expire after verification but before ack, in which
> case new master may not recognize the operation as consistent with its
> decisions (or previous decision path).
>
> Thoughts? Suggestions?
>
>
>
> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com> wrote:
>> regarding atomic multi-znode updates -- check out "multi" updates
>> <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>
>> .
>>
>> On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com> wrote:
>>
>>> for 1, see the chubby paper
>>> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
>>> section 2.4.
>>> for 2, I'm not sure I fully understand the question, but essentially, ZK
>>> guarantees that consistency of updates is preserved even during failures.
>>> The user doesn't need to do anything in particular to guarantee this, even
>>> during leader failures. In such a case, some suffix of operations executed
>>> by the leader may be lost if they weren't previously acked by a majority.
>>> However, none of these operations could have been visible to reads.
>>>
>>> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
>>> powellm79@yahoo.com.invalid> wrote:
>>>
>>>> Hi Janmejay,
>>>> Regarding question 1, if a node takes a lock and the lock has timed-out
>>>> from system perspective then it can mean that other nodes are free to take
>>>> the lock and work on the resource. Hence the history could be well into the
>>>> future when the previous node discovers the time-out. The question of
>>>> rollback in the specific context depends on the implementation details, is
>>>> the lock holder updating some common area?, then there could be corruption
>>>> since other nodes are free to write in parallel to the first one?. In the
>>>> usual sense a time-out of lock held means the node which held the lock is
>>>> dead. It is upto the implementation to ensure this case and, using this
>>>> primitive, if there is a timeout which means other nodes are sure that no
>>>> one else is working on the resource and hence can move forward.
>>>> Question 2 seems to imply the assumption that leader has significant work
>>>> todo and leader change is quite common, which seems contrary to common
>>>> implementation pattern. If the work can be broken down into smaller chunks
>>>> which need serialization separately then each chunk/work type can have a
>>>> different leader.
>>>> For question 3, ZK does support auth and encryption for client
>>>> connections but not for inter ZK node channels. Do you have requirement to
>>>> secure inter ZK nodes, can you let us know what your requirements are so we
>>>> can implement a solution to fit all needs?.
>>>> For question 4 the official implementation is C, people tend to wrap that
>>>> with C++ and there should projects that use ZK doing that you can look them
>>>> up and see if you can separate it out and use them.
>>>> Hope this helps.Powell.
>>>>
>>>>
>>>>
>>>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
>>>> edward.capriolo@huffingtonpost.com> wrote:
>>>>
>>>>
>>>>  Q:What is the best way of handling distributed-lock expiry? The owner
>>>> of the lock managed to acquire it and may be in middle of some
>>>> computation when the session expires or lock expire
>>>>
>>>> If you are using Java a way I can see doing this is by using the
>>>> ExecutorCompletionService
>>>>
>>>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
>>>> .
>>>> It allows you to keep your workers in a group, You can poll the group and
>>>> provide cancel semantics as needed.
>>>> An example of that service is here:
>>>>
>>>> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
>>>> where I am issuing multiple reads and I want to abandon the process if
>>>> they
>>>> do not timeout in a while. Many async/promices frameworks do this by
>>>> launching two task ComputationTask and a TimeoutTask that returns in 10
>>>> seconds. Then they ask the completions service to poll. If the service is
>>>> given the TimoutTask after the timeout that means the Computation did not
>>>> finish in time.
>>>>
>>>> Do people generally take action in middle of the computation (abort it and
>>>> do itin a clever way such that effect appears atomic, so abort is
>>>> notreally
>>>> visible, if so what are some of those clever ways)?
>>>>
>>>> The base issue is java's synchronized/ AtomicReference do not have a
>>>> rollback.
>>>>
>>>> There are a few ways I know to work around this. Clojure has STM (software
>>>> Transactional Memory) such that if an exception is through inside a doSync
>>>> all of the stems inside the critical block never happened. This assumes
>>>> your using all clojure structures which you are probably not.
>>>> A way co workers have done this is as follows. Move your entire
>>>> transnational state into a SINGLE big object that you can
>>>> copy/mutate/compare and swap. You never need to rollback each piece
>>>> because
>>>> your changing the clone up until the point you commit it.
>>>> Writing reversal code is possible depending on the problem. There are
>>>> questions to ask like "what if the reversal somehow fails?"
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <singh.janmejay@gmail.com
>>>> >
>>>> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > Was wondering if there are any reference designs, patterns on handling
>>>> > common operations involving distributed coordination.
>>>> >
>>>> > I have a few questions and I guess they must have been asked before, I
>>>> > am unsure what to search for to surface the right answers. It'll be
>>>> > really valuable if someone can provide links to relevant
>>>> > "best-practices guide" or "suggestions" per question or share some
>>>> > wisdom or ideas on patterns to do this in the best way.
>>>> >
>>>> > 1. What is the best way of handling distributed-lock expiry? The owner
>>>> > of the lock managed to acquire it and may be in middle of some
>>>> > computation when the session expires or lock expires. When it finishes
>>>> > that computation, it can tell that the lock expired, but do people
>>>> > generally take action in middle of the computation (abort it and do it
>>>> > in a clever way such that effect appears atomic, so abort is not
>>>> > really visible, if so what are some of those clever ways)? Or is the
>>>> > right thing to do, is to write reversal-code, such that operations can
>>>> > be cleanly undone in case the verification at the end of computation
>>>> > shows that lock expired? The later obviously is a lot harder to
>>>> > achieve.
>>>> >
>>>> > 2. Same as above for leader-election scenarios. Leader generally
>>>> > administers operations on data-systems that take significant time to
>>>> > complete and have significant resource overhead and RPC to administer
>>>> > such operations synchronously from leader to data-node can't be atomic
>>>> > and can't be made latency-resilient to such a degree that issuing
>>>> > operation across a large set of nodes on a cluster can be guaranteed
>>>> > to finish without leader-change. What do people generally do in such
>>>> > situations? How are timeouts for operations issued when operations are
>>>> > issued using sequential-znode as a per-datanode dedicated queue? How
>>>> > well does it scale, and what are some things to watch-out for
>>>> > (operation-size, encoding, clustering into one znode for atomicity
>>>> > etc)? Or how are atomic operations that need to be issued across
>>>> > multiple data-nodes managed (do they have to be clobbered into one
>>>> > znode)?
>>>> >
>>>> > 3. How do people secure zookeeper based services? Is
>>>> > client-certificate-verification the recommended way? How well does
>>>> > this work with C client? Is inter-zk-node communication done with
>>>> > X509-auth too?
>>>> >
>>>> > 4. What other projects, reference-implementations or libraries should
>>>> > I look at for working with C client?
>>>> >
>>>> > Most of what I have asked revolves around leader or lock-owner having
>>>> > a false-failure (where it doesn't know that coordinator thinks it has
>>>> > failed).
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Janmejay
>>>> > http://codehunk.wordpress.com
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>
>
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com



-- 
Regards,
Janmejay
http://codehunk.wordpress.com

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by "singh.janmejay" <si...@gmail.com>.
@Martin: The master reacting to achieve an operation timeout makes sense. I
never thought of it that way and it seems really effective; I can't
see it failing in any scenario. The only scenario where it has a
(solvable) problem is when the lock-owner dies and never comes back, in
which case it will need manual assistance (someone confirming that the
lock-owner is dead for good) or a very high-level control like an API
to kill the machine which is supposed to be running the lock owner.
About work distribution: I have a stateful set of nodes, so the
traditional queue model assuming all workers are equal doesn't work. A
dedicated queue per worker would, but that would still expose a larger
surface area of the architecture to the coordination service and would
increase the scaling burden on ZooKeeper. It doesn't really simplify the
idempotency requirement anyway, so it's equal to RPC-triggered work in
implementation complexity.


On Wed, Jan 13, 2016 at 4:51 PM, Martin Kersten
<ma...@gmail.com> wrote:
> Hello,
>
>    I am not quite aware of the ZooKeeper specialities, but from my point of
> view you should think about distributing the task as well, e.g. having
> multiple nodes responsible for a task; if one node
> fails, the other nodes take over and perform/complete the task instead. This
> would involve having beacon messages and a place where you can put your code.
>
> Locks that time out should actually never happen, since that makes everything
> go boom and become overly complex. Just bind a lock to the liveness of
> the node. So if the node is considered dead, free the lock. If the node is
> a kind of zombie (not reacting (stuck) but responsive in terms of sending
> i-am-alive beacons (heartbeats)), it is the task of the leader to kill the
> node remotely, or remove the node from the list of members and inform everyone
> else about it. Once this happens, this would also revoke the lock.
>
> The goal is to simply let the leader kill any node that seems to be
> malfunctioning in any possible way (like missing a deadline). A node that
> wants to complete its operation needs to interact with the leader, and at
> this point in time the node should realize if it was considered dead and
> should restart by crashing and rebooting instantly.
>
> Another thing you might consider is to renew the lock periodically.
> If you have a workflow and your lock times out in 10 minutes, then every time
> you make real progress in your workflow, renew the lock, giving yourself another
> 10 minutes to do the next steps.
>
> This way (as long as you do not have a loop in the workflow) you are safe
> in assuming that the workflow will be completed in the future. If you need a
> hard deadline, the node processing the operation might as well check the
> estimate of the workflow, drop the lock and abort the operation if it
> estimates the operation is likely to time out, and perhaps even perform a
> compensation operation.
>
>
> Cheers,
>
> Martin (Kersten)
>
>
>
> 2016-01-13 11:08 GMT+01:00 singh.janmejay <si...@gmail.com>:
>
>> @Alexander: In that scenario, write of X will be attempted by A, but
>> external system will not act upon write-X because that operation has
>> already been acted upon in the past. This is guaranteed by idempotent
>> operations invariant. But it does point out another problem, which I
>> hadn't handled in my original algorithm. Problem: If X and Y have both
>> not been issued yet, and if Y is issued before X towards external
>> system, because neither operations have executed yet, it'll overwrite
>> Y with X. I need another constraint, master should only issue 1
>> operation on a certain external-system at a time and must issue
>> operations in the order of operation-id (sequential-znode sequence
>> number). So we need the following invariants:
>> - order of issuing operation being fixed (matching order of creation
>> of operations)
>> - concurrency of operation fixed to 1
>> - idempotent execution on external-system side
>>
>> @Powell: I'm kind of doing the same thing, except the loop doesn't run
>> on consumer, instead it runs on master, which is assigning work to
>> consumers. So triggerWork function is basically changed to issueWork,
>> which is RPC + triggerWork. The replay of history is basically just
>> a replay of 1 operation per operand-node (in this thread we are calling
>> it the external-system), so it's as if triggerWork failed, in which case we
>> need to re-execute triggerWork. Idempotency also follows from that
>> requirement. If triggerWork fails in the last step, and all the
>> desired effect that was necessary has happened, we will still need to
>> run triggerWork again, but we need awareness that actual work has been
>> done, which is why idempotency is necessary.
>>
>> Btw, thanks for continuing to spare time for this, I really appreciate
>> this feedback/validation.
>>
>> Thoughts?
>>
>> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti
>> <po...@yahoo.com.invalid> wrote:
>> > Wouldn't a distributed queue recipe work for the consumer, where one needs
>> > to add extra logic, something like this:
>> >
>> > with lock() {
>> >     p = queue.peek()
>> >     if triggerWork(p) is Done:
>> >         queue.pop()
>> > }
>> >
>> > With this a consumer that worked on it but crashed before popping the
>> queue would result in another consumer processing the same work.
>> >
>> > I am not sure about the details of where you are getting the work from
>> > and what the scale of it is, but the producers (the leader) could write to this queue.
>> > Since there is a guarantee of read-after-write, a producer could delete from
>> > its local queue the work that was successfully queued. Again, a new
>> > producer could re-send the last entry of work, so one has to handle that.
>> > Without more details on the origin of the work etc. it's hard to design end to
>> > end.
>> >
>> > I do not see a need to do a total replay of past history etc. when using a
>> > ZK-like system, because ZK is built on the idea of a serialized and replicated
>> > log; hence if you are using ZK then your design should be much simpler, i.e.
>> > fail and re-start from the last known transaction.
>> >
>> > Powell.
>> >
>> >
>> >
>> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <
>> shralex@gmail.com> wrote:
>> > Hi,
>> >
>> > With your suggestion, the following scenario seems possible: master A is
>> > about to write value X to an external system so it logs it to ZK, then
>> > freezes for some time, ZK suspects it as failed, another master B is
>> > elected, writes X (completing what A wanted to do)
>> > then starts doing something else and writes Y. Then leader A "wakes up"
>> and
>> > re-executes the old operation writing X which is now stale.
>> >
>> > perhaps if your external system supports conditional updates this can be
>> > avoided - a write of X only works if the current state is as expected.
>> >
>> > Alex
>> >
>> >
>> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <singh.janmejay@gmail.com
>> >
>> > wrote:
>> >
>> >> Thanks for the replies everyone, most of it was very useful.
>> >>
>> >> @Alexander: The section of chubby paper you pointed me to tries to
>> >> address exactly what I was looking for. That clearly is one good way
>> >> of doing it. Im also thinking of an alternative way and can use a
>> >> review or some feedback on that.
>> >>
>> >> @Powel: About x509 auth on intra-cluster communication, I don't have a
>> >> blocking need for it, as it can be achieved by setting up firewall
>> >> rules to accept only from desired hosts. It may be a good-to-have
>> >> thing though, because in cloud-based scenarios where IP addresses are
>> >> re-used, a recycled IP can still talk to a secure zk-cluster unless
>> >> config is changed to remove the old peer IP and replace it with the
>> >> new one. Its clearly a corner-case though.
>> >>
>> >> Here is the approach Im thinking of:
>> >> - Implement all operations(atleast master-triggered operations) on
>> >> operand machines idempotently
>> >> - Have master journal these operations to ZK before issuing RPC
>> >> - In case master fails with some of these operations in flight, the
>> >> newly elected master will need to read all issued (but not retired
>> >> yet) operations and issue them again.
>> >> - Existing master(before failure or after failure) can retry and
>> >> retire operations according to whatever the retry policy and
>> >> success-criterion is.
>> >>
>> >> Why am I thinking of this as opposed to going with chubby sequencer
>> >> passing:
>> >> - I need to implement idempotency regardless, because recovery-path
>> >> involving master-death after successful execution of operation but
>> >> before writing ack to coordination service requires it. So idempotent
>> >> implementation complexity is here to stay.
>> >> - I need to increase surface-area of the architecture which is exposed
>> >> to coordination-service for sequencer validation. Which may bring
>> >> verification RPC in data-plane in some cases.
>> >> - The sequencer may expire after verification but before ack, in which
>> >> case new master may not recognize the operation as consistent with its
>> >> decisions (or previous decision path).
>> >>
>> >> Thoughts? Suggestions?
>> >>
>> >>
>> >>
>> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com>
>> >> wrote:
>> >> > regarding atomic multi-znode updates -- check out "multi" updates
>> >> > <
>> >>
>> http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html
>> >> >
>> >> > .
>> >> >
>> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> for 1, see the chubby paper
>> >> >> <
>> >>
>> http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
>> >> >,
>> >> >> section 2.4.
>> >> >> for 2, I'm not sure I fully understand the question, but
>> essentially, ZK
>> >> >> guarantees that even during failures
>> >> >> consistency of updates is preserved. The user doesn't need to do
>> >> anything
>> >> >> in particular to guarantee this, even
>> >> >> during leader failures. In such case, some suffix of operations
>> executed
>> >> >> by the leader may be lost if they weren't
>> >> >> previously acked by a majority.However, none of these operations
>> could
>> >> >> have been visible
>> >> >> to reads.
>> >> >>
>> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
>> >> >> powellm79@yahoo.com.invalid> wrote:
>> >> >>
>> >> >>> Hi Janmejay,
>> >> >>> Regarding question 1, if a node takes a lock and the lock has
>> timed-out
>> >> >>> from system perspective then it can mean that other nodes are free
>> to
>> >> take
>> >> >>> the lock and work on the resource. Hence the history could be well
>> >> into the
>> >> >>> future when the previous node discovers the time-out. The question
>> of
>> >> >>> rollback in the specific context depends on the implementation
>> >> details, is
>> >> >>> the lock holder updating some common area?, then there could be
>> >> corruption
>> >> >>> since other nodes are free to write in parallel to the first one?.
>> In
>> >> the
>> >> >>> usual sense a time-out of lock held means the node which held the
>> lock
>> >> is
>> >> >>> dead. It is upto the implementation to ensure this case and, using
>> this
>> >> >>> primitive, if there is a timeout which means other nodes are sure
>> that
>> >> no
>> >> >>> one else is working on the resource and hence can move forward.
>> >> >>> Question 2 seems to imply the assumption that leader has significant
>> >> work
>> >> >>> todo and leader change is quite common, which seems contrary to
>> common
>> >> >>> implementation pattern. If the work can be broken down into smaller
>> >> chunks
>> >> >>> which need serialization separately then each chunk/work type can
>> have
>> >> a
>> >> >>> different leader.
>> >> >>> For question 3, ZK does support auth and encryption for client
>> >> >>> connections but not for inter ZK node channels. Do you have
>> >> requirement to
>> >> >>> secure inter ZK nodes, can you let us know what your requirements
>> are
>> >> so we
>> >> >>> can implement a solution to fit all needs?.
>> >> >>> For question 4 the official implementation is C, people tend to wrap
>> >> that
>> >> >>> with C++ and there should projects that use ZK doing that you can
>> look
>> >> them
>> >> >>> up and see if you can separate it out and use them.
>> >> >>> Hope this helps.Powell.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
>> >> >>> edward.capriolo@huffingtonpost.com> wrote:
>> >> >>>
>> >> >>>
>> >> >>>  Q:What is the best way of handling distributed-lock expiry? The
>> owner
>> >> >>> of the lock managed to acquire it and may be in middle of some
>> >> >>> computation when the session expires or lock expire
>> >> >>>
>> >> >>> If you are using Java a way I can see doing this is by using the
>> >> >>> ExecutorCompletionService
>> >> >>>
>> >> >>>
>> >>
>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
>> >> >>> .
>> >> >>> It allows you to keep your workers in a group, You can poll the
>> group
>> >> and
>> >> >>> provide cancel semantics as needed.
>> >> >>> An example of that service is here:
>> >> >>>
>> >> >>>
>> >>
>> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
>> >> >>> where I am issuing multiple reads and I want to abandon the process
>> if
>> >> >>> they
>> >> >>> do not timeout in a while. Many async/promices frameworks do this by
>> >> >>> launching two task ComputationTask and a TimeoutTask that returns
>> in 10
>> >> >>> seconds. Then they ask the completions service to poll. If the
>> service
>> >> is
>> >> >>> given the TimoutTask after the timeout that means the Computation
>> did
>> >> not
>> >> >>> finish in time.
>> >> >>>
>> >> >>> Do people generally take action in middle of the computation (abort
>> it
>> >> and
>> >> >>> do itin a clever way such that effect appears atomic, so abort is
>> >> >>> notreally
>> >> >>> visible, if so what are some of those clever ways)?
>> >> >>>
>> >> >>> The base issue is java's synchronized/ AtomicReference do not have a
>> >> >>> rollback.
>> >> >>>
>> >> >>> There are a few ways I know to work around this. Clojure has STM
>> >> (software
>> >> >>> Transactional Memory) such that if an exception is through inside a
>> >> doSync
>> >> >>> all of the stems inside the critical block never happened. This
>> assumes
>> >> >>> your using all clojure structures which you are probably not.
>> >> >>> A way co workers have done this is as follows. Move your entire
>> >> >>> transnational state into a SINGLE big object that you can
>> >> >>> copy/mutate/compare and swap. You never need to rollback each piece
>> >> >>> because
>> >> >>> your changing the clone up until the point you commit it.
>> >> >>> Writing reversal code is possible depending on the problem. There
>> are
>> >> >>> questions to ask like "what if the reversal somehow fails?"
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <
>> >> singh.janmejay@gmail.com
>> >> >>> >
>> >> >>> wrote:
>> >> >>>
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > Was wondering if there are any reference designs, patterns on
>> >> handling
>> >> >>> > common operations involving distributed coordination.
>> >> >>> >
>> >> >>> > I have a few questions and I guess they must have been asked
>> before,
>> >> I
>> >> >>> > am unsure what to search for to surface the right answers. It'll
>> be
>> >> >>> > really valuable if someone can provide links to relevant
>> >> >>> > "best-practices guide" or "suggestions" per question or share some
>> >> >>> > wisdom or ideas on patterns to do this in the best way.
>> >> >>> >
>> >> >>> > 1. What is the best way of handling distributed-lock expiry? The
>> >> owner
>> >> >>> > of the lock managed to acquire it and may be in middle of some
>> >> >>> > computation when the session expires or lock expires. When it
>> >> finishes
>> >> >>> > that computation, it can tell that the lock expired, but do people
>> >> >>> > generally take action in middle of the computation (abort it and
>> do
>> >> it
>> >> >>> > in a clever way such that effect appears atomic, so abort is not
>> >> >>> > really visible, if so what are some of those clever ways)? Or is
>> the
>> >> >>> > right thing to do, is to write reversal-code, such that operations
>> >> can
>> >> >>> > be cleanly undone in case the verification at the end of
>> computation
>> >> >>> > shows that lock expired? The later obviously is a lot harder to
>> >> >>> > achieve.
>> >> >>> >
>> >> >>> > 2. Same as above for leader-election scenarios. Leader generally
>> >> >>> > administers operations on data-systems that take significant time
>> to
>> >> >>> > complete and have significant resource overhead and RPC to
>> administer
>> >> >>> > such operations synchronously from leader to data-node can't be
>> >> atomic
>> >> >>> > and can't be made latency-resilient to such a degree that issuing
>> >> >>> > operation across a large set of nodes on a cluster can be
>> guaranteed
>> >> >>> > to finish without leader-change. What do people generally do in
>> such
>> >> >>> > situations? How are timeouts for operations issued when operations
>> >> are
>> >> >>> > issued using sequential-znode as a per-datanode dedicated queue?
>> How
>> >> >>> > well does it scale, and what are some things to watch-out for
>> >> >>> > (operation-size, encoding, clustering into one znode for atomicity
>> >> >>> > etc)? Or how are atomic operations that need to be issued across
>> >> >>> > multiple data-nodes managed (do they have to be clobbered into one
>> >> >>> > znode)?
>> >> >>> >
>> >> >>> > 3. How do people secure zookeeper based services? Is
>> >> >>> > client-certificate-verification the recommended way? How well does
>> >> >>> > this work with C client? Is inter-zk-node communication done with
>> >> >>> > X509-auth too?
>> >> >>> >
>> >> >>> > 4. What other projects, reference-implementations or libraries
>> should
>> >> >>> > I look at for working with C client?
>> >> >>> >
>> >> >>> > Most of what I have asked revolves around leader or lock-owner
>> having
>> >> >>> > a false-failure (where it doesn't know that coordinator thinks it
>> has
>> >> >>> > failed).
>> >> >>> >
>> >> >>> > --
>> >> >>> > Regards,
>> >> >>> > Janmejay
>> >> >>> > http://codehunk.wordpress.com
>> >> >>> >
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Janmejay
>> >> http://codehunk.wordpress.com
>> >>
>>
>>
>>
>> --
>> Regards,
>> Janmejay
>> http://codehunk.wordpress.com
>>



-- 
Regards,
Janmejay
http://codehunk.wordpress.com

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by Martin Kersten <ma...@gmail.com>.
Hello,

   I am not quite aware of the ZooKeeper specialities, but from my point of
view you should think about distributing the task as well, e.g. having
multiple nodes responsible for a task; if one node
fails, the other nodes take over and perform/complete the task instead. This
would involve having beacon messages and a place where you can put your code.

Locks that time out should actually never happen, since that makes everything
go boom and become overly complex. Just bind a lock to the liveness of
the node. So if the node is considered dead, free the lock. If the node is
a kind of zombie (not reacting (stuck) but responsive in terms of sending
i-am-alive beacons (heartbeats)), it is the task of the leader to kill the
node remotely, or remove the node from the list of members and inform everyone
else about it. Once this happens, this would also revoke the lock.

The goal is to simply let the leader kill any node that seems to be
malfunctioning in any possible way (like missing a deadline). A node that
wants to complete its operation needs to interact with the leader, and at
this point in time the node should realize if it was considered dead and
should restart by crashing and rebooting instantly.

Another thing you might consider is to renew the lock periodically.
If you have a workflow and your lock times out in 10 minutes, then every time
you make real progress in your workflow, renew the lock, giving yourself another
10 minutes to do the next steps.

This way (as long as you do not have a loop in the workflow) you are safe
in assuming that the workflow will be completed in the future. If you need a
hard deadline, the node processing the operation might as well check the
estimate of the workflow, drop the lock and abort the operation if it
estimates the operation is likely to time out, and perhaps even perform a
compensation operation.
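A minimal local sketch of that renew-on-progress idea (purely illustrative: in a ZooKeeper-based setup the renewal would refresh the lock znode or a lease record rather than a local timestamp, and the class and method names here are made up):

import java.util.concurrent.TimeUnit;

public class RenewOnProgress {
    // A lease the worker extends each time it completes a workflow step.
    static final class Lease {
        private final long leaseMillis;
        private volatile long expiresAtMillis;
        Lease(long leaseMillis) { this.leaseMillis = leaseMillis; renew(); }
        void renew()      { expiresAtMillis = System.currentTimeMillis() + leaseMillis; }
        boolean expired() { return System.currentTimeMillis() > expiresAtMillis; }
    }

    static void runSteps(Iterable<Runnable> workflowSteps) {
        Lease lease = new Lease(TimeUnit.MINUTES.toMillis(10));
        for (Runnable step : workflowSteps) {
            if (lease.expired()) {
                // Deadline missed: abort here (and possibly run a compensation step).
                throw new IllegalStateException("lease expired, aborting workflow");
            }
            step.run();
            lease.renew(); // real progress was made: buy another lease period
        }
    }
}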


Cheers,

Martin (Kersten)



2016-01-13 11:08 GMT+01:00 singh.janmejay <si...@gmail.com>:

> @Alexander: In that scenario, write of X will be attempted by A, but
> external system will not act upon write-X because that operation has
> already been acted upon in the past. This is guaranteed by idempotent
> operations invariant. But it does point out another problem, which I
> hadn't handled in my original algorithm. Problem: If X and Y have both
> not been issued yet, and if Y is issued before X towards external
> system, because neither operations have executed yet, it'll overwrite
> Y with X. I need another constraint, master should only issue 1
> operation on a certain external-system at a time and must issue
> operations in the order of operation-id (sequential-znode sequence
> number). So we need the following invariants:
> - order of issuing operation being fixed (matching order of creation
> of operations)
> - concurrency of operation fixed to 1
> - idempotent execution on external-system side
>
> @Powell: Im kind of doing the same thing. Except the loop doesn't run
> on consumer, instead it runs on master, which is assigning work to
> consumers. So triggerWork function is basically changed to issueWork,
> which is RPC + triggerWork. The replay if history is basically just
> replay of 1 operation per operand-node (in this thread we are calling
> it external-system), so its as if triggerWork failed, in which case we
> need to re-execute triggerWork. Idempotency also follows from that
> requirement. If triggerWork fails in the last step, and all the
> desired effect that was necessary has happened, we will still need to
> run triggerWork again, but we need awareness that actual work has been
> done, which is why idempotency is necessary.
>
> Btw, thanks for continuing to spare time for this, I really appreciate
> this feedback/validation.
>
> Thoughts?
>
> On Wed, Jan 13, 2016 at 3:47 AM, powell molleti
> <po...@yahoo.com.invalid> wrote:
> > Wouldn't a distributed queue recipe for consumer work?. Where one needs
> to add extra logic something like this:
> >
> > with lock() {
> >     p = queue.peek()
> >     if triggerWork(p) is Done:
> >         queue.pop()
> > }
> >
> > With this a consumer that worked on it but crashed before popping the
> queue would result in another consumer processing the same work.
> >
> > I am not sure with the details of where you are getting the work from
> and the scale of it is but producers(leader) could write to this queue.
> Since there is guarantee of read after write , producer could delete from
> its local queue the work that was successfully queued. Hence again new
> producer could re-send the last entry of work so one has to handle that.
> Without more details on the origin of work etc its hard to design end to
> end.
> >
> > I do not see a need to do a total replay of past history etc when using
> ZK like system because ZK is built on idea of serialized and replicated
> log, hence if you are using ZK then your design should be much simpler i.e
> fail and re-start from last know transaction.
> >
> > Powell.
> >
> >
> >
> > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <
> shralex@gmail.com> wrote:
> > Hi,
> >
> > With your suggestion, the following scenario seems possible: master A is
> > about to write value X to an external system so it logs it to ZK, then
> > freezes for some time, ZK suspects it as failed, another master B is
> > elected, writes X (completing what A wanted to do)
> > then starts doing something else and writes Y. Then leader A "wakes up"
> and
> > re-executes the old operation writing X which is now stale.
> >
> > perhaps if your external system supports conditional updates this can be
> > avoided - a write of X only works if the current state is as expected.
> >
> > Alex
> >
> >
> > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <singh.janmejay@gmail.com
> >
> > wrote:
> >
> >> Thanks for the replies everyone, most of it was very useful.
> >>
> >> @Alexander: The section of chubby paper you pointed me to tries to
> >> address exactly what I was looking for. That clearly is one good way
> >> of doing it. Im also thinking of an alternative way and can use a
> >> review or some feedback on that.
> >>
> >> @Powel: About x509 auth on intra-cluster communication, I don't have a
> >> blocking need for it, as it can be achieved by setting up firewall
> >> rules to accept only from desired hosts. It may be a good-to-have
> >> thing though, because in cloud-based scenarios where IP addresses are
> >> re-used, a recycled IP can still talk to a secure zk-cluster unless
> >> config is changed to remove the old peer IP and replace it with the
> >> new one. Its clearly a corner-case though.
> >>
> >> Here is the approach Im thinking of:
> >> - Implement all operations(atleast master-triggered operations) on
> >> operand machines idempotently
> >> - Have master journal these operations to ZK before issuing RPC
> >> - In case master fails with some of these operations in flight, the
> >> newly elected master will need to read all issued (but not retired
> >> yet) operations and issue them again.
> >> - Existing master(before failure or after failure) can retry and
> >> retire operations according to whatever the retry policy and
> >> success-criterion is.
> >>
> >> Why am I thinking of this as opposed to going with chubby sequencer
> >> passing:
> >> - I need to implement idempotency regardless, because recovery-path
> >> involving master-death after successful execution of operation but
> >> before writing ack to coordination service requires it. So idempotent
> >> implementation complexity is here to stay.
> >> - I need to increase surface-area of the architecture which is exposed
> >> to coordination-service for sequencer validation. Which may bring
> >> verification RPC in data-plane in some cases.
> >> - The sequencer may expire after verification but before ack, in which
> >> case new master may not recognize the operation as consistent with its
> >> decisions (or previous decision path).
> >>
> >> Thoughts? Suggestions?
> >>
> >>
> >>
> >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com>
> >> wrote:
> >> > regarding atomic multi-znode updates -- check out "multi" updates
> >> > <
> >>
> http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html
> >> >
> >> > .
> >> >
> >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com>
> >> wrote:
> >> >
> >> >> for 1, see the chubby paper
> >> >> <
> >>
> http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
> >> >,
> >> >> section 2.4.
> >> >> for 2, I'm not sure I fully understand the question, but
> essentially, ZK
> >> >> guarantees that even during failures
> >> >> consistency of updates is preserved. The user doesn't need to do
> >> anything
> >> >> in particular to guarantee this, even
> >> >> during leader failures. In such case, some suffix of operations
> executed
> >> >> by the leader may be lost if they weren't
> >> >> previously acked by a majority.However, none of these operations
> could
> >> >> have been visible
> >> >> to reads.
> >> >>
> >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> >> >> powellm79@yahoo.com.invalid> wrote:
> >> >>
> >> >>> Hi Janmejay,
> >> >>> Regarding question 1, if a node takes a lock and the lock has
> timed-out
> >> >>> from system perspective then it can mean that other nodes are free
> to
> >> take
> >> >>> the lock and work on the resource. Hence the history could be well
> >> into the
> >> >>> future when the previous node discovers the time-out. The question
> of
> >> >>> rollback in the specific context depends on the implementation
> >> details, is
> >> >>> the lock holder updating some common area?, then there could be
> >> corruption
> >> >>> since other nodes are free to write in parallel to the first one?.
> In
> >> the
> >> >>> usual sense a time-out of lock held means the node which held the
> lock
> >> is
> >> >>> dead. It is upto the implementation to ensure this case and, using
> this
> >> >>> primitive, if there is a timeout which means other nodes are sure
> that
> >> no
> >> >>> one else is working on the resource and hence can move forward.
> >> >>> Question 2 seems to imply the assumption that leader has significant
> >> work
> >> >>> todo and leader change is quite common, which seems contrary to
> common
> >> >>> implementation pattern. If the work can be broken down into smaller
> >> chunks
> >> >>> which need serialization separately then each chunk/work type can
> have
> >> a
> >> >>> different leader.
> >> >>> For question 3, ZK does support auth and encryption for client
> >> >>> connections but not for inter ZK node channels. Do you have
> >> requirement to
> >> >>> secure inter ZK nodes, can you let us know what your requirements
> are
> >> so we
> >> >>> can implement a solution to fit all needs?.
> >> >>> For question 4 the official implementation is C, people tend to wrap
> >> that
> >> >>> with C++ and there should projects that use ZK doing that you can
> look
> >> them
> >> >>> up and see if you can separate it out and use them.
> >> >>> Hope this helps.Powell.
> >> >>>
> >> >>>
> >> >>>
> >> >>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> >> >>> edward.capriolo@huffingtonpost.com> wrote:
> >> >>>
> >> >>>
> >> >>>  Q:What is the best way of handling distributed-lock expiry? The
> owner
> >> >>> of the lock managed to acquire it and may be in middle of some
> >> >>> computation when the session expires or lock expire
> >> >>>
> >> >>> If you are using Java a way I can see doing this is by using the
> >> >>> ExecutorCompletionService
> >> >>>
> >> >>>
> >>
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> >> >>> .
> >> >>> It allows you to keep your workers in a group, You can poll the
> group
> >> and
> >> >>> provide cancel semantics as needed.
> >> >>> An example of that service is here:
> >> >>>
> >> >>>
> >>
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> >> >>> where I am issuing multiple reads and I want to abandon the process
> if
> >> >>> they
> >> >>> do not timeout in a while. Many async/promices frameworks do this by
> >> >>> launching two task ComputationTask and a TimeoutTask that returns
> in 10
> >> >>> seconds. Then they ask the completions service to poll. If the
> service
> >> is
> >> >>> given the TimoutTask after the timeout that means the Computation
> did
> >> not
> >> >>> finish in time.
> >> >>>
> >> >>> Do people generally take action in middle of the computation (abort
> it
> >> and
> >> >>> do itin a clever way such that effect appears atomic, so abort is
> >> >>> notreally
> >> >>> visible, if so what are some of those clever ways)?
> >> >>>
> >> >>> The base issue is java's synchronized/ AtomicReference do not have a
> >> >>> rollback.
> >> >>>
> >> >>> There are a few ways I know to work around this. Clojure has STM
> >> (software
> >> >>> Transactional Memory) such that if an exception is through inside a
> >> doSync
> >> >>> all of the stems inside the critical block never happened. This
> assumes
> >> >>> your using all clojure structures which you are probably not.
> >> >>> A way co workers have done this is as follows. Move your entire
> >> >>> transnational state into a SINGLE big object that you can
> >> >>> copy/mutate/compare and swap. You never need to rollback each piece
> >> >>> because
> >> >>> your changing the clone up until the point you commit it.
> >> >>> Writing reversal code is possible depending on the problem. There
> are
> >> >>> questions to ask like "what if the reversal somehow fails?"
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <
> >> singh.janmejay@gmail.com
> >> >>> >
> >> >>> wrote:
> >> >>>
> >> >>> > Hi,
> >> >>> >
> >> >>> > Was wondering if there are any reference designs, patterns on
> >> handling
> >> >>> > common operations involving distributed coordination.
> >> >>> >
> >> >>> > I have a few questions and I guess they must have been asked
> before,
> >> I
> >> >>> > am unsure what to search for to surface the right answers. It'll
> be
> >> >>> > really valuable if someone can provide links to relevant
> >> >>> > "best-practices guide" or "suggestions" per question or share some
> >> >>> > wisdom or ideas on patterns to do this in the best way.
> >> >>> >
> >> >>> > 1. What is the best way of handling distributed-lock expiry? The
> >> owner
> >> >>> > of the lock managed to acquire it and may be in middle of some
> >> >>> > computation when the session expires or lock expires. When it
> >> finishes
> >> >>> > that computation, it can tell that the lock expired, but do people
> >> >>> > generally take action in middle of the computation (abort it and
> do
> >> it
> >> >>> > in a clever way such that effect appears atomic, so abort is not
> >> >>> > really visible, if so what are some of those clever ways)? Or is
> the
> >> >>> > right thing to do, is to write reversal-code, such that operations
> >> can
> >> >>> > be cleanly undone in case the verification at the end of
> computation
> >> >>> > shows that lock expired? The later obviously is a lot harder to
> >> >>> > achieve.
> >> >>> >
> >> >>> > 2. Same as above for leader-election scenarios. Leader generally
> >> >>> > administers operations on data-systems that take significant time
> to
> >> >>> > complete and have significant resource overhead and RPC to
> administer
> >> >>> > such operations synchronously from leader to data-node can't be
> >> atomic
> >> >>> > and can't be made latency-resilient to such a degree that issuing
> >> >>> > operation across a large set of nodes on a cluster can be
> guaranteed
> >> >>> > to finish without leader-change. What do people generally do in
> such
> >> >>> > situations? How are timeouts for operations issued when operations
> >> are
> >> >>> > issued using sequential-znode as a per-datanode dedicated queue?
> How
> >> >>> > well does it scale, and what are some things to watch-out for
> >> >>> > (operation-size, encoding, clustering into one znode for atomicity
> >> >>> > etc)? Or how are atomic operations that need to be issued across
> >> >>> > multiple data-nodes managed (do they have to be clobbered into one
> >> >>> > znode)?
> >> >>> >
> >> >>> > 3. How do people secure zookeeper based services? Is
> >> >>> > client-certificate-verification the recommended way? How well does
> >> >>> > this work with C client? Is inter-zk-node communication done with
> >> >>> > X509-auth too?
> >> >>> >
> >> >>> > 4. What other projects, reference-implementations or libraries
> should
> >> >>> > I look at for working with C client?
> >> >>> >
> >> >>> > Most of what I have asked revolves around leader or lock-owner
> having
> >> >>> > a false-failure (where it doesn't know that coordinator thinks it
> has
> >> >>> > failed).
> >> >>> >
> >> >>> > --
> >> >>> > Regards,
> >> >>> > Janmejay
> >> >>> > http://codehunk.wordpress.com
> >> >>> >
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Janmejay
> >> http://codehunk.wordpress.com
> >>
>
>
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
>

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by "singh.janmejay" <si...@gmail.com>.
Yes, that will depend on the way idempotency is implemented. The way I plan
to implement it is by using a monotonically increasing operation-id. Any
operation with an id lower than the last op-id will be identified as stale and
will not be executed. Because only one op is executed at a time, and ops
are executed in absolute order, op-id-level identification of staleness is
sufficient.
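A minimal sketch of that staleness filter (names are illustrative; it assumes, as described above, that ops are issued one at a time and in op-id order, so a single counter on the external-system side is enough):

import java.util.concurrent.atomic.AtomicLong;

public class IdempotentApplier {
    // Highest operation id this external system has already applied.
    private final AtomicLong lastAppliedOpId = new AtomicLong(-1);

    // Returns true if the operation was applied, false if it was stale or a
    // duplicate (id at or below the last applied id) and was skipped.
    public boolean apply(long opId, Runnable operation) {
        if (opId <= lastAppliedOpId.get()) {
            return false; // idempotent no-op
        }
        operation.run();
        lastAppliedOpId.set(opId);
        return true;
    }
}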

--
Regards,
Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft
keyboard sporting its not-so-smart-assist technology.

On Jan 13, 2016 11:49 PM, "Alexander Shraer" <sh...@gmail.com> wrote:

> I may be wrong but I don't think that being idempotent gives you what you
> said. Just because f(f(x))=f(x) doesn't mean that f(g(f(x))) = g(f(x)) --
> this was my example. But if your system can detect that X was already
> executed (or if the operations are conditional on state) my scenario indeed
> can't happen.
>
>
> On Wed, Jan 13, 2016 at 2:08 AM, singh.janmejay <si...@gmail.com>
> wrote:
>
> > @Alexander: In that scenario, write of X will be attempted by A, but
> > external system will not act upon write-X because that operation has
> > already been acted upon in the past. This is guaranteed by idempotent
> > operations invariant. But it does point out another problem, which I
> > hadn't handled in my original algorithm. Problem: If X and Y have both
> > not been issued yet, and if Y is issued before X towards external
> > system, because neither operations have executed yet, it'll overwrite
> > Y with X. I need another constraint, master should only issue 1
> > operation on a certain external-system at a time and must issue
> > operations in the order of operation-id (sequential-znode sequence
> > number). So we need the following invariants:
> > - order of issuing operation being fixed (matching order of creation
> > of operations)
> > - concurrency of operation fixed to 1
> > - idempotent execution on external-system side
> >
> > @Powell: Im kind of doing the same thing. Except the loop doesn't run
> > on consumer, instead it runs on master, which is assigning work to
> > consumers. So triggerWork function is basically changed to issueWork,
> > which is RPC + triggerWork. The replay if history is basically just
> > replay of 1 operation per operand-node (in this thread we are calling
> > it external-system), so its as if triggerWork failed, in which case we
> > need to re-execute triggerWork. Idempotency also follows from that
> > requirement. If triggerWork fails in the last step, and all the
> > desired effect that was necessary has happened, we will still need to
> > run triggerWork again, but we need awareness that actual work has been
> > done, which is why idempotency is necessary.
> >
> > Btw, thanks for continuing to spare time for this, I really appreciate
> > this feedback/validation.
> >
> > Thoughts?
> >
> > On Wed, Jan 13, 2016 at 3:47 AM, powell molleti
> > <po...@yahoo.com.invalid> wrote:
> > > Wouldn't a distributed queue recipe for consumer work?. Where one needs
> > to add extra logic something like this:
> > >
> > > with lock() {
> > >     p = queue.peek()
> > >     if triggerWork(p) is Done:
> > >         queue.pop()
> > > }
> > >
> > > With this a consumer that worked on it but crashed before popping the
> > queue would result in another consumer processing the same work.
> > >
> > > I am not sure with the details of where you are getting the work from
> > and the scale of it is but producers(leader) could write to this queue.
> > Since there is guarantee of read after write , producer could delete from
> > its local queue the work that was successfully queued. Hence again new
> > producer could re-send the last entry of work so one has to handle that.
> > Without more details on the origin of work etc its hard to design end to
> > end.
> > >
> > > I do not see a need to do a total replay of past history etc when using
> > ZK like system because ZK is built on idea of serialized and replicated
> > log, hence if you are using ZK then your design should be much simpler
> i.e
> > fail and re-start from last know transaction.
> > >
> > > Powell.
> > >
> > >
> > >
> > > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <
> > shralex@gmail.com> wrote:
> > > Hi,
> > >
> > > With your suggestion, the following scenario seems possible: master A
> is
> > > about to write value X to an external system so it logs it to ZK, then
> > > freezes for some time, ZK suspects it as failed, another master B is
> > > elected, writes X (completing what A wanted to do)
> > > then starts doing something else and writes Y. Then leader A "wakes up"
> > and
> > > re-executes the old operation writing X which is now stale.
> > >
> > > perhaps if your external system supports conditional updates this can
> be
> > > avoided - a write of X only works if the current state is as expected.
> > >
> > > Alex
> > >
> > >
> > > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <
> singh.janmejay@gmail.com
> > >
> > > wrote:
> > >
> > >> Thanks for the replies everyone, most of it was very useful.
> > >>
> > >> @Alexander: The section of chubby paper you pointed me to tries to
> > >> address exactly what I was looking for. That clearly is one good way
> > >> of doing it. Im also thinking of an alternative way and can use a
> > >> review or some feedback on that.
> > >>
> > >> @Powel: About x509 auth on intra-cluster communication, I don't have a
> > >> blocking need for it, as it can be achieved by setting up firewall
> > >> rules to accept only from desired hosts. It may be a good-to-have
> > >> thing though, because in cloud-based scenarios where IP addresses are
> > >> re-used, a recycled IP can still talk to a secure zk-cluster unless
> > >> config is changed to remove the old peer IP and replace it with the
> > >> new one. Its clearly a corner-case though.
> > >>
> > >> Here is the approach Im thinking of:
> > >> - Implement all operations(atleast master-triggered operations) on
> > >> operand machines idempotently
> > >> - Have master journal these operations to ZK before issuing RPC
> > >> - In case master fails with some of these operations in flight, the
> > >> newly elected master will need to read all issued (but not retired
> > >> yet) operations and issue them again.
> > >> - Existing master(before failure or after failure) can retry and
> > >> retire operations according to whatever the retry policy and
> > >> success-criterion is.
> > >>
> > >> Why am I thinking of this as opposed to going with chubby sequencer
> > >> passing:
> > >> - I need to implement idempotency regardless, because recovery-path
> > >> involving master-death after successful execution of operation but
> > >> before writing ack to coordination service requires it. So idempotent
> > >> implementation complexity is here to stay.
> > >> - I need to increase surface-area of the architecture which is exposed
> > >> to coordination-service for sequencer validation. Which may bring
> > >> verification RPC in data-plane in some cases.
> > >> - The sequencer may expire after verification but before ack, in which
> > >> case new master may not recognize the operation as consistent with its
> > >> decisions (or previous decision path).
> > >>
> > >> Thoughts? Suggestions?
> > >>
> > >>
> > >>
> > >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com>
> > >> wrote:
> > >> > regarding atomic multi-znode updates -- check out "multi" updates
> > >> > <
> > >>
> >
> http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html
> > >> >
> > >> > .
> > >> >
> > >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <
> shralex@gmail.com>
> > >> wrote:
> > >> >
> > >> >> for 1, see the chubby paper
> > >> >> <
> > >>
> >
> http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
> > >> >,
> > >> >> section 2.4.
> > >> >> for 2, I'm not sure I fully understand the question, but
> > essentially, ZK
> > >> >> guarantees that even during failures
> > >> >> consistency of updates is preserved. The user doesn't need to do
> > >> anything
> > >> >> in particular to guarantee this, even
> > >> >> during leader failures. In such case, some suffix of operations
> > executed
> > >> >> by the leader may be lost if they weren't
> > >> >> previously acked by a majority.However, none of these operations
> > could
> > >> >> have been visible
> > >> >> to reads.
> > >> >>
> > >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> > >> >> powellm79@yahoo.com.invalid> wrote:
> > >> >>
> > >> >>> Hi Janmejay,
> > >> >>> Regarding question 1, if a node takes a lock and the lock has
> > timed-out
> > >> >>> from system perspective then it can mean that other nodes are free
> > to
> > >> take
> > >> >>> the lock and work on the resource. Hence the history could be well
> > >> into the
> > >> >>> future when the previous node discovers the time-out. The question
> > of
> > >> >>> rollback in the specific context depends on the implementation
> > >> details, is
> > >> >>> the lock holder updating some common area?, then there could be
> > >> corruption
> > >> >>> since other nodes are free to write in parallel to the first one?.
> > In
> > >> the
> > >> >>> usual sense a time-out of lock held means the node which held the
> > lock
> > >> is
> > >> >>> dead. It is upto the implementation to ensure this case and, using
> > this
> > >> >>> primitive, if there is a timeout which means other nodes are sure
> > that
> > >> no
> > >> >>> one else is working on the resource and hence can move forward.
> > >> >>> Question 2 seems to imply the assumption that leader has
> significant
> > >> work
> > >> >>> todo and leader change is quite common, which seems contrary to
> > common
> > >> >>> implementation pattern. If the work can be broken down into
> smaller
> > >> chunks
> > >> >>> which need serialization separately then each chunk/work type can
> > have
> > >> a
> > >> >>> different leader.
> > >> >>> For question 3, ZK does support auth and encryption for client
> > >> >>> connections but not for inter ZK node channels. Do you have
> > >> requirement to
> > >> >>> secure inter ZK nodes, can you let us know what your requirements
> > are
> > >> so we
> > >> >>> can implement a solution to fit all needs?.
> > >> >>> For question 4 the official implementation is C, people tend to
> wrap
> > >> that
> > >> >>> with C++ and there should projects that use ZK doing that you can
> > look
> > >> them
> > >> >>> up and see if you can separate it out and use them.
> > >> >>> Hope this helps.Powell.
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> > >> >>> edward.capriolo@huffingtonpost.com> wrote:
> > >> >>>
> > >> >>>
> > >> >>>  Q:What is the best way of handling distributed-lock expiry? The
> > owner
> > >> >>> of the lock managed to acquire it and may be in middle of some
> > >> >>> computation when the session expires or lock expire
> > >> >>>
> > >> >>> If you are using Java a way I can see doing this is by using the
> > >> >>> ExecutorCompletionService
> > >> >>>
> > >> >>>
> > >>
> >
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> > >> >>> .
> > >> >>> It allows you to keep your workers in a group, You can poll the
> > group
> > >> and
> > >> >>> provide cancel semantics as needed.
> > >> >>> An example of that service is here:
> > >> >>>
> > >> >>>
> > >>
> >
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> > >> >>> where I am issuing multiple reads and I want to abandon the
> process
> > if
> > >> >>> they
> > >> >>> do not timeout in a while. Many async/promices frameworks do this
> by
> > >> >>> launching two task ComputationTask and a TimeoutTask that returns
> > in 10
> > >> >>> seconds. Then they ask the completions service to poll. If the
> > service
> > >> is
> > >> >>> given the TimoutTask after the timeout that means the Computation
> > did
> > >> not
> > >> >>> finish in time.
> > >> >>>
> > >> >>> Do people generally take action in middle of the computation
> (abort
> > it
> > >> and
> > >> >>> do itin a clever way such that effect appears atomic, so abort is
> > >> >>> notreally
> > >> >>> visible, if so what are some of those clever ways)?
> > >> >>>
> > >> >>> The base issue is java's synchronized/ AtomicReference do not
> have a
> > >> >>> rollback.
> > >> >>>
> > >> >>> There are a few ways I know to work around this. Clojure has STM
> > >> (software
> > >> >>> Transactional Memory) such that if an exception is through inside
> a
> > >> doSync
> > >> >>> all of the stems inside the critical block never happened. This
> > assumes
> > >> >>> your using all clojure structures which you are probably not.
> > >> >>> A way co workers have done this is as follows. Move your entire
> > >> >>> transnational state into a SINGLE big object that you can
> > >> >>> copy/mutate/compare and swap. You never need to rollback each
> piece
> > >> >>> because
> > >> >>> your changing the clone up until the point you commit it.
> > >> >>> Writing reversal code is possible depending on the problem. There
> > are
> > >> >>> questions to ask like "what if the reversal somehow fails?"
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <
> > >> singh.janmejay@gmail.com
> > >> >>> >
> > >> >>> wrote:
> > >> >>>
> > >> >>> > Hi,
> > >> >>> >
> > >> >>> > Was wondering if there are any reference designs, patterns on
> > >> handling
> > >> >>> > common operations involving distributed coordination.
> > >> >>> >
> > >> >>> > I have a few questions and I guess they must have been asked
> > before,
> > >> I
> > >> >>> > am unsure what to search for to surface the right answers. It'll
> > be
> > >> >>> > really valuable if someone can provide links to relevant
> > >> >>> > "best-practices guide" or "suggestions" per question or share
> some
> > >> >>> > wisdom or ideas on patterns to do this in the best way.
> > >> >>> >
> > >> >>> > 1. What is the best way of handling distributed-lock expiry? The
> > >> owner
> > >> >>> > of the lock managed to acquire it and may be in middle of some
> > >> >>> > computation when the session expires or lock expires. When it
> > >> finishes
> > >> >>> > that computation, it can tell that the lock expired, but do
> people
> > >> >>> > generally take action in middle of the computation (abort it and
> > do
> > >> it
> > >> >>> > in a clever way such that effect appears atomic, so abort is not
> > >> >>> > really visible, if so what are some of those clever ways)? Or is
> > the
> > >> >>> > right thing to do, is to write reversal-code, such that
> operations
> > >> can
> > >> >>> > be cleanly undone in case the verification at the end of
> > computation
> > >> >>> > shows that lock expired? The later obviously is a lot harder to
> > >> >>> > achieve.
> > >> >>> >
> > >> >>> > 2. Same as above for leader-election scenarios. Leader generally
> > >> >>> > administers operations on data-systems that take significant
> time
> > to
> > >> >>> > complete and have significant resource overhead and RPC to
> > administer
> > >> >>> > such operations synchronously from leader to data-node can't be
> > >> atomic
> > >> >>> > and can't be made latency-resilient to such a degree that
> issuing
> > >> >>> > operation across a large set of nodes on a cluster can be
> > guaranteed
> > >> >>> > to finish without leader-change. What do people generally do in
> > such
> > >> >>> > situations? How are timeouts for operations issued when
> operations
> > >> are
> > >> >>> > issued using sequential-znode as a per-datanode dedicated queue?
> > How
> > >> >>> > well does it scale, and what are some things to watch-out for
> > >> >>> > (operation-size, encoding, clustering into one znode for
> atomicity
> > >> >>> > etc)? Or how are atomic operations that need to be issued across
> > >> >>> > multiple data-nodes managed (do they have to be clobbered into
> one
> > >> >>> > znode)?
> > >> >>> >
> > >> >>> > 3. How do people secure zookeeper based services? Is
> > >> >>> > client-certificate-verification the recommended way? How well
> does
> > >> >>> > this work with C client? Is inter-zk-node communication done
> with
> > >> >>> > X509-auth too?
> > >> >>> >
> > >> >>> > 4. What other projects, reference-implementations or libraries
> > should
> > >> >>> > I look at for working with C client?
> > >> >>> >
> > >> >>> > Most of what I have asked revolves around leader or lock-owner
> > having
> > >> >>> > a false-failure (where it doesn't know that coordinator thinks
> it
> > has
> > >> >>> > failed).
> > >> >>> >
> > >> >>> > --
> > >> >>> > Regards,
> > >> >>> > Janmejay
> > >> >>> > http://codehunk.wordpress.com
> > >> >>> >
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >>
> > >> --
> > >> Regards,
> > >> Janmejay
> > >> http://codehunk.wordpress.com
> > >>
> >
> >
> >
> > --
> > Regards,
> > Janmejay
> > http://codehunk.wordpress.com
> >
>

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by Alexander Shraer <sh...@gmail.com>.
I may be wrong, but I don't think that being idempotent alone gives you what
you said. Just because f(f(x)) = f(x) doesn't mean that f(g(f(x))) = g(f(x)) --
that was my example. But if your system can detect that X was already
executed (or if the operations are conditional on state), my scenario indeed
can't happen.
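
A tiny Java illustration of why idempotency alone is not enough (Register,
setX and setY are made-up names):

class Register {
    String value;
    void setX() { value = "X"; }   // f: idempotent, setX(); setX() leaves the same state as setX()
    void setY() { value = "Y"; }   // g: a later, different write
}
// setX(); setY(); setX();  ->  value == "X": replaying f after g undoes g,
// i.e. f(g(f(x))) != g(f(x)) even though f(f(x)) == f(x).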



Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by "singh.janmejay" <si...@gmail.com>.
@Alexander: In that scenario, the write of X will be attempted by A, but the
external system will not act upon write-X because that operation has
already been acted upon in the past. This is guaranteed by the idempotent
operations invariant. But it does point out another problem, which I
hadn't handled in my original algorithm. Problem: if X and Y have both
been logged but not issued yet, and Y is issued before X towards the external
system, then because neither operation has executed yet, X will overwrite
Y. I need another constraint: the master should only issue one
operation on a given external-system at a time and must issue
operations in the order of operation-id (sequential-znode sequence
number). So we need the following invariants (a rough sketch of the
resulting issue loop follows the list):
- order of issuing operations is fixed (matching the order of creation
of the operations)
- concurrency of operations fixed to 1 per external-system
- idempotent execution on the external-system side
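
A minimal Java sketch of the issue loop these invariants suggest, assuming a
ZooKeeper handle, an /ops/<system-id>/op-<seq> journal layout and an
issueWork() RPC stub (all of these names are illustrative assumptions, not
the actual implementation):

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

class OpIssuer {
    private final ZooKeeper zk;
    OpIssuer(ZooKeeper zk) { this.zk = zk; }

    // Drives one external system: issue journalled ops in op-id order, one at a time.
    void drive(String systemId) throws Exception {
        String base = "/ops/" + systemId;              // ops journalled here before any RPC
        List<String> pending = zk.getChildren(base, false);
        Collections.sort(pending);                     // sequential-znode names sort in creation order
        for (String op : pending) {                    // only one op in flight per external system
            byte[] payload = zk.getData(base + "/" + op, false, null);
            issueWork(systemId, payload);              // RPC; the operand node executes it idempotently
            zk.delete(base + "/" + op, -1);            // retire only after confirmed success
        }
    }

    void issueWork(String systemId, byte[] payload) { /* RPC to the operand node (stub) */ }
}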

@Powell: I'm kind of doing the same thing, except the loop doesn't run
on the consumer; instead it runs on the master, which is assigning work to
consumers. So the triggerWork function is basically changed to issueWork,
which is RPC + triggerWork. The replay of history is basically just a
replay of 1 operation per operand-node (in this thread we are calling
it the external-system), so it's as if triggerWork failed, in which case we
need to re-execute triggerWork. Idempotency also follows from that
requirement. If triggerWork fails in its last step, even though all the
desired effect has already happened, we will still need to run triggerWork
again, but with awareness that the actual work has already been done,
which is why idempotency is necessary.

Btw, thanks for continuing to spare time for this, I really appreciate
this feedback/validation.

Thoughts?


-- 
Regards,
Janmejay
http://codehunk.wordpress.com

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by powell molleti <po...@yahoo.com.INVALID>.
Wouldn't a distributed queue recipe work for the consumer? One needs to add a
little extra logic, something like this:

with lock():
    p = queue.peek()
    if triggerWork(p) is Done:
        queue.pop()

With this, a consumer that worked on an item but crashed before popping the queue would simply result in another consumer processing the same work.
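
A rough Java sketch of that consume step against plain ZooKeeper; /queue, the
DistributedLock interface and doWork() are illustrative stand-ins (the lock
would be any standard ZK lock recipe), not a canned recipe from the docs:

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

interface DistributedLock { void acquire() throws Exception; void release() throws Exception; }

class QueueConsumer {
    private final ZooKeeper zk;
    private final DistributedLock lock;   // any ZK lock recipe; assumed, not shown

    QueueConsumer(ZooKeeper zk, DistributedLock lock) { this.zk = zk; this.lock = lock; }

    void consumeOne() throws Exception {
        lock.acquire();
        try {
            List<String> items = zk.getChildren("/queue", false);
            if (items.isEmpty()) return;
            Collections.sort(items);                // sequential znodes: smallest name is the head
            String head = items.get(0);             // peek
            byte[] task = zk.getData("/queue/" + head, false, null);
            if (doWork(task)) {                     // triggerWork(p) is Done
                zk.delete("/queue/" + head, -1);    // pop only after the work finished
            }
            // A crash before the delete leaves the item queued, so another consumer
            // will process it again - which is why the work itself must be idempotent.
        } finally {
            lock.release();
        }
    }

    boolean doWork(byte[] task) { return true; /* stub */ }
}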

I am not sure of the details of where you are getting the work from, or of its scale, but the producer (leader) could write to this queue. Since there is a guarantee of read-after-write, the producer could delete from its local queue any work that was successfully queued. A new producer could then re-send the last entry of work, so one has to handle that as well. Without more details on the origin of the work it is hard to design this end to end.

I do not see a need to do a total replay of past history when using a ZK-like system, because ZK is built on the idea of a serialized and replicated log; if you are using ZK then your design should be much simpler, i.e. fail and re-start from the last known transaction.

Powell.
    



Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by Alexander Shraer <sh...@gmail.com>.
Hi,

With your suggestion, the following scenario seems possible: master A is
about to write value X to an external system, so it logs the operation to
ZK and then freezes for some time. ZK suspects A has failed, another
master B is elected, B writes X (completing what A wanted to do), then
starts doing something else and writes Y. Then leader A "wakes up" and
re-executes the old operation, writing X, which is now stale.

Perhaps this can be avoided if your external system supports conditional
updates - a write of X only succeeds if the current state is what the
writer expects.
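
As a minimal sketch of that pattern, assuming the shared state lives in a
znode (ZooKeeper's own versioned setData gives this kind of conditional
update; the helper name and path below are made up for illustration):

import java.util.Arrays;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class ConditionalWriter {
    // Write newValue only if the node still holds expectedValue. A master
    // that "wakes up" late finds the state has moved on and backs off
    // instead of re-writing a stale X.
    static boolean writeIfUnchanged(ZooKeeper zk, String path,
                                    byte[] expectedValue, byte[] newValue)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);
        if (!Arrays.equals(current, expectedValue)) {
            return false;                  // state already changed (e.g. B wrote Y)
        }
        try {
            zk.setData(path, newValue, stat.getVersion()); // fails if version moved
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;                  // raced with another writer
        }
    }
}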

Alex

On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <si...@gmail.com>
wrote:

> Thanks for the replies everyone, most of it was very useful.
>
> @Alexander: The section of chubby paper you pointed me to tries to
> address exactly what I was looking for. That clearly is one good way
> of doing it. Im also thinking of an alternative way and can use a
> review or some feedback on that.
>
> @Powel: About x509 auth on intra-cluster communication, I don't have a
> blocking need for it, as it can be achieved by setting up firewall
> rules to accept only from desired hosts. It may be a good-to-have
> thing though, because in cloud-based scenarios where IP addresses are
> re-used, a recycled IP can still talk to a secure zk-cluster unless
> config is changed to remove the old peer IP and replace it with the
> new one. Its clearly a corner-case though.
>
> Here is the approach Im thinking of:
> - Implement all operations(atleast master-triggered operations) on
> operand machines idempotently
> - Have master journal these operations to ZK before issuing RPC
> - In case master fails with some of these operations in flight, the
> newly elected master will need to read all issued (but not retired
> yet) operations and issue them again.
> - Existing master(before failure or after failure) can retry and
> retire operations according to whatever the retry policy and
> success-criterion is.
>
> Why am I thinking of this as opposed to going with chubby sequencer
> passing:
> - I need to implement idempotency regardless, because recovery-path
> involving master-death after successful execution of operation but
> before writing ack to coordination service requires it. So idempotent
> implementation complexity is here to stay.
> - I need to increase surface-area of the architecture which is exposed
> to coordination-service for sequencer validation. Which may bring
> verification RPC in data-plane in some cases.
> - The sequencer may expire after verification but before ack, in which
> case new master may not recognize the operation as consistent with its
> decisions (or previous decision path).
>
> Thoughts? Suggestions?
>
>
>
> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com>
> wrote:
> > regarding atomic multi-znode updates -- check out "multi" updates
> > <
> http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html
> >
> > .
> >
> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com>
> wrote:
> >
> >> for 1, see the chubby paper
> >> <
> http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
> >,
> >> section 2.4.
> >> for 2, I'm not sure I fully understand the question, but essentially, ZK
> >> guarantees that even during failures
> >> consistency of updates is preserved. The user doesn't need to do
> anything
> >> in particular to guarantee this, even
> >> during leader failures. In such case, some suffix of operations executed
> >> by the leader may be lost if they weren't
> >> previously acked by a majority.However, none of these operations could
> >> have been visible
> >> to reads.
> >>
> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> >> powellm79@yahoo.com.invalid> wrote:
> >>
> >>> Hi Janmejay,
> >>> Regarding question 1, if a node takes a lock and the lock has timed-out
> >>> from system perspective then it can mean that other nodes are free to
> take
> >>> the lock and work on the resource. Hence the history could be well
> into the
> >>> future when the previous node discovers the time-out. The question of
> >>> rollback in the specific context depends on the implementation
> details, is
> >>> the lock holder updating some common area?, then there could be
> corruption
> >>> since other nodes are free to write in parallel to the first one?. In
> the
> >>> usual sense a time-out of lock held means the node which held the lock
> is
> >>> dead. It is upto the implementation to ensure this case and, using this
> >>> primitive, if there is a timeout which means other nodes are sure that
> no
> >>> one else is working on the resource and hence can move forward.
> >>> Question 2 seems to imply the assumption that leader has significant
> work
> >>> todo and leader change is quite common, which seems contrary to common
> >>> implementation pattern. If the work can be broken down into smaller
> chunks
> >>> which need serialization separately then each chunk/work type can have
> a
> >>> different leader.
> >>> For question 3, ZK does support auth and encryption for client
> >>> connections but not for inter ZK node channels. Do you have
> requirement to
> >>> secure inter ZK nodes, can you let us know what your requirements are
> so we
> >>> can implement a solution to fit all needs?.
> >>> For question 4 the official implementation is C, people tend to wrap
> that
> >>> with C++ and there should projects that use ZK doing that you can look
> them
> >>> up and see if you can separate it out and use them.
> >>> Hope this helps.Powell.
> >>>
> >>>
> >>>
> >>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> >>> edward.capriolo@huffingtonpost.com> wrote:
> >>>
> >>>
> >>>  Q:What is the best way of handling distributed-lock expiry? The owner
> >>> of the lock managed to acquire it and may be in middle of some
> >>> computation when the session expires or lock expire
> >>>
> >>> If you are using Java a way I can see doing this is by using the
> >>> ExecutorCompletionService
> >>>
> >>>
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> >>> .
> >>> It allows you to keep your workers in a group, You can poll the group
> and
> >>> provide cancel semantics as needed.
> >>> An example of that service is here:
> >>>
> >>>
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> >>> where I am issuing multiple reads and I want to abandon the process if
> >>> they
> >>> do not timeout in a while. Many async/promices frameworks do this by
> >>> launching two task ComputationTask and a TimeoutTask that returns in 10
> >>> seconds. Then they ask the completions service to poll. If the service
> is
> >>> given the TimoutTask after the timeout that means the Computation did
> not
> >>> finish in time.
> >>>
> >>> Do people generally take action in middle of the computation (abort it
> and
> >>> do itin a clever way such that effect appears atomic, so abort is
> >>> notreally
> >>> visible, if so what are some of those clever ways)?
> >>>
> >>> The base issue is java's synchronized/ AtomicReference do not have a
> >>> rollback.
> >>>
> >>> There are a few ways I know to work around this. Clojure has STM
> (software
> >>> Transactional Memory) such that if an exception is through inside a
> doSync
> >>> all of the stems inside the critical block never happened. This assumes
> >>> your using all clojure structures which you are probably not.
> >>> A way co workers have done this is as follows. Move your entire
> >>> transnational state into a SINGLE big object that you can
> >>> copy/mutate/compare and swap. You never need to rollback each piece
> >>> because
> >>> your changing the clone up until the point you commit it.
> >>> Writing reversal code is possible depending on the problem. There are
> >>> questions to ask like "what if the reversal somehow fails?"
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <
> singh.janmejay@gmail.com
> >>> >
> >>> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > Was wondering if there are any reference designs, patterns on
> handling
> >>> > common operations involving distributed coordination.
> >>> >
> >>> > I have a few questions and I guess they must have been asked before,
> I
> >>> > am unsure what to search for to surface the right answers. It'll be
> >>> > really valuable if someone can provide links to relevant
> >>> > "best-practices guide" or "suggestions" per question or share some
> >>> > wisdom or ideas on patterns to do this in the best way.
> >>> >
> >>> > 1. What is the best way of handling distributed-lock expiry? The
> owner
> >>> > of the lock managed to acquire it and may be in middle of some
> >>> > computation when the session expires or lock expires. When it
> finishes
> >>> > that computation, it can tell that the lock expired, but do people
> >>> > generally take action in middle of the computation (abort it and do
> it
> >>> > in a clever way such that effect appears atomic, so abort is not
> >>> > really visible, if so what are some of those clever ways)? Or is the
> >>> > right thing to do, is to write reversal-code, such that operations
> can
> >>> > be cleanly undone in case the verification at the end of computation
> >>> > shows that lock expired? The later obviously is a lot harder to
> >>> > achieve.
> >>> >
> >>> > 2. Same as above for leader-election scenarios. Leader generally
> >>> > administers operations on data-systems that take significant time to
> >>> > complete and have significant resource overhead and RPC to administer
> >>> > such operations synchronously from leader to data-node can't be
> atomic
> >>> > and can't be made latency-resilient to such a degree that issuing
> >>> > operation across a large set of nodes on a cluster can be guaranteed
> >>> > to finish without leader-change. What do people generally do in such
> >>> > situations? How are timeouts for operations issued when operations
> are
> >>> > issued using sequential-znode as a per-datanode dedicated queue? How
> >>> > well does it scale, and what are some things to watch-out for
> >>> > (operation-size, encoding, clustering into one znode for atomicity
> >>> > etc)? Or how are atomic operations that need to be issued across
> >>> > multiple data-nodes managed (do they have to be clobbered into one
> >>> > znode)?
> >>> >
> >>> > 3. How do people secure zookeeper based services? Is
> >>> > client-certificate-verification the recommended way? How well does
> >>> > this work with C client? Is inter-zk-node communication done with
> >>> > X509-auth too?
> >>> >
> >>> > 4. What other projects, reference-implementations or libraries should
> >>> > I look at for working with C client?
> >>> >
> >>> > Most of what I have asked revolves around leader or lock-owner having
> >>> > a false-failure (where it doesn't know that coordinator thinks it has
> >>> > failed).
> >>> >
> >>> > --
> >>> > Regards,
> >>> > Janmejay
> >>> > http://codehunk.wordpress.com
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>
>
>
> --
> Regards,
> Janmejay
> http://codehunk.wordpress.com
>

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by "singh.janmejay" <si...@gmail.com>.
Thanks for the replies, everyone; most of them were very useful.

@Alexander: The section of the Chubby paper you pointed me to addresses
exactly what I was looking for. That is clearly one good way of doing it.
I'm also thinking of an alternative approach and could use a review or
some feedback on it.

@Powell: About X509 auth on intra-cluster communication, I don't have a
blocking need for it, as it can be achieved by setting up firewall rules
to accept connections only from desired hosts. It may be a good-to-have
thing though, because in cloud-based scenarios where IP addresses are
re-used, a recycled IP can still talk to a secure ZK cluster unless the
config is changed to remove the old peer IP and replace it with the new
one. It's clearly a corner case, though.

Here is the approach I'm thinking of:
- Implement all operations (at least master-triggered operations) on
operand machines idempotently.
- Have the master journal these operations to ZK before issuing the RPC
(see the sketch below).
- In case the master fails with some of these operations in flight, the
newly elected master reads all issued (but not yet retired) operations
and issues them again.
- The existing master (before or after a failure) can retry and retire
operations according to whatever the retry policy and success criterion is.
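
A minimal sketch of the journal/retire steps, assuming a /ops parent
znode and idempotent handlers on the operand machines (class, method and
path names below are illustrative only):

import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class OpJournal {
    private final ZooKeeper zk;

    OpJournal(ZooKeeper zk) { this.zk = zk; }

    // Journal the operation before issuing the RPC; returns the entry
    // path, e.g. /ops/op-0000000042.
    String journal(byte[] opPayload) throws Exception {
        return zk.create("/ops/op-", opPayload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    // Retire the entry once the (idempotent) operation has been acked.
    void retire(String opPath) throws Exception {
        zk.delete(opPath, -1);             // -1: ignore the znode version
    }

    // A newly elected master replays everything journaled but not retired.
    List<String> pendingOps() throws Exception {
        return zk.getChildren("/ops", false);
    }
}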

Why am I thinking of this as opposed to going with Chubby-style sequencer
passing:
- I need to implement idempotency regardless, because the recovery path
involving master death after a successful execution of an operation, but
before the ack is written to the coordination service, requires it. So the
idempotent-implementation complexity is here to stay.
- It increases the surface area of the architecture that is exposed to the
coordination service for sequencer validation, which may put a
verification RPC in the data plane in some cases.
- The sequencer may expire after verification but before the ack, in which
case the new master may not recognize the operation as consistent with its
decisions (or previous decision path).

Thoughts? Suggestions?



On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <sh...@gmail.com> wrote:
> regarding atomic multi-znode updates -- check out "multi" updates
> <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>
> .
>
> On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com> wrote:
>
>> for 1, see the chubby paper
>> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
>> section 2.4.
>> for 2, I'm not sure I fully understand the question, but essentially, ZK
>> guarantees that even during failures
>> consistency of updates is preserved. The user doesn't need to do anything
>> in particular to guarantee this, even
>> during leader failures. In such case, some suffix of operations executed
>> by the leader may be lost if they weren't
>> previously acked by a majority.However, none of these operations could
>> have been visible
>> to reads.
>>
>> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
>> powellm79@yahoo.com.invalid> wrote:
>>
>>> Hi Janmejay,
>>> Regarding question 1, if a node takes a lock and the lock has timed-out
>>> from system perspective then it can mean that other nodes are free to take
>>> the lock and work on the resource. Hence the history could be well into the
>>> future when the previous node discovers the time-out. The question of
>>> rollback in the specific context depends on the implementation details, is
>>> the lock holder updating some common area?, then there could be corruption
>>> since other nodes are free to write in parallel to the first one?. In the
>>> usual sense a time-out of lock held means the node which held the lock is
>>> dead. It is upto the implementation to ensure this case and, using this
>>> primitive, if there is a timeout which means other nodes are sure that no
>>> one else is working on the resource and hence can move forward.
>>> Question 2 seems to imply the assumption that leader has significant work
>>> todo and leader change is quite common, which seems contrary to common
>>> implementation pattern. If the work can be broken down into smaller chunks
>>> which need serialization separately then each chunk/work type can have a
>>> different leader.
>>> For question 3, ZK does support auth and encryption for client
>>> connections but not for inter ZK node channels. Do you have requirement to
>>> secure inter ZK nodes, can you let us know what your requirements are so we
>>> can implement a solution to fit all needs?.
>>> For question 4 the official implementation is C, people tend to wrap that
>>> with C++ and there should projects that use ZK doing that you can look them
>>> up and see if you can separate it out and use them.
>>> Hope this helps.Powell.
>>>
>>>
>>>
>>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
>>> edward.capriolo@huffingtonpost.com> wrote:
>>>
>>>
>>>  Q:What is the best way of handling distributed-lock expiry? The owner
>>> of the lock managed to acquire it and may be in middle of some
>>> computation when the session expires or lock expire
>>>
>>> If you are using Java a way I can see doing this is by using the
>>> ExecutorCompletionService
>>>
>>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
>>> .
>>> It allows you to keep your workers in a group, You can poll the group and
>>> provide cancel semantics as needed.
>>> An example of that service is here:
>>>
>>> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
>>> where I am issuing multiple reads and I want to abandon the process if
>>> they
>>> do not timeout in a while. Many async/promices frameworks do this by
>>> launching two task ComputationTask and a TimeoutTask that returns in 10
>>> seconds. Then they ask the completions service to poll. If the service is
>>> given the TimoutTask after the timeout that means the Computation did not
>>> finish in time.
>>>
>>> Do people generally take action in middle of the computation (abort it and
>>> do itin a clever way such that effect appears atomic, so abort is
>>> notreally
>>> visible, if so what are some of those clever ways)?
>>>
>>> The base issue is java's synchronized/ AtomicReference do not have a
>>> rollback.
>>>
>>> There are a few ways I know to work around this. Clojure has STM (software
>>> Transactional Memory) such that if an exception is through inside a doSync
>>> all of the stems inside the critical block never happened. This assumes
>>> your using all clojure structures which you are probably not.
>>> A way co workers have done this is as follows. Move your entire
>>> transnational state into a SINGLE big object that you can
>>> copy/mutate/compare and swap. You never need to rollback each piece
>>> because
>>> your changing the clone up until the point you commit it.
>>> Writing reversal code is possible depending on the problem. There are
>>> questions to ask like "what if the reversal somehow fails?"
>>>
>>>
>>>
>>>
>>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <singh.janmejay@gmail.com
>>> >
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Was wondering if there are any reference designs, patterns on handling
>>> > common operations involving distributed coordination.
>>> >
>>> > I have a few questions and I guess they must have been asked before, I
>>> > am unsure what to search for to surface the right answers. It'll be
>>> > really valuable if someone can provide links to relevant
>>> > "best-practices guide" or "suggestions" per question or share some
>>> > wisdom or ideas on patterns to do this in the best way.
>>> >
>>> > 1. What is the best way of handling distributed-lock expiry? The owner
>>> > of the lock managed to acquire it and may be in middle of some
>>> > computation when the session expires or lock expires. When it finishes
>>> > that computation, it can tell that the lock expired, but do people
>>> > generally take action in middle of the computation (abort it and do it
>>> > in a clever way such that effect appears atomic, so abort is not
>>> > really visible, if so what are some of those clever ways)? Or is the
>>> > right thing to do, is to write reversal-code, such that operations can
>>> > be cleanly undone in case the verification at the end of computation
>>> > shows that lock expired? The later obviously is a lot harder to
>>> > achieve.
>>> >
>>> > 2. Same as above for leader-election scenarios. Leader generally
>>> > administers operations on data-systems that take significant time to
>>> > complete and have significant resource overhead and RPC to administer
>>> > such operations synchronously from leader to data-node can't be atomic
>>> > and can't be made latency-resilient to such a degree that issuing
>>> > operation across a large set of nodes on a cluster can be guaranteed
>>> > to finish without leader-change. What do people generally do in such
>>> > situations? How are timeouts for operations issued when operations are
>>> > issued using sequential-znode as a per-datanode dedicated queue? How
>>> > well does it scale, and what are some things to watch-out for
>>> > (operation-size, encoding, clustering into one znode for atomicity
>>> > etc)? Or how are atomic operations that need to be issued across
>>> > multiple data-nodes managed (do they have to be clobbered into one
>>> > znode)?
>>> >
>>> > 3. How do people secure zookeeper based services? Is
>>> > client-certificate-verification the recommended way? How well does
>>> > this work with C client? Is inter-zk-node communication done with
>>> > X509-auth too?
>>> >
>>> > 4. What other projects, reference-implementations or libraries should
>>> > I look at for working with C client?
>>> >
>>> > Most of what I have asked revolves around leader or lock-owner having
>>> > a false-failure (where it doesn't know that coordinator thinks it has
>>> > failed).
>>> >
>>> > --
>>> > Regards,
>>> > Janmejay
>>> > http://codehunk.wordpress.com
>>> >
>>>
>>>
>>>
>>>
>>
>>



-- 
Regards,
Janmejay
http://codehunk.wordpress.com

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by Alexander Shraer <sh...@gmail.com>.
Regarding atomic multi-znode updates -- check out "multi" updates
<http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
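
A minimal sketch of such a multi with the Java client (the paths and the
version in the check op are made up for illustration); either all of the
ops commit or none of them do:

import java.util.Arrays;

import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

class MultiExample {
    static void reassignShards(ZooKeeper zk) throws Exception {
        zk.multi(Arrays.asList(
            Op.check("/cluster/epoch", 7),  // abort the whole batch if the epoch moved
            Op.setData("/cluster/node-1/assignment", "shard-a".getBytes(), -1),
            Op.setData("/cluster/node-2/assignment", "shard-b".getBytes(), -1)));
    }
}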

On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <sh...@gmail.com> wrote:

> for 1, see the chubby paper
> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> section 2.4.
> for 2, I'm not sure I fully understand the question, but essentially, ZK
> guarantees that even during failures
> consistency of updates is preserved. The user doesn't need to do anything
> in particular to guarantee this, even
> during leader failures. In such case, some suffix of operations executed
> by the leader may be lost if they weren't
> previously acked by a majority.However, none of these operations could
> have been visible
> to reads.
>
> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <
> powellm79@yahoo.com.invalid> wrote:
>
>> Hi Janmejay,
>> Regarding question 1, if a node takes a lock and the lock has timed-out
>> from system perspective then it can mean that other nodes are free to take
>> the lock and work on the resource. Hence the history could be well into the
>> future when the previous node discovers the time-out. The question of
>> rollback in the specific context depends on the implementation details, is
>> the lock holder updating some common area?, then there could be corruption
>> since other nodes are free to write in parallel to the first one?. In the
>> usual sense a time-out of lock held means the node which held the lock is
>> dead. It is upto the implementation to ensure this case and, using this
>> primitive, if there is a timeout which means other nodes are sure that no
>> one else is working on the resource and hence can move forward.
>> Question 2 seems to imply the assumption that leader has significant work
>> todo and leader change is quite common, which seems contrary to common
>> implementation pattern. If the work can be broken down into smaller chunks
>> which need serialization separately then each chunk/work type can have a
>> different leader.
>> For question 3, ZK does support auth and encryption for client
>> connections but not for inter ZK node channels. Do you have requirement to
>> secure inter ZK nodes, can you let us know what your requirements are so we
>> can implement a solution to fit all needs?.
>> For question 4 the official implementation is C, people tend to wrap that
>> with C++ and there should projects that use ZK doing that you can look them
>> up and see if you can separate it out and use them.
>> Hope this helps.Powell.
>>
>>
>>
>>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
>> edward.capriolo@huffingtonpost.com> wrote:
>>
>>
>>  Q:What is the best way of handling distributed-lock expiry? The owner
>> of the lock managed to acquire it and may be in middle of some
>> computation when the session expires or lock expire
>>
>> If you are using Java a way I can see doing this is by using the
>> ExecutorCompletionService
>>
>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
>> .
>> It allows you to keep your workers in a group, You can poll the group and
>> provide cancel semantics as needed.
>> An example of that service is here:
>>
>> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
>> where I am issuing multiple reads and I want to abandon the process if
>> they
>> do not timeout in a while. Many async/promices frameworks do this by
>> launching two task ComputationTask and a TimeoutTask that returns in 10
>> seconds. Then they ask the completions service to poll. If the service is
>> given the TimoutTask after the timeout that means the Computation did not
>> finish in time.
>>
>> Do people generally take action in middle of the computation (abort it and
>> do itin a clever way such that effect appears atomic, so abort is
>> notreally
>> visible, if so what are some of those clever ways)?
>>
>> The base issue is java's synchronized/ AtomicReference do not have a
>> rollback.
>>
>> There are a few ways I know to work around this. Clojure has STM (software
>> Transactional Memory) such that if an exception is through inside a doSync
>> all of the stems inside the critical block never happened. This assumes
>> your using all clojure structures which you are probably not.
>> A way co workers have done this is as follows. Move your entire
>> transnational state into a SINGLE big object that you can
>> copy/mutate/compare and swap. You never need to rollback each piece
>> because
>> your changing the clone up until the point you commit it.
>> Writing reversal code is possible depending on the problem. There are
>> questions to ask like "what if the reversal somehow fails?"
>>
>>
>>
>>
>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <singh.janmejay@gmail.com
>> >
>> wrote:
>>
>> > Hi,
>> >
>> > Was wondering if there are any reference designs, patterns on handling
>> > common operations involving distributed coordination.
>> >
>> > I have a few questions and I guess they must have been asked before, I
>> > am unsure what to search for to surface the right answers. It'll be
>> > really valuable if someone can provide links to relevant
>> > "best-practices guide" or "suggestions" per question or share some
>> > wisdom or ideas on patterns to do this in the best way.
>> >
>> > 1. What is the best way of handling distributed-lock expiry? The owner
>> > of the lock managed to acquire it and may be in middle of some
>> > computation when the session expires or lock expires. When it finishes
>> > that computation, it can tell that the lock expired, but do people
>> > generally take action in middle of the computation (abort it and do it
>> > in a clever way such that effect appears atomic, so abort is not
>> > really visible, if so what are some of those clever ways)? Or is the
>> > right thing to do, is to write reversal-code, such that operations can
>> > be cleanly undone in case the verification at the end of computation
>> > shows that lock expired? The later obviously is a lot harder to
>> > achieve.
>> >
>> > 2. Same as above for leader-election scenarios. Leader generally
>> > administers operations on data-systems that take significant time to
>> > complete and have significant resource overhead and RPC to administer
>> > such operations synchronously from leader to data-node can't be atomic
>> > and can't be made latency-resilient to such a degree that issuing
>> > operation across a large set of nodes on a cluster can be guaranteed
>> > to finish without leader-change. What do people generally do in such
>> > situations? How are timeouts for operations issued when operations are
>> > issued using sequential-znode as a per-datanode dedicated queue? How
>> > well does it scale, and what are some things to watch-out for
>> > (operation-size, encoding, clustering into one znode for atomicity
>> > etc)? Or how are atomic operations that need to be issued across
>> > multiple data-nodes managed (do they have to be clobbered into one
>> > znode)?
>> >
>> > 3. How do people secure zookeeper based services? Is
>> > client-certificate-verification the recommended way? How well does
>> > this work with C client? Is inter-zk-node communication done with
>> > X509-auth too?
>> >
>> > 4. What other projects, reference-implementations or libraries should
>> > I look at for working with C client?
>> >
>> > Most of what I have asked revolves around leader or lock-owner having
>> > a false-failure (where it doesn't know that coordinator thinks it has
>> > failed).
>> >
>> > --
>> > Regards,
>> > Janmejay
>> > http://codehunk.wordpress.com
>> >
>>
>>
>>
>>
>
>

Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)

Posted by Alexander Shraer <sh...@gmail.com>.
For 1, see the Chubby paper
<http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
section 2.4.
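
The sequencer described there is essentially a fencing token. A minimal
sketch of the validation side, assuming the lock holder forwards its lock
znode's sequence number with every request (the class and its API are
illustrative only):

// Resource-side check: reject requests carrying a token older than the
// newest one seen, so a holder whose lock silently expired cannot clobber
// work done under a later lock.
class FencedResource {
    private long highestTokenSeen = -1;

    synchronized boolean write(long fencingToken, byte[] payload) {
        if (fencingToken < highestTokenSeen) {
            return false;                 // stale holder: lock was lost and re-acquired
        }
        highestTokenSeen = fencingToken;
        apply(payload);                   // the actual state change (not shown)
        return true;
    }

    private void apply(byte[] payload) { /* ... */ }
}
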
For 2, I'm not sure I fully understand the question, but essentially ZK
guarantees that consistency of updates is preserved even during failures.
The user doesn't need to do anything in particular to guarantee this, even
during leader failures. In that case, some suffix of the operations
executed by the leader may be lost if they weren't previously acked by a
majority. However, none of these operations could have been visible to
reads.

On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <powellm79@yahoo.com.invalid
> wrote:

> Hi Janmejay,
> Regarding question 1, if a node takes a lock and the lock has timed-out
> from system perspective then it can mean that other nodes are free to take
> the lock and work on the resource. Hence the history could be well into the
> future when the previous node discovers the time-out. The question of
> rollback in the specific context depends on the implementation details, is
> the lock holder updating some common area?, then there could be corruption
> since other nodes are free to write in parallel to the first one?. In the
> usual sense a time-out of lock held means the node which held the lock is
> dead. It is upto the implementation to ensure this case and, using this
> primitive, if there is a timeout which means other nodes are sure that no
> one else is working on the resource and hence can move forward.
> Question 2 seems to imply the assumption that leader has significant work
> todo and leader change is quite common, which seems contrary to common
> implementation pattern. If the work can be broken down into smaller chunks
> which need serialization separately then each chunk/work type can have a
> different leader.
> For question 3, ZK does support auth and encryption for client connections
> but not for inter ZK node channels. Do you have requirement to secure inter
> ZK nodes, can you let us know what your requirements are so we can
> implement a solution to fit all needs?.
> For question 4 the official implementation is C, people tend to wrap that
> with C++ and there should projects that use ZK doing that you can look them
> up and see if you can separate it out and use them.
> Hope this helps.Powell.
>
>
>
>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> edward.capriolo@huffingtonpost.com> wrote:
>
>
>  Q:What is the best way of handling distributed-lock expiry? The owner
> of the lock managed to acquire it and may be in middle of some
> computation when the session expires or lock expire
>
> If you are using Java a way I can see doing this is by using the
> ExecutorCompletionService
>
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> .
> It allows you to keep your workers in a group, You can poll the group and
> provide cancel semantics as needed.
> An example of that service is here:
>
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> where I am issuing multiple reads and I want to abandon the process if they
> do not timeout in a while. Many async/promices frameworks do this by
> launching two task ComputationTask and a TimeoutTask that returns in 10
> seconds. Then they ask the completions service to poll. If the service is
> given the TimoutTask after the timeout that means the Computation did not
> finish in time.
>
> Do people generally take action in middle of the computation (abort it and
> do itin a clever way such that effect appears atomic, so abort is notreally
> visible, if so what are some of those clever ways)?
>
> The base issue is java's synchronized/ AtomicReference do not have a
> rollback.
>
> There are a few ways I know to work around this. Clojure has STM (software
> Transactional Memory) such that if an exception is through inside a doSync
> all of the stems inside the critical block never happened. This assumes
> your using all clojure structures which you are probably not.
> A way co workers have done this is as follows. Move your entire
> transnational state into a SINGLE big object that you can
> copy/mutate/compare and swap. You never need to rollback each piece because
> your changing the clone up until the point you commit it.
> Writing reversal code is possible depending on the problem. There are
> questions to ask like "what if the reversal somehow fails?"
>
>
>
>
> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <si...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Was wondering if there are any reference designs, patterns on handling
> > common operations involving distributed coordination.
> >
> > I have a few questions and I guess they must have been asked before, I
> > am unsure what to search for to surface the right answers. It'll be
> > really valuable if someone can provide links to relevant
> > "best-practices guide" or "suggestions" per question or share some
> > wisdom or ideas on patterns to do this in the best way.
> >
> > 1. What is the best way of handling distributed-lock expiry? The owner
> > of the lock managed to acquire it and may be in middle of some
> > computation when the session expires or lock expires. When it finishes
> > that computation, it can tell that the lock expired, but do people
> > generally take action in middle of the computation (abort it and do it
> > in a clever way such that effect appears atomic, so abort is not
> > really visible, if so what are some of those clever ways)? Or is the
> > right thing to do, is to write reversal-code, such that operations can
> > be cleanly undone in case the verification at the end of computation
> > shows that lock expired? The later obviously is a lot harder to
> > achieve.
> >
> > 2. Same as above for leader-election scenarios. Leader generally
> > administers operations on data-systems that take significant time to
> > complete and have significant resource overhead and RPC to administer
> > such operations synchronously from leader to data-node can't be atomic
> > and can't be made latency-resilient to such a degree that issuing
> > operation across a large set of nodes on a cluster can be guaranteed
> > to finish without leader-change. What do people generally do in such
> > situations? How are timeouts for operations issued when operations are
> > issued using sequential-znode as a per-datanode dedicated queue? How
> > well does it scale, and what are some things to watch-out for
> > (operation-size, encoding, clustering into one znode for atomicity
> > etc)? Or how are atomic operations that need to be issued across
> > multiple data-nodes managed (do they have to be clobbered into one
> > znode)?
> >
> > 3. How do people secure zookeeper based services? Is
> > client-certificate-verification the recommended way? How well does
> > this work with C client? Is inter-zk-node communication done with
> > X509-auth too?
> >
> > 4. What other projects, reference-implementations or libraries should
> > I look at for working with C client?
> >
> > Most of what I have asked revolves around leader or lock-owner having
> > a false-failure (where it doesn't know that coordinator thinks it has
> > failed).
> >
> > --
> > Regards,
> > Janmejay
> > http://codehunk.wordpress.com
> >
>
>
>
>