Posted to dev@fluo.apache.org by Bill Slacum <ws...@gmail.com> on 2022/11/02 20:32:26 UTC

Partial Transactions

This is related to https://github.com/apache/fluo/issues/660

I've noticed this error crop up a handful of times on smaller development
clusters, and it is happening increasingly often on larger, bare metal
clusters (think hundreds of CPUs, terabytes of memory running dozens of
workers). It's difficult to reproduce without some manual agitation, but I
noticed that if a transaction gets aborted in a few spots, it won't roll
back any locks it held. There are some steps in `commitAsync` that won't
roll back anything on failure, but it's also possible an error could abort
that flow prematurely. It's also possible that a JVM failure in there would
stop a transaction in its place without any recovery/rollback.

I think, more importantly for my use case, it is possible that this state
will raise an IllegalStateException, which kills the worker process and
restarts it. From then on, all further writes/scans will fail if they
encounter a transaction in an UNKNOWN state.

I added a quick little step at the end of `DeleteLocksStep` that has a 1%
chance of failing a transaction
(https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4). Eventually
I run into an error similar to the one described in #660. This blocks
pretty much all reads and writes into my cluster until I go in and remove
the underlying Accumulo keys that represent the lock graph.
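
The kind of fault injection described above can be sketched roughly as
follows. This is a minimal illustration and not the contents of the gist:
`FaultInjector` and `maybeFail` are hypothetical names that do not exist in
Fluo, and the sketch assumes the failure is raised as an unchecked
exception at the end of a commit step.

```java
import java.util.Random;

/** Hypothetical fault injector for testing; not part of Fluo. */
final class FaultInjector {
  private final Random random;
  private final double failureProbability;

  FaultInjector(Random random, double failureProbability) {
    if (failureProbability < 0.0 || failureProbability > 1.0) {
      throw new IllegalArgumentException("probability must be in [0, 1]");
    }
    this.random = random;
    this.failureProbability = failureProbability;
  }

  /**
   * Throws with the configured probability, simulating a commit step that
   * dies before it can clean up after itself.
   */
  void maybeFail(String stepName) {
    // nextDouble() is uniform in [0, 1), so 0.01 fails about 1% of calls.
    if (random.nextDouble() < failureProbability) {
      throw new IllegalStateException("injected failure in " + stepName);
    }
  }
}
```

Calling something like `new FaultInjector(new Random(seed), 0.01).maybeFail("DeleteLocksStep")`
at the end of a step would give the 1% behavior; passing a fixed seed makes
a failing run repeatable.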

What should we do in this scenario? The two things that jump out to me are:

1. Always rolling back locks on failure. This doesn't appear to happen in
some default implementations of BatchWriterStep (DeleteLocksStep,
WriteNotificationsStep). LockOtherStep also doesn't seem to handle unknowns,
given the comments. I think this leads into #2.

2. Other transactions notice a dangling or dead transaction. If a JVM goes
away, how do I go about resolving/rolling back all the locks that the dead
transaction held? We clearly halt when we can't find the primary, but we
need to go through and resolve all the locks that are pointing to that
primary. Would this require a full table scan of the underlying Accumulo
table?

Our design may be part of the issue, in that certain pieces of our
transactions seem to update the same portions of a table (we keep a
per-partition count around), which could exacerbate this.

Any advice is appreciated!

Thanks,
Bill

Re: Partial Transactions

Posted by Keith Turner <ke...@deenlo.com>.
Bill,

I did a few local experiments trying to reproduce the problem and was
not able to do so.  I then went back and looked at #660 and noticed
that the column family seemed to be empty in the messages posted
there.  So I tried setting an empty column family in my experiments
and boom, the problem you are seeing happened.  Now I can reliably
reproduce the problem.  I am trying to figure out why the empty column
family is making lock recovery fail.

Keith

Re: Partial Transactions

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Nov 2, 2022 at 8:32 PM Bill Slacum <ws...@gmail.com> wrote:
>
> This is related to https://github.com/apache/fluo/issues/660
>
> I've noticed this error crop up a handful of times on smaller development
> clusters, but is happening increasingly on larger, bare metal clusters
> (think hundreds of CPUs, terabytes of memory running dozens of workers).
> It's difficult to reproduce without some manual agitation, but I noticed if
> a transaction gets aborted in a few spots, it won't rollback any locks it
> held. There are some steps in `commitAsync` that won't roll back anything
> on failure, but it's also possible an error could abort that flow
> prematurely. It's also possible that JVM failure in there would stop a
> transaction in its place without any recovery/rollback.
>
> I think, more importantly for my use case, it is possible that state will
> raise an IllegalStateException which will kill the worker process and
> restart it, meaning that all further writes/scans will fail if they
> encounter a transaction in an UNKNOWN state.
>
> I added a quick little step at the end of `DeleteLockStep` that has a 1%
> chance of failing a transaction (
> https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4). Eventually
> I'll run into an error similar to the one described in #660. This blocks
> pretty much all reads and writes into my cluster until I go in and remove
> the underlying Accumulo keys that represent the lock graph.
>
> What should we do in this scenario? The two things that jump out to me are:
>
> 1. Always rolling back locks on failure. This doesn't appear to happen in
> some default implementations of BatchWriterStep (DeleteLocksStep,
> WriteNotificationsStep).
>  LockOtherStep also doesn't seem to handle unknowns given the comments. I
> think this leads into #2.

Whenever a transaction fails for any reason, later transactions should
resolve its status when they try to read the data.  We need to figure
out why that is not happening for you.  After figuring that out we
could look into optimizing certain failure cases so that the
resolution happens sooner.
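
As an illustration of the lazy resolution described above, here is a
simplified in-memory sketch in the style of Percolator, the design Fluo is
based on. It is not Fluo's lock recovery code: the class, the maps, and the
`ownerAlive` flag are all hypothetical stand-ins, and a real implementation
would make each step an atomic conditional update on the table.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of lazy lock resolution. Hypothetical; not Fluo's code. */
final class LockResolutionSketch {
  /** A lock left on a cell; every lock points at its transaction's primary. */
  static final class Lock {
    final String primaryCell;
    final long startTs;

    Lock(String primaryCell, long startTs) {
      this.primaryCell = primaryCell;
      this.startTs = startTs;
    }
  }

  enum Outcome { NO_LOCK, STILL_RUNNING, ROLLED_FORWARD, ROLLED_BACK }

  /** cell -> lock currently held on that cell */
  final Map<String, Lock> locks = new HashMap<>();
  /** "primaryCell@startTs" -> commit timestamp, written when the primary commits */
  final Map<String, Long> commitRecords = new HashMap<>();

  /** Called by a reader that finds a lock on the cell it wants to read. */
  Outcome resolve(String cell, boolean ownerAlive) {
    Lock lock = locks.get(cell);
    if (lock == null) {
      return Outcome.NO_LOCK;
    }
    String txKey = lock.primaryCell + "@" + lock.startTs;
    if (commitRecords.containsKey(txKey)) {
      // The primary committed, so the transaction is committed as a whole:
      // finish the commit for this cell by releasing its lock.
      locks.remove(cell);
      return Outcome.ROLLED_FORWARD;
    }
    Lock primaryLock = locks.get(lock.primaryCell);
    boolean primaryHeld = primaryLock != null && primaryLock.startTs == lock.startTs;
    if (primaryHeld && ownerAlive) {
      return Outcome.STILL_RUNNING; // leave a live transaction alone
    }
    if (primaryHeld) {
      // The owner is gone: abort by removing the primary lock first, so the
      // transaction can never commit afterwards.
      locks.remove(lock.primaryCell);
    }
    // No commit record and no primary lock means the transaction was
    // rolled back, so this lock is stale.
    locks.remove(cell);
    return Outcome.ROLLED_BACK;
  }
}
```

In this model the status of a transaction is decided entirely by its
primary, which is why a reader that cannot classify the primary has nothing
to fall back on.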

>
> 2. Other transactions notice a dangling or dead transaction. If a JVM goes
> away, how do I go about resolving/rolling back all the locks that the dead
> transaction held? We clearly halt when we can't find the primary, but we
> need to go through and resolve all the locks that are pointing to that
> primary. Would this require a full table scan of the underlying Accumulo
> table?

Yeah, I think it would require a full table scan to find all of the
locks pointing to the primary to resolve them.
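
A rough sketch of why this needs a full scan, assuming (as the answer above
implies) that each lock only stores a forward pointer to its primary and
there is no reverse index from a primary to its other locks. The map stands
in for the table and all names are hypothetical; this does not use the
Accumulo or Fluo APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; models the table as a map of locked cell -> primary cell. */
final class DanglingLockScan {
  private DanglingLockScan() {}

  /**
   * Returns every locked cell whose lock points at the given primary.
   * Every entry has to be examined, which is the in-memory equivalent of
   * a full table scan.
   */
  static List<String> findLocksPointingTo(Map<String, String> lockToPrimary, String primaryCell) {
    List<String> matches = new ArrayList<>();
    for (Map.Entry<String, String> entry : lockToPrimary.entrySet()) {
      if (entry.getValue().equals(primaryCell)) {
        matches.add(entry.getKey());
      }
    }
    return matches;
  }
}
```

The cost grows with the total number of entries rather than with the number
of locks the dead transaction held, which is what makes this approach
expensive on a large table.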

>
> Part of our design may be an issue in that certain pieces of transactions
> seem to update the same portions of a table (we keep a per-partition count
> around), which could exacerbate this issue.
>
> Any advice is appreciated!

I am going to look at the resolve lock code and see if anything jumps
out at me re the UNKNOWN state.  If I don't see anything I am going to
try to reproduce the problem by running the stress test with your
modification to DeleteLocksStep.

>
> Thanks,
> Bill

Re: Partial Transactions

Posted by Keith Turner <ke...@deenlo.com>.
I opened a PR with a fix.  Good to hear you are using empty column
families, because otherwise I would be back to square one.  Lock
recovery reliably fails with empty column families.

https://github.com/apache/fluo/pull/1123

On Tue, Nov 15, 2022 at 11:46 PM Bill Slacum <ws...@gmail.com> wrote:
>
> Wowza thank you for that effort, Keith! If it's related to empty column
> families, well, we are definitely using empty column families. I'm guessing
> that's why my patch with my app got me to the issue relatively quickly (and
> others, after I accidentally published a fork to our maven repo).

Re: Partial Transactions

Posted by Bill Slacum <ws...@gmail.com>.
Wowza thank you for that effort, Keith! If it's related to empty column
families, well, we are definitely using empty column families. I'm guessing
that's why my patch with my app got me to the issue relatively quickly (and
others, after I accidentally published a fork to our maven repo).
