Posted to user@accumulo.apache.org by Scott Kirklin <sc...@gmail.com> on 2022/06/08 18:39:53 UTC

iterator state persistence in 2.0.1

Hello,

I am trying to do graph traversal with a custom Iterator. Simplifying a
bit, a “node” is a unique row id and edges are represented as an entry
where the Key.row is the source node and the Key.colQualifier is the target
node. The custom iterator maintains a stack and uses a subordinate iterator
to traverse following these edges. For small graphs this works exactly as
hoped, but once the graph becomes large enough to fill a scan batch the
iterator is torn down and when re-init’ed the stack is gone, so I can’t
resume from where it left off. The docs say that "Being torn-down
is equivalent to a new instance of the Iterator being creating and deepCopy
being called on the new instance with the old instance provided as the
argument to deepCopy". I thought that meant that I could carry state
through the life of the traversal, at least as long as the iterator stays
on a single TServer and deepCopy copies the right data, but I cannot find
evidence that this actually happens in the code or by tracing.
IterConfigUtil looks like it is responsible for re-creating the iterator
when resuming a scan, and it only calls ‘init’.
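
To make the layout and the failure mode concrete, here is a minimal
sketch in plain Java. It does not use the Accumulo API: a sorted map
stands in for the table (row = source node, column qualifier = target
node), the class and method names are hypothetical, and the graph is
assumed to be acyclic. The point is only that the stack is an ordinary
field, so a freshly constructed instance starts with none of it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for a stack-based traversal iterator (not the Accumulo API). */
class StackTraversalSketch {
  // Stand-in for the table: source node (row) -> sorted target nodes (column qualifiers).
  private final Map<String, NavigableSet<String>> table;
  // In-memory traversal state; it exists only as long as this instance does.
  private final Deque<String> stack = new ArrayDeque<>();

  StackTraversalSketch(Map<String, NavigableSet<String>> table) {
    this.table = table;
  }

  /** Starts a traversal at the given node. */
  void seek(String start) {
    stack.clear();
    stack.push(start);
  }

  /** Returns the next node in depth-first order, or null when the stack is empty. */
  String next() {
    if (stack.isEmpty()) {
      return null;
    }
    String node = stack.pop();
    // Push targets in reverse order so the smallest target is visited first.
    for (String target : table.getOrDefault(node, new TreeSet<>()).descendingSet()) {
      stack.push(target);
    }
    return node;
  }

  public static void main(String[] args) {
    Map<String, NavigableSet<String>> table = new TreeMap<>();
    table.computeIfAbsent("a", k -> new TreeSet<>()).add("b");
    table.computeIfAbsent("a", k -> new TreeSet<>()).add("c");
    table.computeIfAbsent("b", k -> new TreeSet<>()).add("d");

    StackTraversalSketch first = new StackTraversalSketch(table);
    first.seek("a");
    System.out.println(first.next());
    System.out.println(first.next());

    // A re-created instance sees the same table but has an empty stack.
    StackTraversalSketch recreated = new StackTraversalSketch(table);
    System.out.println(recreated.next());
  }
}
```

In this sketch the re-created instance has nothing to resume from,
which is the behaviour I see once a scan batch fills.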

Now, my actual question: Is there a supported way to maintain internal
state throughout the lifetime of an Iterator? Is my approach at all
sensible?

I can of course do all of this from the client, but that would perform
much worse for many users. Much of our usage comes from users who connect
over high-latency connections through the Thrift proxy, which would make a
client-side solution very slow. I am therefore motivated to find a
server-side solution, but am not married to any particular pattern. Totally
changing the key design is on the table as well, as this effort is still
somewhat greenfield.

Thanks in advance,
Scott

Re: iterator state persistence in 2.0.1

Posted by Scott Kirklin <sc...@gmail.com>.
Thanks for the quick confirmation! I'll need to send back a bit more data
than I originally planned, but your suggestion of encoding progress in the
keys should work quite nicely.

Best wishes as you push for 2.1.0!

On Wed, Jun 8, 2022 at 12:42 PM Christopher <ct...@apache.org> wrote:

> I'm not sure why the docs describe it that way. It certainly doesn't
> appear to match the code. deepCopy doesn't accept the old instance as
> an argument... it gets the iterator environment, which does not
> contain the previous iterator. There is some strange wording in that
> doc. It says "being creating" also, implying there's some serious
> grammar being mangled in this portion of the docs. I'm not sure what
> it was trying to say, but I don't think we have any guarantees
> regarding whether an iterator is torn down or not between scan session
> batches.
>
> I don't think you can rely on the statefulness of an iterator across
> scan session batches, but you may be able to encode information in the
> keys that are emitted so it knows how to skip paths in the graph when
> the newly constructed iterator seeks to the resume point.

Re: iterator state persistence in 2.0.1

Posted by Christopher <ct...@apache.org>.
On Wed, Jun 8, 2022 at 2:40 PM Scott Kirklin <sc...@gmail.com> wrote:
>
> Hello,
>
> I am trying to do graph traversal with a custom Iterator. Simplifying a bit, a “node” is a unique row id and edges are represented as an entry where the Key.row is the source node and the Key.colQualifier is the target node. The custom iterator maintains a stack and uses a subordinate iterator to traverse following these edges. For small graphs this works exactly as hoped, but once the graph becomes large enough to fill a scan batch the iterator is torn down and when re-init’ed the stack is gone, so I can’t resume from where it left off. From the docs it says that "Being torn-down is equivalent to a new instance of the Iterator being creating and deepCopy being called on the new instance with the old instance provided as the argument to deepCopy". I thought that meant that I could carry state through the life of the traversal, at least as long as the iterator stays on a single TServer and deepCopy copies the right data, but I cannot find evidence that this actually happens in the code or by tracing. IterConfigUtil looks like it is responsible for re-creating the iterator when resuming a scan, and it only calls ‘init’.

I'm not sure why the docs describe it that way. It certainly doesn't
appear to match the code. deepCopy doesn't accept the old instance as
an argument... it gets the iterator environment, which does not
contain the previous iterator. There is some strange wording in that
doc. It says "being creating" also, implying there's some serious
grammar being mangled in this portion of the docs. I'm not sure what
it was trying to say, but I don't think we have any guarantees
regarding whether an iterator is torn down or not between scan session
batches.


>
> Now, my actual question: Is there a supported way to maintain internal state throughout the lifetime of an Iterator? Is my approach at all sensible?

I don't think you can rely on the statefulness of an iterator across
scan session batches, but you may be able to encode information in the
keys that are emitted so it knows how to skip paths in the graph when
the newly constructed iterator seeks to the resume point.
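
One way this could look, as a rough sketch in plain Java rather than
against the Accumulo API: each emitted key carries the full path from
the start node to the current node, so the next path can be computed
from the last emitted key alone. The class name, the "/" separator,
and the scanBatch helper are all assumptions made for illustration,
and node names are assumed not to contain "/". Nodes already on the
current path are skipped to avoid cycles, but a node reachable by
several paths is emitted once per path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: depth-first traversal resumable from the last emitted key alone. */
class ResumableTraversalSketch {
  // Stand-in for the table: source node (row) -> sorted target nodes (column qualifiers).
  final Map<String, NavigableSet<String>> edges = new TreeMap<>();

  void addEdge(String source, String target) {
    edges.computeIfAbsent(source, k -> new TreeSet<>()).add(target);
  }

  /** Encodes the whole path as one string, as it might be stored in an emitted key. */
  static String encode(List<String> path) {
    return String.join("/", path);
  }

  static List<String> decode(String encoded) {
    return new ArrayList<>(Arrays.asList(encoded.split("/")));
  }

  /** Returns the path that follows `path` in pre-order, or null when nothing is left. */
  List<String> next(List<String> path) {
    // Descend: first target of the current node that is not already on the path.
    String child = firstNotOnPath(path.get(path.size() - 1), null, path);
    if (child != null) {
      List<String> deeper = new ArrayList<>(path);
      deeper.add(child);
      return deeper;
    }
    // Backtrack: find the nearest ancestor that still has a later branch to follow.
    for (int i = path.size() - 1; i >= 1; i--) {
      List<String> prefix = new ArrayList<>(path.subList(0, i));
      String sibling = firstNotOnPath(path.get(i - 1), path.get(i), prefix);
      if (sibling != null) {
        prefix.add(sibling);
        return prefix;
      }
    }
    return null;
  }

  /** First target of `parent` sorted strictly after `after` (or any, if null) and not on the path. */
  private String firstNotOnPath(String parent, String after, List<String> onPath) {
    NavigableSet<String> targets = edges.getOrDefault(parent, new TreeSet<>());
    for (String t : (after == null ? targets : targets.tailSet(after, false))) {
      if (!onPath.contains(t)) {
        return t;
      }
    }
    return null;
  }

  /** Emits up to batchSize encoded paths after resumeKey; no other state is carried over. */
  List<String> scanBatch(String resumeKey, int batchSize) {
    List<String> out = new ArrayList<>();
    List<String> path = next(decode(resumeKey));
    while (path != null && out.size() < batchSize) {
      out.add(encode(path));
      path = next(path);
    }
    return out;
  }

  public static void main(String[] args) {
    ResumableTraversalSketch g = new ResumableTraversalSketch();
    g.addEdge("a", "b");
    g.addEdge("a", "c");
    g.addEdge("b", "d");
    g.addEdge("c", "d");
    List<String> firstBatch = g.scanBatch("a", 2);
    System.out.println(firstBatch);
    // Resume using only the last key of the previous batch.
    System.out.println(g.scanBatch(firstBatch.get(firstBatch.size() - 1), 2));
  }
}
```

The cost is that every emitted key grows with the depth of the
traversal, so more data goes back to the client than with an
in-memory stack.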

>
> I am able to accomplish what I want 100% from the client as well of course, but that will have much worse performance for many users. A lot of usage happens by users who connect (over high latency connections) through the thrift proxy, which will make a client side solution very non-performant, so I am motivated to figure out a server-side solution, but am not married to any particular pattern. Totally changing the key design is on the table as well, as this effort is still somewhat greenfield.
>
> Thanks in advance,
> Scott