Posted to dev@accumulo.apache.org by Adam Fuchs <af...@apache.org> on 2013/04/15 21:00:55 UTC

multi-table isolated batch scanner

Is anyone else pining for a multi-table isolated batch scanner, or is it
just me? I like the automatic parallelism and balancing of the batch
scanner, but I'm looking to maintain server-side state in my iterators over
long-running scans. I would also like to scan over multiple tables
concurrently. Has anyone tried hacking something together with a pool of
non-batch scanners?

Adam
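
One way to picture the "pool of non-batch scanners" idea is sketched below. This is only an illustration under stated assumptions, not Accumulo code: each scan is modeled as a plain Iterable (a stand-in for a per-table scanner), and the class name ScannerPoolSketch, the method scanAll, and the queueCapacity parameter are hypothetical. Every source gets its own worker thread, and results are handed back in first-available order through a bounded queue, so a slow consumer applies backpressure instead of growing client memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: drain several independent scans in parallel. */
final class ScannerPoolSketch {

    // Sentinel marking the end of one source; never returned to the caller.
    private static final Object DONE = new Object();

    /**
     * Runs each source on its own worker thread and collects every non-null
     * entry in first-available order. Order is preserved within one source,
     * but nothing is promised across sources.
     */
    static <T> List<T> scanAll(List<? extends Iterable<T>> sources, int queueCapacity) {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>(queueCapacity);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        List<T> results = new ArrayList<>();
        try {
            for (Iterable<T> source : sources) {
                pool.submit(() -> {
                    try {
                        for (T entry : source) {
                            queue.put(entry); // blocks while the queue is full
                        }
                    } finally {
                        queue.put(DONE); // tell the consumer this source is finished
                    }
                    return null; // makes this a Callable, so put() may throw
                });
            }
            int remaining = sources.size();
            while (remaining > 0) {
                Object next = queue.take();
                if (next == DONE) {
                    remaining--;
                } else {
                    @SuppressWarnings("unchecked")
                    T entry = (T) next;
                    results.add(entry);
                }
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while merging scans", e);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> tableA = List.of("a1", "a2", "a3"); // stand-in for a scan of one table
        List<String> tableB = List.of("b1", "b2");       // stand-in for a scan of another
        System.out.println(scanAll(List.of(tableA, tableB), 2));
    }
}
```

The interleaving between the two sources varies from run to run. The sketch says nothing about keeping iterator state alive on the server, which is the harder half of the request.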

Re: multi-table isolated batch scanner

Posted by Adam Fuchs <af...@apache.org>.
Hey Dave,

You are absolutely right that the server-side space usage is a big concern.
One of the ways that we conserve space on the server is to do multiple
passes in which we scan a bunch of data and keep only part of it. This
gives us a trade-off between memory usage and CPU time. The space that we
use on the server gets amplified by a secondary lookup before data gets
back to the client. We've found that the optimal amount of memory to use on
the server is much larger than what can be processed before the scan buffer
fills, although it's still only on the order of a couple of megabytes.

Thanks for the suggestions, everyone. Keep 'em coming! I'm going to try out
a couple of prototypes and see where they get me.

Cheers,
Adam
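
The memory-for-CPU trade-off of multiple passes can be illustrated with a small sketch. The thread does not say how the data is split between passes, so the hash partitioning used here is an assumption, and the names MultiPassSketch, intersect, and passes are hypothetical. Each pass re-reads both inputs but keeps only the IDs that fall in its partition, so the buffered set is roughly 1/passes of the full set while the scanning work grows by a factor of passes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: intersect two ID lists in several bounded-memory passes. */
final class MultiPassSketch {

    /**
     * Returns each ID that occurs in both inputs exactly once.
     * Assumption: IDs are split across passes by hash code (passes >= 1).
     */
    static List<String> intersect(List<String> left, List<String> right, int passes) {
        List<String> out = new ArrayList<>();
        for (int pass = 0; pass < passes; pass++) {
            // Buffer only the slice of the left input that belongs to this pass.
            Set<String> kept = new HashSet<>();
            for (String id : left) {
                if (Math.floorMod(id.hashCode(), passes) == pass) {
                    kept.add(id);
                }
            }
            // Re-scan the right input; removing on match emits each ID once.
            for (String id : right) {
                if (kept.remove(id)) {
                    out.add(id);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = List.of("doc7", "doc3", "doc9");
        List<String> right = List.of("doc3", "doc5", "doc7");
        System.out.println(intersect(left, right, 3));
    }
}
```

With passes set to 1 this degenerates to a single in-memory intersection. Larger values shrink the buffered set at the price of re-reading the inputs.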




RE: multi-table isolated batch scanner

Posted by Dave Marion <dl...@comcast.net>.
---> I have found that increasing the buffer size also increases the latency
for getting the first results.

  We have found that to be true also; we do the opposite to get to the first
result faster. Of course, we are not performing a local sort first.

---> increasing the batch size too much puts significant memory requirements
on the process running the batch scanner

  Pushing the problem from the client to the server increases the
complexity. I would be concerned with multiple concurrent scans that are
saving state. The server-side state will compete for tserver application
memory. I would assume that you would have to build some feature to restrict
the amount of memory that the state can consume.



Re: multi-table isolated batch scanner

Posted by Adam Fuchs <af...@apache.org>.
Keith,

I'm essentially performing a local sort-merge inner join (local meaning
within one tablet at a time). This is similar to what the intersecting
iterator does, but I'm doing it on data that is not already sorted.
Applications of this include any kind of indexed boolean logic predicate
search in which one or more of the predicates cannot be tied to an exact
term, but can be tied to a range of terms. Wildcard searches, multi-hop
graph searches, and range queries all fit this pattern.

After the join there is additional processing done on the results. I would
like to keep some of the intermediate data buffered, like the sorted set of
IDs that match a predicate, even though these are not directly part of the
output data. The multiple passes are needed for when the intermediate data
is too big to hold in memory (while maintaining concurrency, etc.), but
they don't really complicate the underlying problem.

Adam
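
The local sort-merge inner join described above can be sketched as follows. This is a minimal illustration rather than the actual iterator code: it assumes the IDs matching each predicate fit in memory as plain strings, and the names LocalSortMergeJoinSketch and join are hypothetical. Each input arrives unsorted (as it would when a wildcard or range predicate expands to many terms), is sorted locally, and is then merged with two cursors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: sort two unsorted ID lists, then merge-join them. */
final class LocalSortMergeJoinSketch {

    /** Returns the distinct IDs present in both inputs, in ascending order. */
    static List<String> join(List<String> left, List<String> right) {
        String[] l = left.toArray(new String[0]);
        String[] r = right.toArray(new String[0]);
        Arrays.sort(l); // the "local sort" step; nothing can be emitted before it
        Arrays.sort(r);

        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.length && j < r.length) {
            int cmp = l[i].compareTo(r[j]);
            if (cmp < 0) {
                i++; // left ID is smaller, so it cannot match anything further right
            } else if (cmp > 0) {
                j++;
            } else {
                String match = l[i];
                out.add(match);
                // Skip duplicates on both sides so each ID is reported once.
                while (i < l.length && l[i].equals(match)) {
                    i++;
                }
                while (j < r.length && r[j].equals(match)) {
                    j++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> wildcardHits = List.of("doc7", "doc3", "doc9", "doc3");
        List<String> rangeHits = List.of("doc3", "doc5", "doc7");
        System.out.println(join(wildcardHits, rangeHits)); // prints [doc3, doc7]
    }
}
```

The sorted arrays are the kind of intermediate data that would be worth keeping buffered between batches, since rebuilding them means repeating the sort.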




Re: multi-table isolated batch scanner

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Apr 15, 2013 at 6:19 PM, Adam Fuchs <af...@apache.org> wrote:
> Keith,
>
> In this case we're filling the buffer before we can amortize the search
> cost. We're using a document-partitioned table design and we have to do a
> local sort before we can get the first result.
>

I am not sure exactly what you are doing.  To me it sounds like you
may be doing the following, but not sure.

 * Seeking N iterators
 * Doing what the intersecting iterator does, joining docids or
seeking iterators
 * Collecting a set of key values (docids or documents?) and sorting
them?  How much is collected before sort?  Why sort?  Is filtering
done after the sort?

Or did you mean something else by 'local sort', like a sort on the
client side?  But this does not seem to be the entire story, as you mentioned
something about multiple passes.  Are you hash partitioning the data?
Are you building something in memory besides the buffered output
data?


> I have found that increasing the buffer size also increases the latency for
> getting the first results. This application is both latency and throughput
> sensitive. In addition, increasing the batch size too much puts significant
> memory requirements on the process running the batch scanner.
>
> Adam
>
>
>

Re: multi-table isolated batch scanner

Posted by Adam Fuchs <af...@apache.org>.
Keith,

In this case we're filling the buffer before we can amortize the search
cost. We're using a document-partitioned table design and we have to do a
local sort before we can get the first result.

I have found that increasing the buffer size also increases the latency for
getting the first results. This application is both latency and throughput
sensitive. In addition, increasing the batch size too much puts significant
memory requirements on the process running the batch scanner.

Adam




Re: multi-table isolated batch scanner

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Apr 15, 2013 at 5:06 PM, Adam Fuchs <af...@apache.org> wrote:
> Chris,
>
> The desire for isolation stems from the desire to amortize some computation
> over a number of results. Say it takes 5 seconds to compute an intersection

Would increasing the size of the key/value buffer help in your case?
The iterator stack is not torn down until that buffer fills up or the
end of tablet is reached.  Are you concerned about the cost of
reconstructing the iterator stack across tablets?

> of a couple of sets within the iterators, and then streaming back the
> results takes a minute or so. If I have to redo the 5 second computation
> many times, as in to support the reconstruction of the iterator tree, then
> that computation may start to dominate my query performance. Primarily,
> this means I need to be able to continue a scan without having to rebuild
> the iterators. Isolation in the scanner has that side effect. Proper
> isolation would be a "nice-to-have", but I can deal with not having it.
>
> Adam
>
>
>

Re: multi-table isolated batch scanner

Posted by Adam Fuchs <af...@apache.org>.
Chris,

The desire for isolation stems from the desire to amortize some computation
over a number of results. Say it takes 5 seconds to compute an intersection
of a couple of sets within the iterators, and then streaming back the
results takes a minute or so. If I have to redo the 5-second computation
many times, for example because the iterator tree has to be reconstructed, then
that computation may start to dominate my query performance. Primarily,
this means I need to be able to continue a scan without having to rebuild
the iterators. Isolation in the scanner has that side effect. Proper
isolation would be a "nice-to-have", but I can deal with not having it.

Adam
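
To put rough numbers on the amortization argument, the sketch below uses the 5-second setup and one-minute streaming figures from the message. Everything else is an assumption made for illustration: the model charges one rebuild of the iterator tree each time a result buffer fills and ignores tablet boundaries, and the names RebuildCostSketch, rebuildsFor, and totalSeconds as well as the byte counts are hypothetical.

```java
/** Hypothetical cost model for rebuilding iterator state between batches. */
final class RebuildCostSketch {

    /** Rebuilds needed if the stack is torn down each time the buffer fills. */
    static long rebuildsFor(long resultBytes, long bufferBytes) {
        long batches = (resultBytes + bufferBytes - 1) / bufferBytes; // ceiling division
        return Math.max(0, batches - 1); // the first build is not a rebuild
    }

    /** Total time: streaming plus one initial setup plus one setup per rebuild. */
    static double totalSeconds(double setupSeconds, double streamSeconds, long rebuilds) {
        return streamSeconds + setupSeconds * (1 + rebuilds);
    }

    public static void main(String[] args) {
        long rebuilds = rebuildsFor(12_000_000L, 1_000_000L); // 12 batches, 11 rebuilds
        System.out.println(totalSeconds(5.0, 60.0, 0));        // state kept: 65.0
        System.out.println(totalSeconds(5.0, 60.0, rebuilds)); // state rebuilt: 120.0
    }
}
```

Under these assumed sizes the setup cost grows from 5 of 65 seconds to 60 of 120 seconds, which is the sense in which the computation can start to dominate the query.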




Re: multi-table isolated batch scanner

Posted by Christopher <ct...@apache.org>.
Adam-

It seems like you're talking about two features at once:
1) Multi-table batch scanner.
2) Scan Isolation on batch scanners like we have on regular scanners.
Is that correct?

I can see the utility of a multi-table batch scanner, but I haven't
seen a compelling need for implementing isolation on the
batch-scanners. Do you have a use case in mind for that?

Also, it seems that your use case for isolation is not so much the
isolated reads, but the statefulness of the iterator stack on the
server side. Is this correct? If so, I'm even more curious about your
use case for this, since that statefulness is only guaranteed per-row.


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Mon, Apr 15, 2013 at 3:10 PM, Adam Fuchs <af...@apache.org> wrote:
> Thanks Bill,
>
> I care about latency and throughput. First available result ordering is
> fine, though.
>
> Does Guava just chain through a collection of iterators, completing one
> then moving to the next?
>
> Adam
>
>
>
> On Mon, Apr 15, 2013 at 3:06 PM, William Slacum <
> wilhelm.von.cloud@accumulo.net> wrote:
>
>> How are you expecting to get results back? Guava's Iterables could concat a
>> bunch of Scanners together, if you didn't care about the throughput
>> aspect of it and simply wanted results from multiple tables.
>>
>> On Mon, Apr 15, 2013 at 3:00 PM, Adam Fuchs <af...@apache.org> wrote:
>>
>> > Is anyone else pining for a multi-table isolated batch scanner, or is it
>> > just me? I like the automatic parallelism and balancing of the batch
>> > scanner, but I'm looking to maintain server-side state in my iterators
>> > over long-running scans. I would also like to scan over multiple tables
>> > concurrently. Has anyone tried hacking something together with a pool of
>> > non-batch scanners?
>> >
>> > Adam
>> >
>>

Re: multi-table isolated batch scanner

Posted by Adam Fuchs <af...@apache.org>.
Thanks Bill,

I care about latency and throughput. First available result ordering is
fine, though.

Does Guava just chain through a collection of iterators, completing one
then moving to the next?

Adam
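
For reference, a rough sketch of the "pool of non-batch scanners" idea with
first-available result ordering. This is only an illustration under stated
assumptions: it uses nothing but the JDK, plain Iterables stand in for
Accumulo Scanners (one per table or range), and the names ScannerPoolSketch
and scanAll are hypothetical, not part of any Accumulo API. Each source is
drained on its own thread into a bounded queue, and the caller sees entries
in whatever order they arrive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class ScannerPoolSketch {
    /**
     * Drains every source on its own thread and hands items to the sink in
     * arrival order. Order is preserved within a source, not across sources.
     * Assumption: sources never yield null (the queue rejects nulls).
     */
    public static <T> void scanAll(List<? extends Iterable<T>> sources,
                                   int queueCapacity,
                                   Consumer<? super T> sink) {
        BlockingQueue<T> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger running = new AtomicInteger(sources.size());
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            for (Iterable<T> source : sources) {
                pool.execute(() -> {
                    try {
                        // put() blocks while the queue is full, so a slow
                        // consumer throttles the producers.
                        for (T item : source) queue.put(item);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet(); // also runs if the source throws
                    }
                });
            }
            while (true) {
                T item = queue.poll(10, TimeUnit.MILLISECONDS);
                if (item != null) {
                    sink.accept(item);
                } else if (running.get() == 0 && queue.isEmpty()) {
                    break; // every producer finished and nothing is left queued
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop early, keep the flag set
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<String> tableA = Arrays.asList("a1", "a2", "a3");
        List<String> tableB = Arrays.asList("b1", "b2");
        scanAll(Arrays.asList(tableA, tableB), 2, System.out::println);
    }
}
```

The bounded queue is the design choice that matters here: it caps client-side
memory, at the cost of stalling producers when the consumer is slow. A failure
in one source is swallowed in this sketch; a real version would need to
propagate it to the caller.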



On Mon, Apr 15, 2013 at 3:06 PM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> How are you expecting to get results back? Guava's Iterables could concat a
> bunch of Scanners together, if you didn't care about the throughput
> aspect of it and simply wanted results from multiple tables.
>
> On Mon, Apr 15, 2013 at 3:00 PM, Adam Fuchs <af...@apache.org> wrote:
>
> > Is anyone else pining for a multi-table isolated batch scanner, or is it
> > just me? I like the automatic parallelism and balancing of the batch
> > scanner, but I'm looking to maintain server-side state in my iterators
> > over long-running scans. I would also like to scan over multiple tables
> > concurrently. Has anyone tried hacking something together with a pool of
> > non-batch scanners?
> >
> > Adam
> >
>

Re: multi-table isolated batch scanner

Posted by William Slacum <wi...@accumulo.net>.
How are you expecting to get results back? Guava's Iterables could concat a
bunch of Scanners together, if you didn't care about the throughput
aspect of it and simply wanted results from multiple tables.
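
To make the concat idea concrete, here is a minimal sketch of lazy,
sequential chaining, which is roughly the behavior you would get from
concatenating Iterables. It is written against the JDK only so it runs
without dependencies; the class name ConcatSketch and its concat helper are
hypothetical, and the lists stand in for Scanners (which are Iterable over
their entries). One source is fully drained before the next is opened, so
there is no parallelism across tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ConcatSketch {
    /** Lazily chains the sources: the next source is opened only once the current one is exhausted. */
    public static <T> Iterable<T> concat(List<? extends Iterable<? extends T>> sources) {
        return () -> new Iterator<T>() {
            private final Iterator<? extends Iterable<? extends T>> outer = sources.iterator();
            private Iterator<? extends T> current = null;

            @Override
            public boolean hasNext() {
                // Advance past empty or exhausted sources.
                while ((current == null || !current.hasNext()) && outer.hasNext()) {
                    current = outer.next().iterator();
                }
                return current != null && current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<String> tableA = Arrays.asList("a1", "a2"); // stand-in for a Scanner on one table
        List<String> tableB = Arrays.asList("b1");       // stand-in for a Scanner on another
        List<String> out = new ArrayList<>();
        for (String entry : concat(Arrays.asList(tableA, tableB))) out.add(entry);
        System.out.println(out);
    }
}
```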

On Mon, Apr 15, 2013 at 3:00 PM, Adam Fuchs <af...@apache.org> wrote:

> Is anyone else pining for a multi-table isolated batch scanner, or is it
> just me? I like the automatic parallelism and balancing of the batch
> scanner, but I'm looking to maintain server-side state in my iterators over
> long-running scans. I would also like to scan over multiple tables
> concurrently. Has anyone tried hacking something together with a pool of
> non-batch scanners?
>
> Adam
>

Re: multi-table isolated batch scanner

Posted by Keith Turner <ke...@deenlo.com>.
The current Thrift RPCs may support multiple tables. The RPCs take
key extents. The only caveat is that the same options (columns,
iterators, auths) would have to be used for all tables.
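
A simplified model of that caveat, as a sketch only: the classes below
(MultiTableBatchSketch, Extent, BatchRequest) are hypothetical stand-ins, not
the actual Thrift types. It assumes that a key extent carries a table id, so
one request can list extents from several tables, while the columns,
iterators, and auths are given once for the whole request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MultiTableBatchSketch {
    /** Hypothetical stand-in for a key extent: a table id plus one tablet's row span. */
    public static final class Extent {
        public final String tableId;
        public final String prevEndRow; // assumed exclusive lower bound; null = start of table
        public final String endRow;     // assumed inclusive upper bound; null = end of table

        public Extent(String tableId, String prevEndRow, String endRow) {
            this.tableId = tableId;
            this.prevEndRow = prevEndRow;
            this.endRow = endRow;
        }
    }

    /** Hypothetical request: many extents, but only ONE set of scan options. */
    public static final class BatchRequest {
        public final List<Extent> extents = new ArrayList<>();
        public final List<String> columns;
        public final List<String> iterators;
        public final Set<String> auths;

        public BatchRequest(List<String> columns, List<String> iterators, Set<String> auths) {
            this.columns = columns;
            this.iterators = iterators;
            this.auths = auths;
        }
    }

    /** Tables touched by a request; more than one means a multi-table batch. */
    public static Set<String> tablesIn(BatchRequest request) {
        Set<String> tables = new TreeSet<>();
        for (Extent e : request.extents) tables.add(e.tableId);
        return tables;
    }

    public static void main(String[] args) {
        BatchRequest request = new BatchRequest(
                Arrays.asList("cf"),
                Arrays.asList("myIterator"),
                new HashSet<>(Arrays.asList("public")));
        request.extents.add(new Extent("t1", null, "m")); // a tablet of table t1
        request.extents.add(new Extent("t2", "m", null)); // a tablet of table t2
        System.out.println(tablesIn(request));
    }
}
```

Because the options live on the request rather than on each extent, per-table
iterator configuration would need a different request shape.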

On Mon, Apr 15, 2013 at 3:00 PM, Adam Fuchs <af...@apache.org> wrote:
> Is anyone else pining for a multi-table isolated batch scanner, or is it
> just me? I like the automatic parallelism and balancing of the batch
> scanner, but I'm looking to maintain server-side state in my iterators over
> long-running scans. I would also like to scan over multiple tables
> concurrently. Has anyone tried hacking something together with a pool of
> non-batch scanners?
>
> Adam