Posted to user@accumulo.apache.org by "Slater, David M." <Da...@jhuapl.edu> on 2013/01/23 19:24:36 UTC

BatchScanning with a very large collection of ranges

First, thanks to everyone for their responses to my previous questions. (Mike, I'll definitely take a look at Brian's materials for iterator behavior.)

Now I'm doing some sharded document querying (the documents are small but numerous), where I'm trying to get not just the list of documents but also to return all of them (they are also stored in Accumulo). However, I'm running into a bottleneck in the retrieval process. The BatchScanner seems to be quite slow at retrieving information when there is a very large number of small ranges (one per doc), and increasing the thread count doesn't seem to help.

Basically, I'm taking all of the docIDs that are returned from the index process, making a new Range(docID) for each, adding that to a Collection<Range>, and then passing those ranges to a new BatchScanner to return the information:

...
Collection<Range> docRanges = new LinkedList<Range>();
for (Map.Entry<Key, Value> entry : indexScanner) { // Go through index table here
    Text docID = entry.getKey().getColumnQualifier();
    docRanges.add(new Range(docID));
}

int threadCount = 20;
String docTableName = "docTable";
BatchScanner docScanner = connector.createBatchScanner(docTableName, new Authorizations(), threadCount);
docScanner.setRanges(docRanges); // large collection of ranges

for (Map.Entry<Key, Value> doc : docScanner) { // retrieve doc data
    ...
}
...

Is this a naïve way of doing this? Would trying to group documents into larger ranges (when adjacent) be a more viable approach?

Thanks,
David

Re: BatchScanning with a very large collection of ranges

Posted by John Stoneham <ly...@lyrically.net>.
You also have some other options. One would be to use an IteratorChain to
string together the results of several BatchScanners, which you could kick
off in parallel to batch up your reads.
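
Roughly, assuming the generic IteratorChain from Apache Commons Collections 4 and that each BatchScanner starts fetching once you grab its iterator (worth verifying for your version), something like this; the chunk size, helper name, and the list used to collect scanners for closing are arbitrary:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.commons.collections4.iterators.IteratorChain;

public class ChainedDocScan {
    // Split docRanges into chunks, give each chunk its own BatchScanner, and chain the
    // resulting iterators back into a single stream of document key/values.
    public static Iterator<Map.Entry<Key, Value>> scanInChunks(Connector connector, String docTableName,
            List<Range> docRanges, int chunkSize, int threadsPerScanner,
            List<BatchScanner> scannersToClose) throws TableNotFoundException {
        IteratorChain<Map.Entry<Key, Value>> chain = new IteratorChain<Map.Entry<Key, Value>>();
        for (int i = 0; i < docRanges.size(); i += chunkSize) {
            BatchScanner bs = connector.createBatchScanner(docTableName, new Authorizations(), threadsPerScanner);
            bs.setRanges(docRanges.subList(i, Math.min(i + chunkSize, docRanges.size())));
            chain.addIterator(bs.iterator()); // fetching for this chunk begins here
            scannersToClose.add(bs);          // caller closes these after draining the chain
        }
        return chain;
    }
}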

Or, writing this in a sequence model: use the
Iterator<Map.Entry<Key,Value>> from the indexScanner to feed an
Iterator<Map.Entry<Key,Value>> of your own creation that produces document
key/values. As you request document key/values with next(), it prefetches a
number of index key/values, runs a batch scan, and queues the results for
you. When it runs out of document results, it repeats. This model has been
successful for us when hitting a term index to pull millions of source
records without loading them all into client memory at the same time.
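
A rough sketch of what that wrapper might look like; the class name, batch size, and error handling are just placeholders, and it assumes the doc table rows are keyed by docID as in your snippet:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

// Wraps the index scanner's iterator: each time the current batch of documents runs out,
// it pulls the next batchSize docIDs from the index and batch-scans the doc table for them.
public class DocFetchIterator implements Iterator<Map.Entry<Key, Value>> {
    private final Iterator<Map.Entry<Key, Value>> indexIter;
    private final Connector connector;
    private final String docTableName;
    private final int batchSize;
    private final int threadCount;
    private BatchScanner currentScanner;
    private Iterator<Map.Entry<Key, Value>> currentDocs =
            Collections.<Map.Entry<Key, Value>>emptyList().iterator();

    public DocFetchIterator(Iterator<Map.Entry<Key, Value>> indexIter, Connector connector,
            String docTableName, int batchSize, int threadCount) {
        this.indexIter = indexIter;
        this.connector = connector;
        this.docTableName = docTableName;
        this.batchSize = batchSize;
        this.threadCount = threadCount;
    }

    @Override
    public boolean hasNext() {
        // Refill from the index until we have document results or the index is exhausted.
        while (!currentDocs.hasNext() && indexIter.hasNext()) {
            List<Range> ranges = new ArrayList<Range>(batchSize);
            while (indexIter.hasNext() && ranges.size() < batchSize) {
                ranges.add(new Range(indexIter.next().getKey().getColumnQualifier()));
            }
            if (currentScanner != null) {
                currentScanner.close(); // done with the previous batch
            }
            try {
                currentScanner = connector.createBatchScanner(docTableName, new Authorizations(), threadCount);
            } catch (TableNotFoundException e) {
                throw new RuntimeException(e);
            }
            currentScanner.setRanges(ranges);
            currentDocs = currentScanner.iterator();
        }
        if (!currentDocs.hasNext() && currentScanner != null) {
            currentScanner.close(); // nothing left; release the last scanner
            currentScanner = null;
        }
        return currentDocs.hasNext();
    }

    @Override
    public Map.Entry<Key, Value> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentDocs.next();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}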


On Wed, Jan 23, 2013 at 1:51 PM, Keith Turner <ke...@deenlo.com> wrote:

> How much data is coming back, and what's the data rate?  You can sum up
> the size of the keys and values in your loop.



-- 
John Stoneham
lyric@lyrically.net

Re: BatchScanning with a very large collection of ranges

Posted by Keith Turner <ke...@deenlo.com>.
How much data is coming back, and what's the data rate?  You can sum up
the size of the keys and values in your loop.
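
Something like this in your retrieval loop would do it (rough sketch, reusing the docScanner from your snippet; the counter names are just for illustration):

// Sum key/value sizes while iterating the BatchScanner to see how much data
// comes back and how fast; this only adds a couple of long counters to the loop.
long bytes = 0;
long entries = 0;
long start = System.currentTimeMillis();
for (Map.Entry<Key, Value> doc : docScanner) {
    bytes += doc.getKey().getSize() + doc.getValue().get().length;
    entries++;
    // ... existing per-document processing ...
}
double seconds = (System.currentTimeMillis() - start) / 1000.0;
System.out.printf("%d entries, %d bytes, %.1f KB/s%n", entries, bytes, (bytes / 1024.0) / seconds);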
