Posted to user@accumulo.apache.org by Eugene Cheipesh <ec...@gmail.com> on 2015/01/09 23:37:08 UTC

Seeking Iterator

Hello,

I am attempting to write an Iterator based on a Z-curve index to search through multi-dimensional data. Essentially, given a record I have encountered that is in the index range but not in the multi-dimensional query range, I have a way to generate the next candidate record, potentially far ahead of the current point.

Ideally I would be able to refine my search range with subsequent calls to seek(). It appears that Accumulo will create an iterator for every RFile (or some other split point). The beginning of the range argument to seek will be the record at the beginning of this split (which is good); however, all instances of the iterator have the same, global range end (which is bad).

I need to avoid the case where I seek past the range boundary of each individual iterator instance and throw a NullPointerException. Is there any way to get enough information to achieve this?
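To make the setup concrete, here is a minimal sketch of the kind of 2-d Z-curve (Morton) encoding I mean. The class and method names, the 16-bit coordinate width, and the bit layout are illustrative only, not my actual code; the standard algorithm for computing the next candidate index inside the query box (which I have omitted here) is usually called BIGMIN.

```java
// Illustrative 2-d Z-curve (Morton) encoding: bit i of x lands at bit
// 2*i of the index and bit i of y at bit 2*i + 1, so keys sorted by
// z-index follow the Z-order curve. 16-bit coordinates for brevity.
public class ZCurve {
    public static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((long) ((x >> i) & 1)) << (2 * i);
            z |= ((long) ((y >> i) & 1)) << (2 * i + 1);
        }
        return z;
    }

    // A z-index is in the query box iff both decoded coordinates are in
    // range; records failing this test are the ones I want to skip past.
    public static boolean inBox(long z, int xmin, int xmax, int ymin, int ymax) {
        int x = 0, y = 0;
        for (int i = 0; i < 16; i++) {
            x |= ((int) ((z >> (2 * i)) & 1)) << i;
            y |= ((int) ((z >> (2 * i + 1)) & 1)) << i;
        }
        return xmin <= x && x <= xmax && ymin <= y && y <= ymax;
    }
}
```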

Thank you,

-- 
Eugene Cheipesh

Re: Seeking Iterator

Posted by Adam Fuchs <sc...@gmail.com>.
On Mon, Jan 12, 2015 at 4:10 PM, Josh Elser <jo...@gmail.com> wrote:
> seek()'ing doesn't always imply an increase in performance -- remember that
> RFiles (the files that back Accumulo tables) are composed of multiple
> blocks/sections with an index over them. A seek is comprised of using that
> index to find the block/section of the RFile and then a linear scan forward
> to find the first key for the range you seek()'ed to.
>
> Thus, if you're repeatedly re-seek()'ing within the same block, you'll waste
> a lot of time re-reading the same data. In your situation, it sounds like the
> cost of re-reading the data after a seek is about the same as naively
> consuming the records.
>
> You can try altering table.file.compress.blocksize (and then compacting your
> table) to see how this changes.
>

There is actually some fairly well-optimized code in the RFile seek
that minimizes the re-reading of RFile data and index blocks. Seeking
forward by one key adds a couple of key comparisons and function
calls, but that's about it. Incidentally, key comparisons are pretty
high up on my list of things that could use some performance
optimization.

Adam

Re: Seeking Iterator

Posted by Josh Elser <jo...@gmail.com>.
Eugene Cheipesh wrote:
> That is the idea I am playing with, optimizing subsequent calls to .next
> vs .reseek.
>
> Using the VersioningIterator as an example, I was able to get reseek
> working; the trick turned out to be to check getSource().hasTop() a little
> more carefully. Thank you for that pointer. Disappointingly enough, there
> does not appear to be a huge difference in performance from a Filter that
> performs the same checking without seeking forward.

seek()'ing doesn't always imply an increase in performance -- remember 
that RFiles (the files that back Accumulo tables) are composed of 
multiple blocks/sections with an index over them. A seek is comprised of 
using that index to find the block/section of the RFile and then a 
linear scan forward to find the first key for the range you seek()'ed to.

Thus, if you're repeatedly re-seek()'ing within the same block, you'll 
waste a lot of time re-reading the same data. In your situation, it 
sounds like the cost of re-reading the data after a seek is about the 
same as naively consuming the records.

You can try altering table.file.compress.blocksize (and then compacting 
your table) to see how this changes.
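For instance, from the Accumulo shell (the table name "zindex" and the 32K value are placeholders to experiment with, not recommendations):

```shell
# Shrink the RFile data block size for the table, then compact so the
# existing files are rewritten with the new block size.
config -t zindex -s table.file.compress.blocksize=32K
compact -t zindex -w
```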

> I am attempting to use Accumulo tracing to come up with an explanation,
> but results are spotty. Query runtime is ~2.3s for 200k entries and I am
> only capturing a tiny fraction of that through tracing.

You mean that your iterator isn't accounting for a majority of the 
execution time? Are you bounded by the time it takes to read the data 
from disk?

> I am using
> AccumuloInputFormat to pull results into a spark job. Would I get better
> results if I was creating the BatchScanner directly? Otherwise what
> would be the best method to debug the iterator and query performance?

Are there 200k entries in the table or are you extracting 200k out of N 
entries? Hooking into tracing is definitely the right approach to take 
to get perf information.


Re: Seeking Iterator

Posted by Eugene Cheipesh <ec...@gmail.com>.
That is the idea I am playing with: optimizing subsequent calls to .next vs .reseek.

Using the VersioningIterator as an example, I was able to get reseek working; the trick turned out to be to check getSource().hasTop() a little more carefully. Thank you for that pointer. Disappointingly enough, there does not appear to be a huge difference in performance from a Filter that performs the same checking without seeking forward.

I am attempting to use Accumulo tracing to come up with an explanation, but results are spotty. Query runtime is ~2.3s for 200k entries and I am only capturing a tiny fraction of that through tracing. I am using AccumuloInputFormat to pull results into a Spark job. Would I get better results if I was creating the BatchScanner directly? Otherwise, what would be the best method to debug the iterator and query performance?

-- 
Eugene Cheipesh

From: Russ Weeks <rw...@newbrightidea.com>
Reply: user@accumulo.apache.org
Date: January 9, 2015 at 11:32:13 PM
To: user@accumulo.apache.org
Subject: Re: Seeking Iterator


Re: Seeking Iterator

Posted by Russ Weeks <rw...@newbrightidea.com>.
On Fri, Jan 9, 2015 at 7:56 PM, Christopher <ct...@apache.org> wrote:

> Another optimization you can try: instead of always seeking to the
> computed next, you can advance internally inside your iterator by calling
> its source's next method a few times. If you don't reach the next element
> that you would have seek'd to in some reasonable number of iterations, you
> can then seek. This also is a strategy that is hard to optimize: Do I need
> to advance, on average, 3 or 20 or 10000000 keys? How many before it would
> have been more efficient to just seek? There's no easy answer.
> Experimentation helps.
>

The VersioningIterator has a good example of this approach:
https://github.com/apache/accumulo/blob/901d60ef1cf72c2d55c90746fce94e108a992d3b/core/src/main/java/org/apache/accumulo/core/iterators/user/VersioningIterator.java#L95

-Russ



Re: Seeking Iterator

Posted by Christopher <ct...@apache.org>.
The base iterator on the server side implements a seek fence, so you can't
seek outside of the underlying source anyway. It's safe, then, to seek ahead
as much as you want until you exhaust the source (reach a null top key).

A BatchScanner with a single range will also break things up internally
into multiple smaller ranges if they are spread across different tablets.
You only really need to compute your own separate ranges in this case if
you have large known gaps you don't want to bother with. On the other hand,
you may not care about going through these unnecessary ranges if they
typically seek straight to the end, because the next computed key is
outside the scope of that tablet.

It is hard to optimize this problem. There is no easy answer. I would
suggest experimentation, based on your data, to determine the optimal case.
It's basically a trade-off between seeks and pre-computing ranges to run in
parallel.

Another optimization you can try: instead of always seeking to the computed
next, you can advance internally inside your iterator by calling its
source's next method a few times. If you don't reach the next element that
you would have seek'd to in some reasonable number of iterations, you can
then seek. This also is a strategy that is hard to optimize: Do I need to
advance, on average, 3 or 20 or 10000000 keys? How many before it would
have been more efficient to just seek? There's no easy answer.
Experimentation helps.
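A sketch of that strategy follows. The SimpleSource interface and the MAX_NEXT threshold are illustrative stand-ins for an Accumulo SortedKeyValueIterator source and an experimentally tuned cutoff; this is not real Accumulo API.

```java
// The "call next() a few times before paying for a seek()" strategy.
public class SkipAhead {
    public static final int MAX_NEXT = 10; // cutoff to tune experimentally

    public interface SimpleSource {
        boolean hasTop();
        long topKey();          // stands in for a full Accumulo Key
        void next();            // advance by one entry (cheap)
        void seek(long target); // jump to first key >= target (fixed cost)
    }

    // Advance to the first key >= target: call next() while the target is
    // close, fall back to a real seek() after MAX_NEXT steps.
    public static void advanceTo(SimpleSource src, long target) {
        int steps = 0;
        while (src.hasTop() && src.topKey() < target) {
            if (steps++ >= MAX_NEXT) {
                src.seek(target);
                return;
            }
            src.next();
        }
    }
}
```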


--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


Re: Seeking Iterator

Posted by Chris Bennight <ch...@slowcar.net>.
Russ's solution is generally the way to go.

In order to implement your original methodology, you would need to query
Accumulo for information about the key ranges associated with each split
point, decompose your "quadtree" until you ended up with a quadrant that
was completely encapsulated within a particular split (i.e. its min/max),
and then have an iterator be responsible for the further
decomposition/refinement (you would need to serialize and send a portion
of your query range to the iterator, as well as the decomposition logic).
There would then be issues where some of the quadrants cross splits, so
you would end up with duplicate decomposition work (i.e. a bounding box
is likely going to have discontinuities in it when translated to a 1-d
range).

In the end, all you would be saving is the up-front decomposition cost --
which is generally trivial unless you start talking about very large
amounts of precision (64-bit geohashes for 2 dimensions, etc.) -- and in
that case the number of ranges you generate would be a bit obscene anyway,
so you would want to heuristically roll them up.
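One way to sketch that roll-up heuristic: merge adjacent 1-d index ranges whenever the gap between them is small, trading a few extraneous keys scanned for far fewer ranges. Here long[]{lo, hi} stands in for an Accumulo Range, and the class name and maxGap knob are illustrative, not Accumulo API.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeRollup {
    // sortedRanges must be sorted by lower bound and non-overlapping.
    public static List<long[]> rollUp(List<long[]> sortedRanges, long maxGap) {
        List<long[]> out = new ArrayList<>();
        for (long[] r : sortedRanges) {
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && r[0] - last[1] <= maxGap) {
                last[1] = Math.max(last[1], r[1]); // small gap: extend previous
            } else {
                out.add(new long[]{r[0], r[1]});   // large gap: keep separate
            }
        }
        return out;
    }
}
```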

In order to optimize Accumulo input splits, we tackled issues similar to
this problem (optimizing index ranges and RFile split points), and you can
see some of the logic here:
https://github.com/ngageoint/geowave/blob/master/geowave-accumulo/src/main/java/mil/nga/giat/geowave/accumulo/mapreduce/input/GeoWaveInputFormat.java

--

Also, assuming this is geospatial in nature, you probably should have at
least one test case for date line handling -- either document that it's not
allowed, or determine how you want to handle a bounding box that crosses
the dateline (typically some variation of breaking it down into two or
more bounding areas on either side and combining the results).

Re: Seeking Iterator

Posted by Eugene Cheipesh <ec...@gmail.com>.
That would work well enough and is my next choice.

The thought was: rows are stored in increasing order, so as long as I know when I've walked off the edge and can flag the iterator as empty, it'd be fine. I'm just chasing the optimal in this case, but if it doesn't exist, oh well.

Thank you for the reference link; it's very helpful.

-- 
Eugene Cheipesh

From: Russ Weeks <rw...@newbrightidea.com>
Reply: user@accumulo.apache.org
Date: January 9, 2015 at 6:48:47 PM
To: user@accumulo.apache.org
Subject: Re: Seeking Iterator


Re: Seeking Iterator

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Eugene,

I think the conventional approach is to decompose your search area
(bounding box?) into a set of scan ranges that introduce minimal extraneous
curve segments, and then pass all those scan ranges into a BatchScanner.
The excellent Accumulo Recipes site has an example[1]. Does this approach
not work for you?

In general, your custom iterator should never try to seek to a row id
different from the current row id, because that row could be hosted by a
different tablet server.

-Russ

1:
https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/support/QuadTreeHelper.java#L33
