Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2015/01/16 19:27:02 UTC

Geospatial + Partitioned Index

Hey, all,

I'm looking at switching my geospatial index to a partitioned index to
smooth out some hotspots. So for any query, I'll have a bunch of ranges
representing intervals on a Hilbert curve, plus a bunch of partitions, each
of which needs to be scanned for every range.

The way that the (excellent!) Accumulo Recipes geospatial store addresses
this is to take the product of the partitions and the curve intervals[1].
It seems like an alternative would be to encode the curve intervals as a
property of a custom iterator (I need one anyway to filter out extraneous
points from the search area) and then the client would just scan (-inf,
+inf), which I think is more typical when querying a partitioned index?

Can anybody comment on which approach is preferred? Is it common to expose
the number of partitions in the index and the encoding of those partitions
to client code? Am I needlessly worried that taking the product of the
curve intervals and the partitions will produce too many ranges?

Thanks,
-Russ

1:
https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190

Re: Geospatial + Partitioned Index

Posted by Russ Weeks <rw...@newbrightidea.com>.
Good to know, thanks Josh!
-Russ

On Fri, Jan 16, 2015 at 2:40 PM, Josh Elser <jo...@gmail.com> wrote:


Re: Geospatial + Partitioned Index

Posted by Josh Elser <jo...@gmail.com>.
Thanks for the clarification, Russ, I assumed something of the sort was 
the case.

It's important to remember that there is still benefit to "partition 
elimination". Even though scanning the entire table won't read all the 
data on the back end, you'll likely incur extra RPCs to servers, file 
opens, iterator creation, etc. If your query is only going to match a 
few records, this can turn out to be a significant portion of your 
execution time. Something to keep in mind :)


Re: Geospatial + Partitioned Index

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Jim,

Thanks very much for the info - and also for "Spatio-temporal Indexing in
Non-relational Distributed Databases", it's a very useful resource. At the
moment, geospatial data is peripheral to the product I'm working on, so we
figured it would be easier to build something simple on our own rather than
go to the effort of integrating a separate project. But I've definitely got
my eye on GeoWave/GeoMesa should things change.

Regards,
-Russ

On Fri, Jan 16, 2015 at 1:50 PM, James Hughes <jn...@virginia.edu> wrote:


Re: Geospatial + Partitioned Index

Posted by James Hughes <jn...@virginia.edu>.
Hi Russ,

We've had good success in GeoMesa's spatio-temporal index blending together
sharding, spatial, and temporal data for indexing and planning queries for
geo-time data.  As you noted, this approach can lead to a large number of
ranges; so far, we haven't seen any problems based on the number of ranges.

In general, if you are looking for a product which can handle geographic
data with time and other attributes, I'd suggest either GeoMesa or
GeoWave.  The two are similar, and we are hopeful that we can further
align them in the future.

If rolling your own is the order of the day, you might check with Eugene to
see how his spatial range skipping iterator is going.  He may be using
GeoHashes/z-order curves rather than Hilbert curves.

Thanks,

Jim

On Fri, Jan 16, 2015 at 3:37 PM, Russ Weeks <rw...@newbrightidea.com>
wrote:


Re: Geospatial + Partitioned Index

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Josh,

Thanks for your response. I think I should clarify something. When I said,
"the client would just scan (-inf, +inf)", I didn't mean that the net
effect would be to read all data. I just meant that my custom Iterator
would seek() to ranges which are a function of its configuration and its
knowledge of the partitioning scheme, just like the IntersectingIterator.
Except that instead of its configuration defining a set of keyword terms,
it would define a set of disjoint intervals on a space-filling curve.
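The seek-per-interval behaviour Russ describes can be sketched roughly like this. This is not Accumulo's actual SortedKeyValueIterator API; it is a minimal Python stand-in over a sorted list of hypothetical curve-position keys, showing how an iterator configured with disjoint intervals can jump to each interval instead of streaming every key in between.

```python
import bisect

def scan_with_interval_seeks(sorted_keys, intervals):
    """Emit only keys inside the configured curve intervals.

    For each interval, binary-search (the stand-in for seek()) to the
    first key >= the interval's low end, then read forward until the
    keys pass the interval's high end.
    """
    hits = []
    for lo, hi in sorted(intervals):
        i = bisect.bisect_left(sorted_keys, lo)   # the "seek"
        while i < len(sorted_keys) and sorted_keys[i] <= hi:
            hits.append(sorted_keys[i])
            i += 1
    return hits

keys = ["0a01", "0a7f", "1b00", "3b15", "3b40", "c010"]
print(scan_with_interval_seeks(keys, [("0a00", "0aff"), ("3b10", "3b2f")]))
# ['0a01', '0a7f', '3b15']
```

Within each tablet, a real iterator would do the same thing against the tablet's key space, which is why the (-inf,+inf) client range doesn't imply reading all data.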

My understanding is that setting the scan range to (-inf,+inf) in this case
is just a way to tell Accumulo, "run this scan across all tablets".

-Russ

On Fri, Jan 16, 2015 at 12:17 PM, Josh Elser <jo...@gmail.com> wrote:


Re: Geospatial + Partitioned Index

Posted by Josh Elser <jo...@gmail.com>.
Russ Weeks wrote:
> Hey, all,
>
> I'm looking at switching my geospatial index to a partitioned index to
> smooth out some hotspots. So for any query, I'll have a bunch of ranges
> representing intervals on a Hilbert curve, plus a bunch of partitions,
> each of which needs to be scanned for every range.
>
> The way that the (excellent!) Accumulo Recipes geospatial store
> addresses this is to take the product of the partitions and the curve
> intervals[1]. It seems like an alternative would be to encode the curve
> intervals as a property of a custom iterator (I need one anyways to
> filter out extraneous points from the search area) and then the client
> would just scan (-inf, +inf), which I think is more typical when
> querying a partitioned index?

I'm no expert on storing geo-spatial data, but having to scan 
(-inf,+inf) on a table for a query is typically the reason people deal 
with the pain of hot-spotting, although it is the easiest to implement.

If you can be "tricky" in how you're encoding your data in the row such 
that you can reduce the search space over your partitioned index, you 
can try to get the best of both worlds (avoid reading all data and still 
get a good distribution).

Since that was extremely vague, here's an example: say you had a text 
index and wanted to look up the word "the" and your index had 100 
partitions, [0,99]. If you knew that it was only possible for "the" to 
show up on partitions 5, 27 and 83 (typically by use of some hashing 
function), you could drastically reduce your search space while still 
avoiding hot spotting on a single server.
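Josh's trick can be sketched like this. The partition count, replica count, and hash scheme below are all hypothetical (Python, not any particular store's layout): the point is only that when a term's partitions are derived from a salted hash, the client can recompute the candidate partitions at query time and scan just those prefixes.

```python
import hashlib

NUM_PARTITIONS = 100
REPLICAS = 3  # each term is spread over at most this many partitions

def candidate_partitions(term):
    """Recompute, at query time, the only partitions a term can live in.

    Writers use the same function, so a term lands on a small, fixed,
    deterministic set of partitions: spread out enough to avoid
    hot-spotting one server, small enough to prune the search space.
    """
    parts = set()
    for salt in range(REPLICAS):
        digest = hashlib.sha1(("%d:%s" % (salt, term)).encode()).hexdigest()
        parts.add(int(digest, 16) % NUM_PARTITIONS)
    return sorted(parts)

# Scan only these partition prefixes instead of all 100:
print(candidate_partitions("the"))
```

The same idea carries over to the geospatial case if the partition for a point can be derived from something the query also knows.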

> Can anybody comment on which approach is preferred? Is it common to
> expose the number of partitions in the index and the encoding of those
> partitions to client code? Am I needlessly worried that taking the
> product of the curve intervals and the partitions will produce too many
> ranges?

In the trivial sense, the client doesn't need to know the partitions and 
would just scan the entire index like you said earlier. You could also 
track the partitions that you have created in a separate table and the 
client could read that table to know ahead of time (if you have a reason 
to do so in your implementation).

Depending on the amount of data you have, lots of ranges to check could 
take some time. YMMV

> Thanks,
> -Russ
>
> 1:
> https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/impl/AccumuloGeoSpatialStore.java#L190