You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Lydia <ic...@googlemail.com> on 2017/05/05 19:40:53 UTC

Smart Table creation for 2D range query

Hi all,

I am new to Apache Cassandra and I would like to get some advice on how to tackle a table creation / indexing in a sophisticated way.

My aim is to store x- and y-coordinates, accompanied by some columns with meta information (m1, ... ,m5). There will be around 100,000,000 rows overall. Some rows might have the same (x,y) pairs but always distinct meta information. 

In the end I want to do a rather simple range query in the form of e.g. (0 >= x <= 1) AND (0 >= y <= 1).

What would be the best choice of variables to set as primary key, partition key. Or should I use a index? And if so on what column(s)?

Thanks in advance!
Best regards, 
Lydia
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org

Re: Smart Table creation for 2D range query

Posted by Jon Haddad <jo...@gmail.com>.

Sure, I don't see why not.  Ultimately this is more or less the same thing
I proposed.   You end up with a slightly different way of encoding a point
in space into a rough geographical area.  Whether you encode them as a tree
structure or some prefix of a geohash is a matter of convenience.  I'm not
sure if there's any performance advantage to using geohashes, from a
Cassandra data model & query perspective, as I haven't spent much time with
them.  Maybe someone who's done this can chime in.

On Tue, May 9, 2017 at 1:16 PM Jim Ancona <ji...@anconafamily.com> wrote:

> Couldn't you use a bucketing strategy for the hash value, much like with
> time series data? That is, choose a partition key granularity that puts a
> reasonable number of rows in a partition, with the actual hash being the
> clustering key. Then ranges that within the partition key granularity could
> be efficiently queried.
>
> Jim
>
> On Tue, May 9, 2017 at 11:19 AM, Jon Haddad <jo...@gmail.com>
> wrote:
>
>> The problem with using geohashes is that you can’t efficiently do ranges
>> with random token distribution.  So even if your scalar values are close to
>> each other numerically they’ll likely end up on different nodes, and you
>> end up doing a scatter gather.
>>
>> If the goal is to provide a scalable solution, building a table that
>> functions as an R-Tree or Quad Tree is the only way I know that can solve
>> the problem without scanning the entire cluster.
>>
>> Jon
>>
>> On May 9, 2017, at 10:11 AM, Jim Ancona <ji...@anconafamily.com> wrote:
>>
>> There are clever ways to encode coordinates into a single scalar value
>> where points that are close on a surface are also close in value, making
>> queries efficient. Examples are Geohash
>> <https://en.wikipedia.org/wiki/Geohash> and Google's S2
>> <https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
>> As Jon mentions, this puts more work on the client, but might give you a
>> lot of querying flexibility when using Cassandra.
>>
>> Jim
>>
>> On Mon, May 8, 2017 at 11:13 PM, Jon Haddad <jo...@gmail.com>
>> wrote:
>>
>>> It gets a little tricky when you try to add in the coordinates to the
>>> clustering key if you want to do operations that are more complex.  For
>>> instance, finding all the elements within a radius of point (x,y) isn’t
>>> particularly fun with Cassandra.  I recommend moving that logic into the
>>> application.
>>>
>>> > On May 8, 2017, at 10:06 PM, kurt greaves <ku...@instaclustr.com>
>>> wrote:
>>> >
>>> > Note that will not give you the desired range queries of 0 >= x <= 1
>>> and 0 >= y <= 1.
>>> >
>>> >
>>> > Something akin to Jon's solution could give you those range queries
>>> if you made the x and y components part of the clustering key.
>>> >
>>> > For example, a space of (1,1) could contain all x,y coordinates where
>>> x and y are > 0 and <= 1. You would then have a table like:
>>> >
>>> > CREATE TABLE geospatial (
>>> > space text,
>>> > x double,
>>> > y double,
>>> > item text,
>>> > m1,
>>> > m2,
>>> > m3,
>>> > primary key ((space), x, y, m1, m2, m3, m4, m5)
>>> > );
>>> >
>>> > A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2
>>> and y>0.1; should yield all x and y pairs and their distinct metadata. Or
>>> something like that anyway.
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>> For additional commands, e-mail: user-help@cassandra.apache.org
>>>
>>>
>>
>>
>

Re: Smart Table creation for 2D range query

Posted by Jim Ancona <ji...@anconafamily.com>.

Couldn't you use a bucketing strategy for the hash value, much like with
time series data? That is, choose a partition key granularity that puts a
reasonable number of rows in a partition, with the actual hash being the
clustering key. Then ranges that within the partition key granularity could
be efficiently queried.

Jim

On Tue, May 9, 2017 at 11:19 AM, Jon Haddad <jo...@gmail.com>
wrote:

> The problem with using geohashes is that you can’t efficiently do ranges
> with random token distribution.  So even if your scalar values are close to
> each other numerically they’ll likely end up on different nodes, and you
> end up doing a scatter gather.
>
> If the goal is to provide a scalable solution, building a table that
> functions as an R-Tree or Quad Tree is the only way I know that can solve
> the problem without scanning the entire cluster.
>
> Jon
>
> On May 9, 2017, at 10:11 AM, Jim Ancona <ji...@anconafamily.com> wrote:
>
> There are clever ways to encode coordinates into a single scalar value
> where points that are close on a surface are also close in value, making
> queries efficient. Examples are Geohash
> <https://en.wikipedia.org/wiki/Geohash> and Google's S2
> <https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
> As Jon mentions, this puts more work on the client, but might give you a
> lot of querying flexibility when using Cassandra.
>
> Jim
>
> On Mon, May 8, 2017 at 11:13 PM, Jon Haddad <jo...@gmail.com>
> wrote:
>
>> It gets a little tricky when you try to add in the coordinates to the
>> clustering key if you want to do operations that are more complex.  For
>> instance, finding all the elements within a radius of point (x,y) isn’t
>> particularly fun with Cassandra.  I recommend moving that logic into the
>> application.
>>
>> > On May 8, 2017, at 10:06 PM, kurt greaves <ku...@instaclustr.com> wrote:
>> >
>> > Note that will not give you the desired range queries of 0 >= x <= 1
>> and 0 >= y <= 1.
>> >
>> >
>> > Something akin to Jon's solution could give you those range queries if
>> you made the x and y components part of the clustering key.
>> >
>> > For example, a space of (1,1) could contain all x,y coordinates where x
>> and y are > 0 and <= 1. You would then have a table like:
>> >
>> > CREATE TABLE geospatial (
>> > space text,
>> > x double,
>> > y double,
>> > item text,
>> > m1,
>> > m2,
>> > m3,
>> > primary key ((space), x, y, m1, m2, m3, m4, m5)
>> > );
>> >
>> > A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2
>> and y>0.1; should yield all x and y pairs and their distinct metadata. Or
>> something like that anyway.
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>>
>>
>
>

Re: Smart Table creation for 2D range query

Posted by Jon Haddad <jo...@gmail.com>.

The problem with using geohashes is that you can’t efficiently do ranges with random token distribution.  So even if your scalar values are close to each other numerically they’ll likely end up on different nodes, and you end up doing a scatter gather.

If the goal is to provide a scalable solution, building a table that functions as an R-Tree or Quad Tree is the only way I know that can solve the problem without scanning the entire cluster.

Jon

> On May 9, 2017, at 10:11 AM, Jim Ancona <ji...@anconafamily.com> wrote:
> 
> There are clever ways to encode coordinates into a single scalar value where points that are close on a surface are also close in value, making queries efficient. Examples are Geohash <https://en.wikipedia.org/wiki/Geohash> and Google's S2 <https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>. As Jon mentions, this puts more work on the client, but might give you a lot of querying flexibility when using Cassandra.
> 
> Jim
> 
> On Mon, May 8, 2017 at 11:13 PM, Jon Haddad <jonathan.haddad@gmail.com <ma...@gmail.com>> wrote:
> It gets a little tricky when you try to add in the coordinates to the clustering key if you want to do operations that are more complex.  For instance, finding all the elements within a radius of point (x,y) isn’t particularly fun with Cassandra.  I recommend moving that logic into the application.
> 
> > On May 8, 2017, at 10:06 PM, kurt greaves <kurt@instaclustr.com <ma...@instaclustr.com>> wrote:
> >
> > Note that will not give you the desired range queries of 0 >= x <= 1 and 0 >= y <= 1.
> >
> >
> > Something akin to Jon's solution could give you those range queries if you made the x and y components part of the clustering key.
> >
> > For example, a space of (1,1) could contain all x,y coordinates where x and y are > 0 and <= 1. You would then have a table like:
> >
> > CREATE TABLE geospatial (
> > space text,
> > x double,
> > y double,
> > item text,
> > m1,
> > m2,
> > m3,
> > primary key ((space), x, y, m1, m2, m3, m4, m5)
> > );
> >
> > A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2 and y>0.1; should yield all x and y pairs and their distinct metadata. Or something like that anyway.
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org <ma...@cassandra.apache.org>
> For additional commands, e-mail: user-help@cassandra.apache.org <ma...@cassandra.apache.org>
> 
>

Re: Smart Table creation for 2D range query

Posted by Jim Ancona <ji...@anconafamily.com>.

There are clever ways to encode coordinates into a single scalar value
where points that are close on a surface are also close in value, making
queries efficient. Examples are Geohash
<https://en.wikipedia.org/wiki/Geohash> and Google's S2
<https://docs.google.com/presentation/d/1Hl4KapfAENAOf4gv-pSngKwvS_jwNVHRPZTTDzXXn6Q/view#slide=id.i0>.
As Jon mentions, this puts more work on the client, but might give you a
lot of querying flexibility when using Cassandra.

Jim

On Mon, May 8, 2017 at 11:13 PM, Jon Haddad <jo...@gmail.com>
wrote:

> It gets a little tricky when you try to add in the coordinates to the
> clustering key if you want to do operations that are more complex.  For
> instance, finding all the elements within a radius of point (x,y) isn’t
> particularly fun with Cassandra.  I recommend moving that logic into the
> application.
>
> > On May 8, 2017, at 10:06 PM, kurt greaves <ku...@instaclustr.com> wrote:
> >
> > Note that will not give you the desired range queries of 0 >= x <= 1 and
> 0 >= y <= 1.
> >
> >
> > Something akin to Jon's solution could give you those range queries if
> you made the x and y components part of the clustering key.
> >
> > For example, a space of (1,1) could contain all x,y coordinates where x
> and y are > 0 and <= 1. You would then have a table like:
> >
> > CREATE TABLE geospatial (
> > space text,
> > x double,
> > y double,
> > item text,
> > m1,
> > m2,
> > m3,
> > primary key ((space), x, y, m1, m2, m3, m4, m5)
> > );
> >
> > A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2
> and y>0.1; should yield all x and y pairs and their distinct metadata. Or
> something like that anyway.
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>

Re: Smart Table creation for 2D range query

Posted by Jon Haddad <jo...@gmail.com>.

It gets a little tricky when you try to add in the coordinates to the clustering key if you want to do operations that are more complex.  For instance, finding all the elements within a radius of point (x,y) isn’t particularly fun with Cassandra.  I recommend moving that logic into the application.  

> On May 8, 2017, at 10:06 PM, kurt greaves <ku...@instaclustr.com> wrote:
> 
> Note that will not give you the desired range queries of 0 >= x <= 1 and 0 >= y <= 1.
> 
> 
> Something akin to Jon's solution could give you those range queries if you made the x and y components part of the clustering key.
> 
> For example, a space of (1,1) could contain all x,y coordinates where x and y are > 0 and <= 1. You would then have a table like:
> 
> CREATE TABLE geospatial (
> space text,
> x double,
> y double,
> item text,
> m1,
> m2,
> m3,
> primary key ((space), x, y, m1, m2, m3, m4, m5)
> );
> 
> A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2 and y>0.1; should yield all x and y pairs and their distinct metadata. Or something like that anyway.
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org

Re: Smart Table creation for 2D range query

Posted by kurt greaves <ku...@instaclustr.com>.

Note that will not give you the desired range queries of 0 >= x <= 1 and 0
>= y <= 1.


Something akin to Jon's solution could give you those range queries if you
made the x and y components part of the clustering key.

For example, a space of (1,1) could contain all x,y coordinates where x and
y are > 0 and <= 1. You would then have a table like:

CREATE TABLE geospatial (
space text,
x double,
y double,
item text,
m1,
m2,
m3,
primary key ((space), x, y, m1, m2, m3, m4, m5)
);

A query of select * where space = '1,1' and x <1 and x >0.5 and y< 0.2 and
y>0.1; should yield all x and y pairs and their distinct metadata. Or
something like that anyway.

Re: Smart Table creation for 2D range query

Posted by Anthony Grasso <an...@gmail.com>.

Hi Lydia,

Yes. This will define the *x*, *y* columns as the components of the
partition key. Note that by doing this both *x* and *y* values will be
required to at a minimum to perform a valid query.

Alternatively, the *x* and *y* values could be combined in into a single
text field as Jon has suggested.

Kind regards,
Anthony

On 7 May 2017 at 17:15, Lydia Ickler <ic...@googlemail.com> wrote:

> Like this?
>
> CREATE TABLE test (
>   x double,
>   y double,
>   m1 int,
>   ...
>   m5 int,
>   PRIMARY KEY ((x,y), m1, … , m5)
> )
>
>
>
> Am 05.05.2017 um 21:54 schrieb Nitan Kainth <ni...@bamlabs.com>:
>
> Make metadata as partition key and x,y as part of partition key i.e.
> Primary key. It should work
>
> Sent from my iPhone
>
> On May 5, 2017, at 2:40 PM, Lydia <ic...@googlemail.com> wrote:
>
>
> Hi all,
>
>
> I am new to Apache Cassandra and I would like to get some advice on how to
> tackle a table creation / indexing in a sophisticated way.
>
>
> My aim is to store x- and y-coordinates, accompanied by some columns with
> meta information (m1, ... ,m5). There will be around 100,000,000 rows
> overall. Some rows might have the same (x,y) pairs but always distinct meta
> information.
>
>
> In the end I want to do a rather simple range query in the form of e.g. (0
> >= x <= 1) AND (0 >= y <= 1).
>
>
> What would be the best choice of variables to set as primary key,
> partition key. Or should I use a index? And if so on what column(s)?
>
>
> Thanks in advance!
>
> Best regards,
>
> Lydia
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>
> For additional commands, e-mail: user-help@cassandra.apache.org
>
>
>

Re: Smart Table creation for 2D range query

Posted by Lydia Ickler <ic...@googlemail.com>.

Like this?

CREATE TABLE test ( x double, y double, m1 int, ... m5 int, PRIMARY KEY ((x,y), m1, … , m5) )


> Am 05.05.2017 um 21:54 schrieb Nitan Kainth <ni...@bamlabs.com>:
> 
> Make metadata as partition key and x,y as part of partition key i.e. Primary key. It should work
> 
> Sent from my iPhone
> 
>> On May 5, 2017, at 2:40 PM, Lydia <ic...@googlemail.com> wrote:
>> 
>> Hi all,
>> 
>> I am new to Apache Cassandra and I would like to get some advice on how to tackle a table creation / indexing in a sophisticated way.
>> 
>> My aim is to store x- and y-coordinates, accompanied by some columns with meta information (m1, ... ,m5). There will be around 100,000,000 rows overall. Some rows might have the same (x,y) pairs but always distinct meta information. 
>> 
>> In the end I want to do a rather simple range query in the form of e.g. (0 >= x <= 1) AND (0 >= y <= 1).
>> 
>> What would be the best choice of variables to set as primary key, partition key. Or should I use a index? And if so on what column(s)?
>> 
>> Thanks in advance!
>> Best regards, 
>> Lydia
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>>

Re: Smart Table creation for 2D range query

Posted by Jon Haddad <jo...@gmail.com>.

I think you’ll want to model your table similar to how an R-Tree [1] / Quad tree [2] works.  Let’s suppose you had a 10x10 meter land area and you wanted to put stuff in there.  In order to find “all the things in point x,y”, you could break your land area into a grid.  A partition would contain all the items that are in that grid space.  In my simple example, I’d have 100 partitions.

For example:

// space is a simple "x.y" text field
CREATE TABLE geospatial (
space text,
item text,
primary key (space, item)
);

insert into geospatial (space, item) values ('1.1', 'hat');
insert into geospatial (space, item) values ('1.1', 'bird');
insert into geospatial (space, item) values ('6.4', 'dog’);

This example is pretty trivial, and doesn’t take into account hot partitions.  That’s where the process of subdividing a space occurs when it reaches a certain size.

[1] https://en.wikipedia.org/wiki/R-tree <https://en.wikipedia.org/wiki/R-tree>
[2] https://en.wikipedia.org/wiki/Quadtree <https://en.wikipedia.org/wiki/Quadtree>
> On May 5, 2017, at 12:54 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
> 
> Make metadata as partition key and x,y as part of partition key i.e. Primary key. It should work
> 
> Sent from my iPhone
> 
>> On May 5, 2017, at 2:40 PM, Lydia <ic...@googlemail.com> wrote:
>> 
>> Hi all,
>> 
>> I am new to Apache Cassandra and I would like to get some advice on how to tackle a table creation / indexing in a sophisticated way.
>> 
>> My aim is to store x- and y-coordinates, accompanied by some columns with meta information (m1, ... ,m5). There will be around 100,000,000 rows overall. Some rows might have the same (x,y) pairs but always distinct meta information. 
>> 
>> In the end I want to do a rather simple range query in the form of e.g. (0 >= x <= 1) AND (0 >= y <= 1).
>> 
>> What would be the best choice of variables to set as primary key, partition key. Or should I use a index? And if so on what column(s)?
>> 
>> Thanks in advance!
>> Best regards, 
>> Lydia
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>> For additional commands, e-mail: user-help@cassandra.apache.org
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>

Re: Smart Table creation for 2D range query

Posted by Nitan Kainth <ni...@bamlabs.com>.

Make metadata as partition key and x,y as part of partition key i.e. Primary key. It should work

Sent from my iPhone

> On May 5, 2017, at 2:40 PM, Lydia <ic...@googlemail.com> wrote:
> 
> Hi all,
> 
> I am new to Apache Cassandra and I would like to get some advice on how to tackle a table creation / indexing in a sophisticated way.
> 
> My aim is to store x- and y-coordinates, accompanied by some columns with meta information (m1, ... ,m5). There will be around 100,000,000 rows overall. Some rows might have the same (x,y) pairs but always distinct meta information. 
> 
> In the end I want to do a rather simple range query in the form of e.g. (0 >= x <= 1) AND (0 >= y <= 1).
> 
> What would be the best choice of variables to set as primary key, partition key. Or should I use a index? And if so on what column(s)?
> 
> Thanks in advance!
> Best regards, 
> Lydia
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org