Posted to user@cassandra.apache.org by Peter Hsu <pe...@motivecast.com> on 2012/06/30 02:13:10 UTC
Data modeling question
I have a question on what the best way is to store the data in my schema.
The data
I have millions of nodes, each with a distinct Cartesian coordinate. The keys for the nodes are hashed from the coordinate.
My search is a proximity search: I'd like to find all the nodes within a given distance of a particular node. I can create an arbitrary grouping that buckets an arbitrary number of nodes together based on proximity, e.g.
group 0 contains all points from (0,0) to (10,10)
group 1 contains all points from (10,0) to (20,10).
For each coordinate, I store various metadata:
8 columns: 4 UTF8Type (~20 bytes each) and 4 DoubleType.
The query
I need a proximity search to return all data within a range of a selected node. The typical read is ~100 distinct rows (e.g. a 10x10 grid around the selected node). Since it's on a coordinate system, I know ahead of time exactly which 100 rows I need.
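Since the window is known up front, enumerating the keys is trivial. A minimal sketch in Python, assuming the row key is the plain 'x,y' string used in the examples below (the actual keys are hashed from the coordinate, but the enumeration is the same):

    def window_keys(cx, cy, radius=5):
        # Row keys for the 10x10 grid around (cx, cy), using the
        # 'x,y' key format from the examples below. Returns 100 keys.
        return ['%d,%d' % (x, y)
                for x in range(cx - radius, cx + radius)
                for y in range(cy - radius, cy + radius)]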
The modeling options
Option 1:
- single column family, with key being the coordinate hash
e.g.
'0,0' : { meta }
'0,1' : { meta }
…
'10,20' : { meta }
- query for 100 rows in parallel (sketched below)
- I think this option sucks because it's essentially 100 non-sequential reads?
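For what it's worth, a sketch of option 1 using the DataStax Python driver for illustration (keyspace, table, and column names are made up; the point is issuing the 100 point-reads concurrently rather than serially):

    from cassandra.cluster import Cluster

    # Contact point and keyspace are assumptions for the sketch.
    session = Cluster(['127.0.0.1']).connect('geo')
    select = session.prepare('SELECT * FROM points WHERE key = ?')

    def read_window(keys):
        # Fire all 100 reads at once; each key lives on whichever
        # replica owns it, so the requests spread across the cluster.
        futures = [session.execute_async(select, (k,)) for k in keys]
        return [row for f in futures for row in f.result()]

The reads are still random rather than sequential, but at least their latencies overlap instead of adding up.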
Option 2:
- group my data into super columns, with key being the grouping
e.g.
'0' : {
  '0,0' : { meta }
  ...
  '10,10' : { meta }
}
'1' : {
  '10,0' : { meta }
  ...
  '20,10' : { meta }
}
- query by the appropriate grouping
- since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking at querying up to 4 different super column rows per query (see the sketch after this list)
- this seems reasonable, since I'm doing bulk sequential reads, but there is some overhead in pre-filtering and post-filtering
- sucks in terms of flexibility for modifying the size of the proximity search
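The boundary handling is just tile arithmetic. A sketch, assuming 10x10 tiles and a hypothetical 'gx:gy' group key (the example above numbers groups with a single integer, but the idea is the same):

    GROUP = 10  # each group tile covers GROUP x GROUP coordinates

    def covering_groups(cx, cy, radius=5):
        # Group keys whose tiles intersect the query window around (cx, cy).
        # A window no larger than one tile touches at most 4 tiles.
        xs = {(cx - radius) // GROUP, (cx + radius - 1) // GROUP}
        ys = {(cy - radius) // GROUP, (cy + radius - 1) // GROUP}
        return ['%d:%d' % (gx, gy) for gx in sorted(xs) for gy in sorted(ys)]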
Option 3:
- create a secondary index based on the grouping
e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
…
'10,20' : { meta, group='1' }
- query by secondary index (sketch below)
- same as above: this will return some extra data and will need filtering
- no idea how Cassandra stores this data internally, but will the data access here be sequential?
- a little more flexible in terms of proximity search, since I can create multiple grouping types based on the size of the search
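For reference, a sketch of the option 3 query with the Python driver (index, table, and column names are hypothetical; I've called the column 'grp' to stand in for the group column above). A one-off CREATE INDEX is the only schema work:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('geo')  # as in the option 1 sketch

    # One-time schema change (hypothetical names):
    #   CREATE INDEX points_grp_idx ON points (grp);
    rows = session.execute('SELECT * FROM points WHERE grp = %s', ('0',))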
Option 4:
- composite queries?
-- I haven't had time to read up on this much, so I'm not sure whether it would help my use case.
Questions
- I know there are pros and cons to each approach with respect to the flexibility of my search size, but assuming my search proximity size is fixed, which method provides the best performance?
- I guess the main question is: will querying by secondary index be efficient enough, or is it worth grouping the data into super columns?
- Is there a better way to model the data that I haven't thought of?
Re: Data modeling question
Posted by Peter Hsu <pe...@motivecast.com>.
I just read up on composite keys and the apparent future deprecation of super column families.
I guess Option 2 would now be:
- column family with a composite key of grouping and location (CQL sketch after the example)
e.g.
'0:0,0' : { meta }
...
'0:10,10' : { meta }
'1:10,0' : { meta }
...
'1:20,10' : { meta }
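A CQL3 sketch of that layout (names are hypothetical): the group becomes the partition key and the coordinate a clustering column, so each group's points are stored contiguously and one slice per tile replaces the super column read:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('geo')  # as in the earlier sketches

    session.execute('''
        CREATE TABLE points_by_group (
            grp   text,    -- tile id, e.g. '0'
            coord text,    -- 'x,y' within the tile
            meta1 text, meta2 text, meta3 text, meta4 text,
            d1 double, d2 double, d3 double, d4 double,
            PRIMARY KEY (grp, coord)
        )''')

    # One contiguous slice per tile; at most 4 tiles per proximity query.
    rows = session.execute('SELECT * FROM points_by_group WHERE grp = %s', ('0',))

Same read pattern as the super column version, but expressible in CQL3 and not headed for deprecation.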