Posted to user@cassandra.apache.org by Peter Hsu <pe...@motivecast.com> on 2012/06/30 02:13:10 UTC

Data modeling question

I have a question about the best way to store the data in my schema.

The data
I have millions of nodes, each at a distinct Cartesian coordinate.  The keys for the nodes are hashed based on the coordinate.

My search is a proximity search: I'd like to find all the nodes within a given distance of a particular node.  I can create an arbitrary grouping that buckets nearby nodes together, based on proximity.

e.g. 
group 0 contains all points from (0,0) to (10,10)
group 1 contains all points from (10,0) to (20,10).
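
To make the bucketing concrete, here's a minimal sketch in Python (the 10-unit cell size comes from the example above; CELLS_PER_ROW is a hypothetical world width I'm using only to linearize the group ids):

    CELL = 10              # cell size, from the example above
    CELLS_PER_ROW = 1000   # hypothetical world width, in cells

    def group_of(x, y):
        # group 0 covers (0,0)-(10,10), group 1 covers (10,0)-(20,10), etc.
        gx = int(x) // CELL
        gy = int(y) // CELL
        return gy * CELLS_PER_ROW + gx

    # group_of(5, 5)  -> 0
    # group_of(15, 5) -> 1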

For each coordinate, I store various metadata:
 8 columns: 4 of UTF8Type (~20 bytes each) and 4 of DoubleType

The query
I need a proximity search to return all data within a range of a selected node.  The typical read size is ~100 distinct rows (e.g. a 10x10 grid around the selected node).  Since it's on a coordinate system, I know ahead of time exactly which 100 rows I need.
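
Since the keys are known up front, they can be enumerated client-side; a sketch, assuming integer coordinates and 'x,y' row keys as in the examples below:

    def keys_around(x, y, radius=5):
        # row keys for the 10x10 grid centered on (x, y) -- 100 keys for radius=5
        return ['%d,%d' % (i, j)
                for i in range(x - radius, x + radius)
                for j in range(y - radius, y + radius)]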

The modeling options

Option 1:
 - single column family, with key being the coordinate hash

e.g.
'0,0' : { meta }
'0,1' : { meta }
...
'10,20' : { meta }

 - query for 100 rows in parallel

 - I think this option sucks because it's essentially 100 non-sequential reads?
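
For what it's worth, a client like pycassa can at least batch the 100 reads into a single multiget round trip, though the rows themselves are still scattered by the partitioner (a sketch; 'MyKeyspace' and 'Nodes' are hypothetical names, and keys_around() is the helper sketched earlier):

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    nodes = pycassa.ColumnFamily(pool, 'Nodes')

    # one batched call for all ~100 row keys
    rows = nodes.multiget(keys_around(50, 50))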

Option 2:
 - group my data into super columns, with key being the grouping

e.g.
 '0' : {
   '0,0' : { meta }
   ...
   '10,10' : { meta }
 }
 '1' : {
   '10,0' : { meta }
   ...
   '20,10' : { meta }
 }


 - query by the appropriate grouping
 - since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking at querying up to 4 different super column rows per query (see the sketch after this list)
 - this seems reasonable, since I'm doing bulk sequential reads, but there's some overhead in pre-filtering and post-filtering
 - sucks in terms of flexibility for changing the size of the proximity search
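
Finding those up-to-4 groups just means bucketing the corners of the query window; a sketch, reusing the hypothetical group_of() from earlier:

    def groups_for_window(x, y, radius=5):
        # a 10x10 window can overlap at most 4 of the 10x10 cells
        corners = [(x - radius,     y - radius),
                   (x - radius,     y + radius - 1),
                   (x + radius - 1, y - radius),
                   (x + radius - 1, y + radius - 1)]
        return set(group_of(cx, cy) for cx, cy in corners)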

Option 3:
 - create a secondary index based on the grouping

e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
...
'10,20' : { meta, group='1' }

 - query by secondary index (sketch below)
 - same as above, this will return some extra data and will need filtering
 - no idea how Cassandra stores this data internally, but will the data access here be sequential?
 - a little more flexible in terms of proximity search - I can create multiple grouping types based on the size of the search
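
With Thrift-era secondary indexes the query would look roughly like this in pycassa (a sketch; 'nodes' is the column family handle from the earlier snippet):

    from pycassa.index import create_index_expression, create_index_clause

    expr = create_index_expression('group', '0')     # WHERE group = '0'
    clause = create_index_clause([expr], count=200)
    for key, cols in nodes.get_indexed_slices(clause):
        pass  # post-filter rows outside the actual search window

As far as I understand it, the index is kept as a hidden column family mapping each group value to the row keys that carry it, so the index lookup itself is cheap, but the data rows it points at are still placed by the partitioner and not stored contiguously.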

Option 4:
 - composite queries?
 -- I haven't had time to read up much on this, so I'm not sure whether it would help my use case.

Questions
 - I know there are pros and cons to each approach with respect to the flexibility of my search size, but assuming my search proximity size is fixed, which method provides the best performance?
 - I guess the main question is: will querying by secondary index be efficient enough, or is it worth it to group the data into super columns?
 - Is there a better way to model the data that I haven't thought of?



Re: Data modeling question

Posted by Peter Hsu <pe...@motivecast.com>.
Just read up on composite keys and on the apparent upcoming deprecation of super column families.

I guess Option 2 would now be:

- column family with a composite key built from the grouping and the location (sketch after the example below)

> e.g.
>  '0:0,0' : { meta }
>  ...
>  '0:10,10' : { meta }
>  '1:10,0' : { meta }
>  ...
>  '1:20,10' : { meta }
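
A sketch of that key construction, reusing the hypothetical group_of() from earlier:

    def composite_key(x, y):
        # 'group:x,y' row keys, as in the example above
        return '%d:%d,%d' % (group_of(x, y), x, y)

    # composite_key(5, 5)  -> '0:5,5'
    # composite_key(15, 5) -> '1:15,5'

One caveat: with RandomPartitioner these row keys still hash to scattered tokens, so as far as I can tell the grouping only buys read locality if the group becomes the row key and the coordinate moves into a composite column name (one wide row per group), which is the usual replacement for super columns.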


