You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Deno Vichas <de...@syncopated.net> on 2011/11/30 21:57:36 UTC

data modeling question

hey all!

i'm started my first project using cassandra and some data model 
questions.  i'm working on an app that fetches stock market data.  i 
need to keep track of when i fetch a set of data for any given stock in 
any sector;  here's what i think my model should look like;

fetches : {
<sector> : {
          quote : {
<timeuuid>: {
<symbol> : ---
             }
         }
         ticks : {
<timeuuid>: {
<symbol> : ---
             }
         }
         fundamentals : {
<timeuuid>: {
<symbol> : ---
             }
         }
     }
}


is there anything that less an ideal doing it this way versus creating 
separate CF per sector?    how do you create Super CF inside of Super CF 
via the CLI?



thanks,
deno

Re: data modeling question

Posted by Deno Vichas <de...@syncopated.net>.

here's what i ended up, this seems to work for me.


     @Test
     public void readAndWriteSettingTTL() throws Exception {
         int ttl = 2;
         String columnFamily = "Quote";
         Set<String> symbols = new HashSet<String>(){{
                                                 add("appl");
                                                 add("goog");
                                                 add("ibm");
                                                 add("csco");
                                             }};

         UUID timeUUID = TimeUUIDUtils.getUniqueTimeUUIDinMillis();

         Mutator<String> mutator = HFactory.createMutator(_keyspace, 
_stringSerializer);
         for(String symbol : symbols) 
addInsertionToMutator(columnFamily, timeUUID, mutator, symbol, ttl);
         mutator.execute();

         RangeSlicesQuery<String, UUID, String> rangeSlicesQuery = 
HFactory.createRangeSlicesQuery(_keyspace, _stringSerializer, 
_uuidSerializer, _stringSerializer);
         rangeSlicesQuery.setColumnFamily(columnFamily);
         rangeSlicesQuery.setKeys("", "");
         rangeSlicesQuery.setRange(null, null, false, 1);
         QueryResult<OrderedRows<String, UUID, String>> result = 
rangeSlicesQuery.execute();

         UUID uuid = 
result.get().getList().get(0).getColumnSlice().getColumns().get(0).getName();

         Assert.assertEquals("UUID should be the same", timeUUID, uuid);
         Assert.assertEquals("We should have 4 records", 4, 
result.get().getList().size());

         Thread.sleep(5000); // wait till TTL hits to make sure keys are 
getting flushed.

         QueryResult<OrderedRows<String, UUID, String>> result2 = 
rangeSlicesQuery.execute();
         for(Row<String, UUID, String> row : result2.get().getList()) {
             Assert.assertEquals("We should have no records", 0, 
row.getColumnSlice().getColumns().size());
         }

     }

     private void addInsertionToMutator(String columnFamily, UUID 
columnName, Mutator<String> mutator, String symbol, int ttl) {
         mutator.addInsertion(symbol, columnFamily, 
HFactory.createColumn(columnName, "", ttl, _uuidSerializer, 
_stringSerializer));
     }


On 11/30/2011 1:56 PM, David McNelis wrote:
> You wouldn't query for all the keys that have a column name x exactly. 
>  Instead what you would do is for sector x grab your list of symbols 
> S.  Then you would get the last column for each of those symbols 
> (which you do in different ways depending on the API), and then test 
> if that date is within your threshold.  If not, it goes into your list 
> of symbols to fetch.
>
> Alternatively, you could iterate over the symbols grabbing data where 
> the date is between range A and B, if you get an empty set / no 
> columns returned, then you need to re-pull for that symbol.  Does that 
> make sense?
>
> Either way you end up hitting on each of the individual symbols. 
>  Maybe someone else has a better idea of how to structure the data for 
> that particular use case.
>
> On Wed, Nov 30, 2011 at 3:45 PM, Deno Vichas <deno@syncopated.net 
> <ma...@syncopated.net>> wrote:
>
>     with the quote CF below how would one query for all keys that have
>     a column name value that have a timeuuid of later than x minutes?
>      i need to be able to find all symbols that have not been fetch in
>     x minutes by sector.  i know i get list of symbol by sector from
>     my sector CF.
>
>     thanks,
>     deno
>
>
>     On 11/30/2011 1:07 PM, David McNelis wrote:
>
>
>         Then I would have a column family for quotes where I have the
>         key as the symbol, the column name as the timestamp, the value
>         as the quote:
>
>         quote : {
>            key: symbol
>            column names:  timeuuid
>            column values:  quote at that time for that symbol
>         }
>
>
>
>
>
>
> -- 
> *David McNelis*
> Lead Software Engineer
> Agentis Energy
> www.agentisenergy.com <http://www.agentisenergy.com>
> c: 219.384.5143
>
> /A Smart Grid technology company focused on helping consumers of 
> energy control an often under-managed resource./
>
>

Re: data modeling question

Posted by David McNelis <dm...@agentisenergy.com>.

You wouldn't query for all the keys that have a column name x exactly.
 Instead what you would do is for sector x grab your list of symbols S.
 Then you would get the last column for each of those symbols (which you do
in different ways depending on the API), and then test if that date is
within your threshold.  If not, it goes into your list of symbols to fetch.

Alternatively, you could iterate over the symbols grabbing data where the
date is between range A and B, if you get an empty set / no columns
returned, then you need to re-pull for that symbol.  Does that make sense?

Either way you end up hitting on each of the individual symbols.  Maybe
someone else has a better idea of how to structure the data for that
particular use case.

On Wed, Nov 30, 2011 at 3:45 PM, Deno Vichas <de...@syncopated.net> wrote:

> with the quote CF below how would one query for all keys that have a
> column name value that have a timeuuid of later than x minutes?  i need to
> be able to find all symbols that have not been fetch in x minutes by
> sector.  i know i get list of symbol by sector from my sector CF.
>
> thanks,
> deno
>
>
> On 11/30/2011 1:07 PM, David McNelis wrote:
>
>>
>> Then I would have a column family for quotes where I have the key as the
>> symbol, the column name as the timestamp, the value as the quote:
>>
>> quote : {
>>    key: symbol
>>    column names:  timeuuid
>>    column values:  quote at that time for that symbol
>> }
>>
>>
>>
>

-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*

Re: data modeling question

Posted by Deno Vichas <de...@syncopated.net>.

with the quote CF below how would one query for all keys that have a 
column name value that have a timeuuid of later than x minutes?  i need 
to be able to find all symbols that have not been fetch in x minutes by 
sector.  i know i get list of symbol by sector from my sector CF.

thanks,
deno

On 11/30/2011 1:07 PM, David McNelis wrote:
>
> Then I would have a column family for quotes where I have the key as 
> the symbol, the column name as the timestamp, the value as the quote:
>
> quote : {
>     key: symbol
>     column names:  timeuuid
>     column values:  quote at that time for that symbol
> }
>
>

Re: data modeling question

Posted by David McNelis <dm...@agentisenergy.com>.

Personally I would create a separate column family for each basic area.
 For example

To organize my sectors and symbols I would create a column family where the
key is the sector name and the column names are the symbols for that
sector, i.e.:

sector : {
    key: sector name
    Column names: symbols
    Column values: null
}

Then I would have a column family for quotes where I have the key as the
symbol, the column name as the timestamp, the value as the quote:

quote : {
    key: symbol
    column names:  timeuuid
    column values:  quote at that time for that symbol
}

I would then use the same basic structure for your other column families,
ticks and fundamentals.  In general people tend to stay away from super
column families when possible for several reasons, but the most commonly
sited one is that when you get a SCF, the entire SCF must be deserialized
in order to access it.  So if you have a bunch of SCF, you're running a
risk of ending up needing to read in a lot more data than is necessary to
get the information you are looking for.

On Wed, Nov 30, 2011 at 2:57 PM, Deno Vichas <de...@syncopated.net> wrote:

>  hey all!
>
> i'm started my first project using cassandra and some data model
> questions.  i'm working on an app that fetches stock market data.  i need
> to keep track of when i fetch a set of data for any given stock in any
> sector;  here's what i think my model should look like;
>
> fetches : {
>     <sector> : {
>          quote : {
>             <timeuuid>: {
>                 <symbol> : ---
>             }
>         }
>         ticks : {
>             <timeuuid>: {
>                 <symbol> : ---
>             }
>         }
>         fundamentals : {
>             <timeuuid>: {
>                 <symbol> : ---
>             }
>         }
>     }
> }
>
>
> is there anything that less an ideal doing it this way versus creating
> separate CF per sector?    how do you create Super CF inside of Super CF
> via the CLI?
>
>
>
> thanks,
> deno
>
>
>

-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*