Posted to user@cassandra.apache.org by Marcelo Elias Del Valle <mv...@gmail.com> on 2012/09/19 21:02:16 UTC

Correct model

I am new to Cassandra and to NoSQL in general.
I built my first model and any comments would be of great help. I am
describing my thoughts below.

It's a very simple model. I will need to store several users and, for each
user, several requests. Each request has its insertion time. As the query
comes first, here are the only queries I will need to run against this model:
- Select all the requests for a user
- Select all the users that have new requests since date D

I created the following model: a UserCF, whose key is a userID generated
as a TimeUUID, and a RequestCF, whose key is composite: UserUUID + timestamp.
For each user, I will store basic data and, for each request, I will insert
a lot of columns.

My questions:
- Is the strategy of using a composite key good for this case? I thought
about other solutions, but this one seemed to be the best. Another solution
would be to have a non-composite key of type UUID for the requests, and have
another CF to relate user and request.
- To perform the second query, instead of checking whether each user has a
request inserted after date D, I thought of storing the last request
insertion date in the UserCF every time I insert a new request for the user
(sketched below). It would be data replication, but I would have no
read-before-write and I am guessing the second query would perform faster.
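
Here is the sketch of that write I mentioned above (assuming the Astyanax-style column families from the previous snippet; the mutation-batch calls are the standard Astyanax ones, but treat this as pseudocode for my model rather than tested code):

import java.util.UUID;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;

public class RequestWriter {

    // One batch writes the request AND updates the user's last request date: no read-before-write
    public void insertRequest(Keyspace keyspace, UUID userId, String payload) throws ConnectionException {
        Model.RequestKey key = new Model.RequestKey();
        key.userId = userId;
        key.timestamp = System.currentTimeMillis();

        MutationBatch m = keyspace.prepareMutationBatch();

        // all the columns for this request go under the composite row key
        m.withRow(Model.REQUEST_CF, key).putColumn("payload", payload, null);

        // denormalized copy of the latest request time on the user row
        m.withRow(Model.USER_CF, userId).putColumn("last_request_date", key.timestamp.longValue(), null);

        m.execute();
    }
}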

Any thoughts?

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Just FYI, some of these are Cassandra questions…

Dean,

    In the playOrm data modeling, if I understood it correctly, every CF has its own id, right?

No, each entity has a field annotated with @NoSqlId.  That tells PlayOrm this is the row key.  Each INSTANCE of the entity is a row in Cassandra (very much like Hibernate for an RDBMS).  So every instance of Activity has a different NoSqlId (NOTE: ids are auto-generated, so you don't need to deal with them, though you can set them manually if you like).
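
For example, a minimal sketch of such an entity (the annotation import path here is from memory and may not match the real PlayOrm package exactly):

import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;

@NoSqlEntity
public class Activity {

    @NoSqlId                 // the row key; auto-generated unless you set it yourself
    private String id;

    private String type;
    private long numShares;

    // getters/setters omitted
}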

For instance, User would have its own ID, Activities would have their own ids, etc.

User has a field private String id; annotated with @NoSqlId, so each INSTANCE of User has its own id and each INSTANCE of Activity has its own id.

What if I have a trillion activities?

This is fine and is a normal cassandra use-case.  In fact, this is highly desirable in nosql stores and retrieving by key is desired when possible.

Wouldn't it be a problem to have one row id for each activity?

Nope, no problems.

     Cassandra always indexes by row id, right?

If you do CQL and Cassandra partitioning/indexing, then yes, BUT if you do PlayOrm partitioning, then NO.  PlayOrm indexes your columns and there is ONE index for EACH partition, so if you have 1 trillion rows and 1 billion partitions, then each index is on average only 1000 rows, and you can do a quick query into an index that only has 1000 values.

If I have too many row ids without using composite keys, will it scale the same way?

Yes, though partitioning is the key….you must decide your partitioning so that partitions (or I could say indices) do not have a very high row count.  I currently keep them under 1 million rows, but I would say it slows down somewhere in the millions of rows per partition (i.e. you can get pretty big, but smaller can be better).

Wouldn't each insert of an activity take longer and longer because I have too many activities?

Nope, this is really a Cassandra question, and Cassandra is optimized, as all NoSQL stores are, to put and read values by key.  They all work best that way.

Behind the scenes there is a meta table that PlayOrm writes to (one row per Java class you create that is annotated with @NoSqlEntity), and that is used to drive the ad-hoc tool, so you can query into Cassandra and get the real values back and see them instead of hex.

Best regards,
Marcelo Valle.

2012/9/25 Hiller, Dean <De...@nrel.gov>
If you need anything added/fixed, just let PlayOrm know.  PlayOrm has been able to add things quickly so far…that may change as more and more requests come, but so far PlayOrm seems to have managed to keep up.

We are using it live already, by the way.  It works out very well so far for us (we have 5000 column families, obviously dynamically created instead of by hand…a very interesting use case of Cassandra).  In our live environment we configured Astyanax with LOCAL_QUORUM on reads AND writes, so CP style: we can afford one node out of 3 to go down, but if two go down it stops working, THOUGH there is a patch in Astyanax to auto-switch from LOCAL_QUORUM to consistency level ONE for reads/writes when two nodes go down, which we would like to pull in eventually so it is always live (I don't think Hector has that, and it is a really NICE feature….i.e. fail the LOCAL_QUORUM read/write and then retry with a consistency level of one).
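
For reference, that configuration is roughly the following with Astyanax (a sketch; the cluster/keyspace names and seeds are placeholders, and getEntity() was the 1.x way to get the Keyspace, later renamed getClient()):

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class CassandraSetup {

    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("MyCluster")                           // placeholder
            .forKeyspace("MyKeyspace")                         // placeholder
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                // LOCAL_QUORUM on reads AND writes, as described above
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                .setSeeds("node1:9160,node2:9160,node3:9160")  // placeholder seeds
                .setPort(9160))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getEntity();
    }
}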

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 24, 2012 1:54 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Correct model

Dean, this sounds like magic :D
I don't know the details of the performance of the index implementations you chose, but it would pay off to use it in my case, as I don't need the best read performance in the world, but I do need to ensure scalability and have a simple model to maintain. I liked the playOrm concept regarding this.
I have more doubts, but I will ask them on Stack Overflow from now on.

2012/9/24 Hiller, Dean <De...@nrel.gov>
PlayOrm will automatically create a CF to index my CF?

It creates 3 CFs for all indices (IntegerIndice, DecimalIndice, and StringIndice), such that the ad-hoc tool that is in development can display the indices: it knows the prefix of the composite column name is Integer, Decimal or String, and it knows the postfix type as well, so it can translate back from bytes to the real types and display them properly in a GUI (i.e. on top of SELECT, the ad-hoc tool is adding a way to view the indice rows so you can check whether they got corrupted or not).

Will it auto-manage it, like Cassandra's secondary indexes?

YES

Further detail…

You annotate fields with @NoSqlIndexed and PlayOrm adds/removes from the index as you add/modify/remove the entity…..a modify removes the old value from the index and inserts the new value into the index.

An example: PlayOrm stores all long, int, short, and byte values in a type that uses the least amount of space, so IF you have a long OR BigInteger between -128 and 128 it only ends up storing 1 byte in Cassandra (SAVING tons of space!!!).  Then, if you are indexing a type that is one of those, PlayOrm creates an IntegerIndice table.
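
So, continuing the Activity sketch from earlier, indexing is just the extra annotation (package path assumed again):

import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlIndexed;

@NoSqlEntity
public class Activity {

    @NoSqlId
    private String id;

    @NoSqlIndexed          // numeric, so its index rows live in the IntegerIndice CF
    private long numShares;

    @NoSqlIndexed          // String, so its index rows live in the StringIndice CF
    private String type;
}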

Right now, another guy is working on playorm-server, which is a web GUI to allow ad-hoc access to all your data as well, so you can run ad-hoc queries to see data and, instead of showing hex, it shows the real values by translating the bytes to String for the schema portions that it is aware of, that is.

Later,
Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 24, 2012 12:09 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Correct model

Dean,

    There is one last thing I would like to ask about playOrm on this list; the next questions will go to Stack Overflow. Just because of the context, I prefer asking this here:
     When you say playOrm indexes a table (which would be a CF behind the scenes), what do you mean? PlayOrm will automatically create a CF to index my CF? Will it auto-manage it, like Cassandra's secondary indexes?
     In Cassandra, the application is responsible for maintaining the index, right? I might be wrong, but unless I am using secondary indexes I need to update index values manually, right?
     I got confused when you said "PlayOrm indexes the columns you choose". How do I choose, and what exactly does it mean?

Best regards,
Marcelo Valle.

2012/9/24 Hiller, Dean <De...@nrel.gov>
Oh, ok, you were talking about the wide row pattern, right?

yes

But playORM is compatible with Aaron's model, isn't it?

Not yet. PlayOrm supports partitioning one table multiple ways as it indexes the columns (in your case, the userid FK column and the time column).

Can I map exactly this using playORM?

Not yet, but the plan is to map these typical Cassandra scenarios as well.

 Can I ask playOrm questions in this list?

The best place to ask PlayOrm questions is on Stack Overflow, tagged with PlayOrm, though I monitor both this list and Stack Overflow for questions (there are already a few questions on Stack Overflow).

The examples directory is empty for now, I would like to see how to set up the connection with it.

Running build or build.bat is always kept working and all 62 tests pass (or we don't merge to master), so to see how to make a connection or run an example:

 1.  Run build.bat or build, which generates parsing code
 2.  Import into Eclipse (the .classpath and .project files are already there for you)
 3.  In FactorySingleton.java you can switch IN_MEMORY to CASSANDRA or back, and run any of the tests in-memory or against localhost (we also run the test suite against a 6-node cluster and everything passes)
 4.  FactorySingleton probably has the code you are looking for; plus you need a class called nosql.Persistence or it won't scan your jar file (a class file, not an xml file like JPA uses)

Do you mean I need to load all the keys in memory to do a multiget?

No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor rather than the results, so you can loop through every key and behind the scenes it is doing batch requests: you can load up 100 keys, make one multiget request for those 100 keys, then load up the next 100 keys, etc.  I need to look more into the APIs and protocol of CQL to see if it allows this style of batching.  PlayOrm does support this style of batching today.  Aaron would know if CQL does.

Why did you move? Hector is being considered to be the "official" client for Cassandra, isn't it?

At the time, I wanted the file-streaming feature.  Also, Hector seemed a bit cumbersome compared to Astyanax, at least if you were building a platform and had no use for typing the columns.  Just personal preference, really.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

PlayOrm indexes the columns you choose (i.e. the ones you want to use in the where clause) and partitions by columns you choose, not by the key, so in PlayOrm the key is typically a TimeUUID or something cluster-unique…..any tables referencing that TimeUUID never have to change.  With Cassandra partitioning, if you repartition that table a different way or make some kind of major change (usually done with map/reduce), all your foreign keys "may" have to change….it really depends on the situation, though.  Maybe you get the design right and never have to change.

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query?

In this case, t, or TABLE, is a partitioned table since a partition is defined.  And t.activityTypeInfo refers to the ActivityTypeInfo table, which is not partitioned (AND ActivityTypeInfo won't scale to billions of rows because there is no partitioning, but maybe you don't need it!!!).  Behind the scenes, when you call getResult, it returns a cursor that has NOT done anything yet.  When you start looping through the cursor, behind the scenes it is batching requests, asking for the next 500 matches (configurable) so you never run out of memory….it is EXACTLY like a database cursor.  You can even use the cursor to show a user the first set of results and, when the user clicks next, pick up right where the cursor left off (if you saved it to the HttpSession).
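
A hedged sketch of driving that query from code (only getResult and the lazy cursor behaviour come from the explanation above; the entity-manager and parameter-setting calls are guesses at a JPA-like PlayOrm API, so double-check them against the real source):

// WARNING: method names below other than getResult() are assumptions, not verified PlayOrm API
Query<Trade> query = em.createNamedQuery(Trade.class, "findWithJoinQuery");  // assumed
query.setParameter("partId", "account55");                                   // assumed
query.setParameter("type", "BUY");                                           // assumed
query.setParameter("shares", 100);                                           // assumed

Cursor<Trade> cursor = query.getResult();    // does no work yet, per the description above
while (cursor.next()) {                      // assumed iteration style
    Trade trade = cursor.getCurrent();       // batches of 500 fetched lazily behind the scenes
    // ... process or render the trade
}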

You can only use joins with partition keys, right?

Nope, joins work on anything.  You only need to specify the partitionId when you have a partitioned table in the list of join tables (that is what the PARTITIONS clause is for, to identify partitionId = what?)…it was put BEFORE the SQL instead of within it…CQL took the opposite approach, but PlayOrm can also join different partitions together as well ;).

In this case, is partId the row id of TABLE CF?

Nope, partId is one of the columns.  There is a test case on this class in PlayOrm (notice the annotation NoSqlPartitionByThisField on the column/field in the entity):

https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java
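
In entity terms, that looks roughly like this (a sketch; only the annotation names come from the discussion and the test case above, the field layout is illustrative):

import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlIndexed;
import com.alvazan.orm.api.base.anno.NoSqlPartitionByThisField;

@NoSqlEntity
public class Trade {

    @NoSqlId
    private String id;

    @NoSqlPartitionByThisField   // the partId column; queries select a partition with PARTITIONS t(:partId)
    private String partId;

    @NoSqlIndexed                // usable in the WHERE clause, e.g. t.numShares < :shares
    private long numShares;
}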

PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned tables won't scale, but maybe you will never have that many rows).  You can join any two combinations (non-partitioned with partitioned, non-partitioned with non-partitioned, one partition with another partition).

I only prefer Stack Overflow because I like referencing links/questions by their URLs.  Referencing this email later is very hard, as I have to find it, so in general I HATE email lists ;) but it seems Cassandra prefers them.  Any questions on PlayOrm you can put there; I am not sure how many people on this list are interested, so it creates less noise here too.

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 24, 2012 11:07 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Correct model



2012/9/24 Hiller, Dean <De...@nrel.gov>
I am confused.  In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D" so let me answer both…

I have both needs. These are the two queries I need to perform on the model.

For the latter, you make ONE query into the latest partition (ONE partition) of the GlobalRequestsCF, which gives you the most recent requests ALONG with the user ids of those requests.  If you queried all partitions, you would most likely blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user.
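
Sketched with a Thrift client (Astyanax-style; the CF constants, key variables, and column types are placeholders for Aaron's layout, not real definitions):

// former: all requests for one user, ONE row key lookup
ColumnList<UUID> userRequests = keyspace.prepareQuery(USER_REQUESTS_CF)
    .getKey(userId)
    .execute()
    .getResult();

// latter: recent requests (and their user ids), read ONLY the latest time-bucket partition row
ColumnList<UUID> recentRequests = keyspace.prepareQuery(GLOBAL_REQUESTS_CF)
    .getKey(latestPartitionId)
    .execute()
    .getResult();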

Now I think I got the main idea! This answered a lot!

Sorry, I was skipping some context.  A lot of the backing indexing is sometimes done as a long row, so in playOrm too many rows in a partition means too many columns in the indexing row for that partition.  I believe the same is true in Cassandra for its indexing.

Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with Aaron's model, isn't it? Can I map exactly this using playORM? The hardest thing about using playORM for me now is that I don't know Cassandra well yet, and I know playORM even less. Can I ask playOrm questions on this list? I will try to create a POC here!
Only now am I starting to understand what it does ;-) The examples directory is empty for now; I would like to see how to set up the connection with it.

Cassandra spreads all your data out on all nodes with or without partitions.  A single partition does have its data co-located, though.

Now I see. The main advantage of using partitions is keeping the indexes small enough. It has nothing to do with the nodes. Thanks!

If you are at 100k (and the requests are rather small), you could embed all the requests in the user, or go with Aaron's suggestion below of a UserRequestsCF.  If your requests are rather large, you probably don't want to embed them in the User.  Either way, it's one query or one row key lookup.

I see it now.

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It just so happens that partitionId had to be part of your row key.

Do you mean I need to load all the keys in memory to do a multiget?

I have used Hector and now use Astyanax.  I don't worry much about that layer, but I feed Astyanax 3 nodes and I believe it discovers some of the other ones.  I believe the latter is true but am not 100% sure, as I have not looked at that code.

Why did you move? Hector is being considered to be the "official" client for Cassandra, isn't it? I looked at the Astyanax API and it seemed much more high level, though.

As an analogy on the above, if you happened to use PlayOrm, you would ONLY need one Requests table and you would partition it by user AND by time (two views into the same data, partitioned two different ways), and you could do exactly the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids in the key, leaving it free to partition twice like in your case….and in a refactor, you have to map/reduce A LOT more rows when rows carry an FK of <partitionid><subrowkey>, whereas if you don't have the partition id in the key, you only map/reduce the partitioned table in a redesign/refactor.  That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning, even though it can be a little less flexible sometimes.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

Also, CQL locates all the data for a partition on one node.  We have found it can sometimes be faster, with the parallelized disks, when the partitions are NOT all on one node, so PlayOrm partitions are virtual only and do not relate to where the rows are stored.  An example on our 6 nodes: a join query on a partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here since it doesn't do joins).  It really depends how much data is going to come back in the query, though.  There are tradeoffs between disk-parallel nodes and having your data all on one node, of course.

I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:


@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only use joins with partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr




Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
Dean,

    In the playOrm data modeling, if I understood it correctly, every CF
has its own id, right? For instance, User would have its own ID, Activities
would have their own ids, etc. What if I have a trillion activities? Wouldn't
it be a problem to have one row id for each activity?
     Cassandra always indexes by row id, right? If I have too many row ids
without using composite keys, will it scale the same way? Wouldn't the time
to insert an activity get longer and longer because I have too many
activities?

Best regards,
Marcelo Valle.

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Oh, and if you really want to scale very easily, just use Play Framework
1.2.5 ;).  We use that and, since it is stateless, to scale up you simply
add more servers.  Also, it's like coding in PHP or Ruby, etc., as far as
development speed goes (no server restarts), so it's a pretty nice framework.
We tried the 2.x version, but it is just too unproductive with server restarts.

If you do use Play Framework, let me know and I can send you the startup
code we use in Play Framework so you can simply call NoSql.em() to get
that request's NoSqlEntityManager.  A Play Framework plugin will be
developed as well for the 1.2.x line.
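
For example, a Play 1.2.x controller action could then look roughly like this (a sketch; NoSql.em() is the helper from the startup code mentioned above, and the find() call plus the import path for NoSqlEntityManager are assumptions on my side):

import play.mvc.Controller;

import com.alvazan.orm.api.base.NoSqlEntityManager;   // package path assumed

public class Requests extends Controller {

    public static void forUser(String userId) {
        NoSqlEntityManager em = NoSql.em();            // per-request entity manager from the startup code
        User user = em.find(User.class, userId);       // find() assumed to behave like JPA's find
        render(user);                                  // standard Play 1.x template rendering
    }
}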

Later,
Dean


Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
If you need anything added/fixed, just let PlayOrm know.  PlayOrm has been able to add things quickly so far…that may change as more and more requests come, but so far PlayOrm seems to have managed to keep up.

We are using it live already, by the way.  It works out very well so far for us (we have 5000 column families, obviously dynamically created instead of by hand…a very interesting use case of Cassandra).  In our live environment we configured Astyanax with LOCAL_QUORUM on reads AND writes, so CP style: we can afford one node out of 3 to go down, but if two go down it stops working, THOUGH there is a patch in Astyanax to auto-switch from LOCAL_QUORUM to consistency level ONE for reads/writes when two nodes go down, which we would like to pull in eventually so it is always live (I don't think Hector has that, and it is a really NICE feature….i.e. fail the LOCAL_QUORUM read/write and then retry with a consistency level of one).

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Monday, September 24, 2012 1:54 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Correct model

Dean, this sounds like magic :D
I don't know details about the performance on the index implementations you chose, but it would pay the way to use it in my case, as I don't need the best performance in the world when reading, but I need to assure scalability and have a simple model to maintain. I liked the playOrm concept regarding this.
I have more doubts, but I will ask them at stack over flow from now on.

2012/9/24 Hiller, Dean <De...@nrel.gov>>
PlayOrm will automatically create a CF to index my CF?

It creates 3 CF's for all indices, IntegerIndice, DecimalIndice, and StringIndice such that the ad-hoc tool that is in development can display the indices as it knows the prefix of the composite column name is of Integer, Decimal or String and it knows the postfix type as well so it can translate back from bytes to the types and properly display in a GUI (i.e. On top of SELECT, the ad-hoc tool is adding a way to view the induce rows so you can check if they got corrupt or not).

Will it auto-manage it, like Cassandra's secondary indexes?

YES

Further detail…

You annotate fields with @NoSqlIndexed and PlayOrm adds/removes index entries as you add/modify/remove the entity; a modify removes the old value from the index and inserts the new value.

For example, PlayOrm stores all long, int, short, and byte values in a form that uses the least amount of space, so if you have a long or a BigInteger between -128 and 127 it ends up storing only 1 byte in Cassandra (SAVING tons of space!!!).  Then, if you are indexing a field of one of those types, PlayOrm creates an IntegerIndice table.
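A rough sketch of such an entity (field names invented, and the annotation package path written from memory, so double-check against the PlayOrm source); PlayOrm keeps the corresponding index rows up to date on every put/remove:

import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlIndexed;

@NoSqlEntity
public class Request {
    @NoSqlId                  // auto-generated, cluster-unique row key
    private String id;

    @NoSqlIndexed             // ends up in the StringIndice CF
    private String type;

    @NoSqlIndexed             // small values are stored compactly and go into the IntegerIndice CF
    private long numShares;

    // getters/setters omitted for brevity
}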

Right now, another guy is working on playorm-server, which is a web GUI that allows ad-hoc access to all your data.  You can run ad-hoc queries to see data and, instead of showing hex, it shows the real values by translating the bytes to Strings for the schema portions it is aware of.

Later,
Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 24, 2012 12:09 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Correct model

Dean,

    There is one last thing I would like to ask about playOrm on this list; the next questions will come via StackOverflow. Just because of the context, I prefer asking this one here:
     When you say playOrm indexes a table (which would be a CF behind the scenes), what do you mean? Will PlayOrm automatically create a CF to index my CF? Will it auto-manage it, like Cassandra's secondary indexes?
     In Cassandra, the application is responsible for maintaining the index, right? I might be wrong, but unless I am using secondary indexes I need to update index values manually, right?
     I got confused when you said "PlayOrm indexes the columns you choose". How do I choose, and what exactly does that mean?

Best regards,
Marcelo Valle.

2012/9/24 Hiller, Dean <De...@nrel.gov>
Oh, ok, you were talking about the wide row pattern, right?

yes

But playORM is compatible with Aaron's model, isn't it?

Not yet; PlayOrm supports partitioning one table multiple ways as it indexes the columns (in your case, the userid FK column and the time column)

Can I map exactly this using playORM?

Not yet, but the plan is to map these typical Cassandra scenarios as well.

 Can I ask playOrm questions in this list?

The best place to ask PlayOrm questions is on Stack Overflow, tagged with playorm, though I monitor both this list and Stack Overflow for questions (there are already a few questions on Stack Overflow).

The examples directory is empty for now, I would like to see how to set up the connection with it.

Running build or build.bat is always kept working and all 62 tests must pass (or we don't merge to master), so to see how to make a connection or run an example:

 1.  Run build.bat or build, which generates the parsing code
 2.  Import into Eclipse (the .classpath and .project files are already there for you)
 3.  In FactorySingleton.java you can change IN_MEMORY to CASSANDRA and run any of the tests in memory or against localhost (we also run the test suite against a 6 node cluster and everything passes)
 4.  FactorySingleton probably has the code you are looking for; you also need a class called nosql.Persistence or PlayOrm won't scan your jar file (a class file, not an xml file like in JPA)

Do you mean I need to load all the keys in memory to do a multi get?

No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor rather than the full result set, so you can loop through every key while, behind the scenes, it does batched requests: it loads up 100 keys, makes one multiget request for those 100 keys, then loads up the next 100 keys, and so on.  I need to look more into the APIs and protocol of CQL to see if it allows this style of batching.  PlayOrm does support this style of batching today; Aaron would know if CQL does.
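Done by hand against Astyanax, that batching looks roughly like the sketch below (the column family, key type and batch size are made up for the example; PlayOrm's Cursor does the equivalent of this loop for you behind the scenes):

import java.util.List;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.Row;
import com.netflix.astyanax.model.Rows;
import com.netflix.astyanax.serializers.StringSerializer;

public class BatchedMultiGet {
    // Hypothetical CF holding requests keyed by a String row key
    private static final ColumnFamily<String, String> CF_REQUESTS =
        new ColumnFamily<String, String>("Requests",
            StringSerializer.get(), StringSerializer.get());

    public static void readInBatches(Keyspace keyspace, List<String> allKeys) throws Exception {
        int batchSize = 100;                      // load 100 keys per multiget
        for (int i = 0; i < allKeys.size(); i += batchSize) {
            List<String> batch = allKeys.subList(i, Math.min(i + batchSize, allKeys.size()));
            Rows<String, String> rows = keyspace.prepareQuery(CF_REQUESTS)
                .getKeySlice(batch)               // one multiget for this batch of keys
                .execute()
                .getResult();
            for (Row<String, String> row : rows) {
                // process row.getKey() / row.getColumns() here
            }
        }
    }
}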

Why did you move? Hector is being considered for being the "official" client for Cassandra, isn't it?

At the time, I wanted the file streaming feature.  Also, Hector seemed a bit cumbersome compared to Astyanax, at least if you were building a platform and had no use for typed columns.  Just personal preference really.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

PlayOrm indexes the columns you choose (i.e. the ones you want to use in the where clause) and partitions by columns you choose, not by the key, so in PlayOrm the key is typically a TimeUUID or something cluster-unique; any tables referencing that TimeUUID never have to change.  With Cassandra partitioning, if you repartition that table a different way or make some other kind of major change (usually done with map/reduce), all your foreign keys "may" have to change; it really depends on the situation though.  Maybe you get the design right and never have to change it.

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query?

In this case, t (TABLE) is a partitioned table, since a partition is specified.  And t.activityTypeInfo refers to the ActivityTypeInfo table, which is not partitioned (so ActivityTypeInfo won't scale to billions of rows because there is no partitioning, but maybe you don't need it to!).  Behind the scenes, when you call getResult it returns a cursor that has NOT done anything yet.  When you start looping through the cursor, it batches requests behind the scenes, asking for the next 500 matches (configurable) so you never run out of memory; it is EXACTLY like a database cursor.  You can even use the cursor to show a user the first set of results and, when the user clicks next, pick up right where the cursor left off (if you saved it to the HttpSession).

You can only use joins with partition keys, right?

Nope, joins work on anything.  You only need to specify the partitionId when you have a partitioned table in the list of join tables (that is what the PARTITIONS clause is for: to identify which partitionId to use).  It was put BEFORE the SQL instead of within it; CQL took the opposite approach.  PlayOrm can also join different partitions together as well ;).

In this case, is partId the row id of TABLE CF?

Nope, partId is one of the columns.  There is a test case on this class in PlayOrm (notice the NoSqlPartitionByThisField annotation on the column/field in the entity):

https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java

PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned tables won't scale, but maybe you will never have that many rows).  You can join any combination: non-partitioned with partitioned, non-partitioned with non-partitioned, or one partition with another partition.
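A stripped-down sketch of a partitioned entity, loosely modeled on that PartitionedSingleTrade test case (field names invented, and the annotation package path written from memory):

import com.alvazan.orm.api.base.anno.NoSqlEntity;
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlIndexed;
import com.alvazan.orm.api.base.anno.NoSqlPartitionByThisField;

@NoSqlEntity
public class Trade {
    @NoSqlId                       // stays stable even if you later repartition
    private String id;

    @NoSqlPartitionByThisField     // the virtual partition; this is what :partId binds to
    private String accountId;

    @NoSqlIndexed                  // indexed so it can be used in the WHERE clause
    private long numShares;

    // relations and getters/setters omitted for brevity
}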

I only prefer StackOverflow because I like referencing links/questions by their URLs.  Referencing this email later is very hard, as I have to go find it, so in general I HATE email lists ;) but it seems Cassandra prefers them.  Any questions on PlayOrm you can put there; I am not sure how many people on this list are interested, so it creates less noise on this list too.

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Monday, September 24, 2012 11:07 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Correct model



2012/9/24 Hiller, Dean <De...@nrel.gov>
I am confused.  In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D" so let me answer both…

I have both needs. These are the two queries I need to perform on the model.

For the latter, you make ONE query into the latest partition (ONE partition) of the GlobalRequestsCF, which gives you the most recent requests ALONG with the user ids of those requests.  If you queried all partitions, you would most likely blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user.
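In raw Astyanax terms, each of those is a single row read; roughly like the sketch below, assuming long timestamps as column names, with the CF layouts and names made up to match the discussion:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.LongSerializer;
import com.netflix.astyanax.serializers.StringSerializer;

public class RequestQueries {
    // Hypothetical layout: one row per user, columns ordered by request timestamp
    private static final ColumnFamily<String, Long> CF_USER_REQUESTS =
        new ColumnFamily<String, Long>("UserRequestsCF",
            StringSerializer.get(), LongSerializer.get());

    // Hypothetical layout: one row per time partition (e.g. per month),
    // columns ordered by request timestamp, column value = userid
    private static final ColumnFamily<String, Long> CF_GLOBAL_REQUESTS =
        new ColumnFamily<String, Long>("GlobalRequestsCF",
            StringSerializer.get(), LongSerializer.get());

    // "all the requests for a user" == one row read
    public static ColumnList<Long> requestsForUser(Keyspace ks, String userId) throws Exception {
        return ks.prepareQuery(CF_USER_REQUESTS).getKey(userId).execute().getResult();
    }

    // "users with new requests since date D" == one slice of the latest partition row
    public static ColumnList<Long> requestsSince(Keyspace ks, String partitionId, long sinceMillis) throws Exception {
        return ks.prepareQuery(CF_GLOBAL_REQUESTS)
                 .getKey(partitionId)
                 .withColumnRange(sinceMillis, Long.MAX_VALUE, false, 10000)  // cap the slice size
                 .execute()
                 .getResult();
    }
}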

Now I think I got the main idea! This answered a lot!

Sorry, I was skipping some context.  A lot of the backing indexing is often done as one long row, so in playOrm too many rows in a partition means too many columns in the indexing row for that partition.  I believe the same is true of Cassandra's own indexing.

Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with Aaron's model, isn't it? Can I map exactly this model using playORM? The hardest thing for me about using playORM now is that I don't know Cassandra well yet, and I know playORM even less. Can I ask playOrm questions on this list? I will try to create a POC here!
Only now am I starting to understand what it does ;-) The examples directory is empty for now; I would like to see how to set up the connection with it.

Cassandra spreads all your data out over all nodes, with or without partitions.  A single partition does have its data co-located though.

Now I see. The main advantage of using partitions is keeping the indexes small enough. It has nothing to do with the nodes. Thanks!

If you are at 100k (and the requests are rather small), you could embed all the requests in the user or go with Aaron's suggestion below of a UserRequestsCF.  If your requests are rather large, you probably don't want to embed them in the User.  Either way, it's one query or one row key lookup.

I see it now.

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It just so happens that partitionId had to be part of your row key.

Do you mean I need to load all the keys in memory to do a multiget?

I have used Hector and now use Astyanax.  I don't worry much about that layer, but I feed Astyanax 3 nodes and I believe it discovers some of the other ones.  I believe that is true but am not 100% sure, as I have not looked at that code.

Why did you move? Hector is being considered as the "official" client for Cassandra, isn't it? I looked at the Astyanax API and it seemed much more high level though.

As an analogy to the above, if you happened to be using PlayOrm, you would ONLY need one Requests table and you would partition it by user AND by time (two views into the same data, partitioned two different ways), and you could do exactly the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids in the key, leaving it free to partition twice as in your case.  Also, in a refactor you have to map/reduce A LOT more rows when rows carry an FK of <partitionid><subrowkey>, whereas if you don't have the partition id in the key, you only map/reduce the partitioned table in a redesign/refactor.  That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning, even though it can be a little less flexible sometimes.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

Also, CQL locates all the data for a partition on one node.  We have found it can sometimes be faster, with parallelized disks, when a partition's rows are NOT all on one node, so PlayOrm partitions are virtual only and do not affect where the rows are stored.  An example on our 6 nodes was a join query on a partition with 1,000,000 rows that took 60ms (of course I can't compare to CQL here, since it doesn't do joins).  It really depends on how much data is going to come back in the query, though.  There are tradeoffs between parallelizing across disks/nodes and having your data all on one node, of course.

I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:


@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only use joins with partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr





Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Oh, ok, you were talking about the wide row pattern, right?

yes

But playORM is compatible with Aaron's model, isn't it?

Not yet, PlayOrm supports partitioning one table multiple ways as it indexes the columns(in your case, the userid FK column and the time column)

Can I map exactly this using playORM?

Not yet, but the plan is to map these typical Cassandra scenarios as well.

 Can I ask playOrm questions in this list?

The best place to ask PlayOrm questions is on stack overflow and tag with PlayOrm though I monitor this list and stack overflow for questions(there are already a few questions on stack overflow).

The examples directory is empty for now, I would like to see how to set up the connection with it.

Running build or build.bat is always kept working and all 62 tests pass(or we don't merge to master) so to see how to make a connection or run an example

 1.  Run build.bat or build which generates parsing code
 2.  Import into eclipse (it already has .classpath and .project for you already there)
 3.  In FactorySingleton.java you can modify IN_MEMORY to CASSANDRA or not and run any of the tests in-memory or against localhost(We run the test suite also against a 6 node cluster as well and all passes)
 4.  FactorySingleton probably has the code you are looking for plus you need a class called nosql.Persistence or it won't scan your jar file.(class file not xml file like JPA)

Do you mean I need to load all the keys in memory to do a multi get?

No, you batch.  I am not sure about CQL, but PlayOrm returns a Cursor not the results so you can loop through every key and behind the scenes it is doing batch requests so you can load up 100 keys and make one multi get request for those 100 keys and then can load up the next 100 keys, etc. etc. etc.  I need to look more into the apis and protocol of CQL to see if it allows this style of batching.  PlayOrm does support this style of batching today.  Aaron would know if CQL does.

Why did you move? Hector is being considered for being the "official" client for Cassandra, isn't it?

At the time, I wanted the file streaming feature.  Also, Hector seemed a bit cumbersome as well compared to astyanax or at least if you were building a platform and had no use for typing the columns.  Just personal preference really here.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

PlayOrm indexes the columns you choose(ie. The ones you want to use in the where clause) and partitions by columns you choose not based on the key so in PlayOrm, the key is typically a TimeUUID or something cluster unique…..any tables referencing that TimeUUID never have to change.  With Cassandra partitioning, if you repartition that table a different way or go for some kind of major change(usually done with map/reduce), all your foreign keys "may" have to change….it really depends on the situation though.  Maybe you get the design right and never have to change.

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query?

In this case, t or TABLE is a partitioned table since a partition is defined.  And t.activityTypeInfo refers to the ActivityTypeInfo table which is not partitioned(AND ActivityTypeInfo won't scale to billions of rows because there is no partitioning but maybe you don't need it!!!).  Behind the scenes when you call getResult, it returns a cursor that has NOT done anything yet.  When you start looping through the cursor, behind the scenes it is batching requests asking for next 500 matches(configurable) so you never run out of memory….it is EXACTLY like a database cursor.  You can even use the cursor to show a user the first set of results and when user clicks next pick up right where the cursor left off (if you saved it to the HttpSession).

You can only use joins with partition keys, right?

Nope, joins work on anything.  You only need to specify the partitionId when you have a partitioned table in the list of join tables (that is what the PARTITIONS clause is for, to identify partitionId = what?).  It was put BEFORE the SQL instead of within it; CQL took the opposite approach.  PlayOrm can also join different partitions together as well ;)

In this case, is partId the row id of TABLE CF?

Nope, partId is one of the columns.  There is a test case on this class in PlayOrm …(notice the annotation NoSqlPartitionByThisField on the column/field in the entity)…

https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java
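In the same spirit, here is a rough sketch of what that test entity conveys: the row key is a plain id and a separate field carries the partition. @NoSqlPartitionByThisField is the annotation named above; the entity annotation, import paths and field names are assumptions:

import com.alvazan.orm.api.base.anno.NoSqlEntity;                 // import paths assumed
import com.alvazan.orm.api.base.anno.NoSqlId;
import com.alvazan.orm.api.base.anno.NoSqlPartitionByThisField;

// Sketch: partId is just a column; PlayOrm keeps ONE index per distinct partition
// value, so a query with PARTITIONS t(:partId) only scans that partition's index.
@NoSqlEntity
public class Trade {
    @NoSqlId
    private String id;          // row key, independent of the partition

    @NoSqlPartitionByThisField
    private String partId;      // partition column, e.g. an account id

    private long numShares;     // indexed column used in where clauses
}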

PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned tables won't scale but maybe you will never have that many rows).  You can join any two combinations (non-partitioned with partitioned, non-partitioned with non-partitioned, partitioned with another partition).

I only prefer stackoverflow as I like referencing links/questions with their urls.  To reference this email is very hard later on as I have to find it so in general, I HATE email lists ;) but it seems cassandra prefers them so any questions on PlayOrm you can put there and I am not sure how many on this may or may not be interested so it creates less noise on this list too.

Later,
Dean


From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Monday, September 24, 2012 11:07 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Correct model



2012/9/24 Hiller, Dean <De...@nrel.gov>>
I am confused.  In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D" so let me answer both…

I have both needs. These are the two queries I need to perform on the model.

For latter, you make ONE query into the latest partition(ONE partition) of the GlobalRequestsCF which gives you the most recent requests ALONG with the user ids of those requests.  If you queried all partitions, you would most likely blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user

Now I think I got the main idea! This answered a lot!

Sorry, I was skipping some context.  A lot of the backing indexing is sometimes done as a long row, so in playOrm too many rows in a partition means too many columns in the indexing row for that partition.  I believe the same is true in cassandra for their indexing.

Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with Aaron's model, isn't it? Can I map exactly this using playORM? The hardest thing for me to use playORM now is I don't know Cassandra well yet, and I know playORM even less. Can I ask playOrm questions in this list? I will try to create a POC here!
Only now I am starting to understand what it does ;-) The examples directory is empty for now, I would like to see how to set up the connection with it.

Cassandra spreads all your data out on all nodes with or without partitions.  A single partition does have its data co-located though.

Now I see. The main advantage of using partitions is keeping the indexes small enough. It has nothing to do with the nodes. Thanks!

If you are at 100k(and the requests are rather small), you could embed all the requests in the user or go with Aaron's below suggestion of a UserRequestsCF.  If your requests are rather large, you probably don't want to embed them in the User.  Either way, it's one query or one row key lookup.

I see it now.

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It just so happens that partitionId had to be part of your row key.

Do you mean I need to load all the keys in memory to do a multiget?

I have used Hector and now use Astyanax, I don't worry much about that layer, but I feed astyanax 3 nodes and I believe it discovers some of the other ones.  I believe the latter is true but am not 100% sure as I have not looked at that code.

Why did you move? Hector is being considered for being the "official" client for Cassandra, isn't it? I looked at the Astyanax api and it seemed much more high level though

As an analogy on the above, if you happen to have used PlayOrm, you would ONLY need one Requests table and you partition by user AND time(two views into the same data partitioned two different ways) and you can do exactly the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids in the key leaving it free to partition twice like in your case….and in a refactor, you have to map/reduce A LOT more rows because of rows having the FK of <partitionid><subrowkey> whereas if you don't have partition id in the key, you only map/reduce the partitioned table in a redesign/refactor.  That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning even though it can be a little less flexible sometimes.

I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?

Also, CQL locates all the data on one node for a partition.  We have found it can be faster "sometimes" with the parallelized disks that the partitions are NOT all on one node so PlayOrm partitions are virtual only and do not relate to where the rows are stored.  An example on our 6 nodes was a join query on a partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here since it doesn't do joins).  It really depends how much data is going to come back in the query though too?  There are tradeoff's between disk parallel nodes and having your data all on one node of course.

I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:


@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only use joins with partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
2012/9/24 Hiller, Dean <De...@nrel.gov>

> I am confused.  In this email you say you want "get all requests for a
> user" and in a previous one you said "Select all the users which has new
> requests, since date D" so let me answer both…
>

I have both needs. These are the two queries I need to perform on the model.


> For latter, you make ONE query into the latest partition(ONE partition) of
> the GlobalRequestsCF which gives you the most recent requests ALONG with
> the user ids of those requests.  If you queried all partitions, you would
> most likely blow out your JVM memory.
>
> For the former, you make ONE query to the UserRequestsCF with userid =
> <your user id> to get all the requests for that user
>

Now I think I got the main idea! This answered a lot!


> Sorry, I was skipping some context.  A lot of the backing indexing
> sometimes is done as a long row so in playOrm, too many rows in a partition
> means == too many columns in the indexing row for that partition.  I
> believe the same is true in cassandra for their indexing.
>

Oh, ok, you were talking about the wide row pattern, right? But playORM is
compatible with Aaron's model, isn't it? Can I map exactly this using
playORM? The hardest thing for me to use playORM now is I don't know
Cassandra well yet, and I know playORM even less. Can I ask playOrm
questions in this list? I will try to create a POC here!
Only now I am starting to understand what it does ;-) The examples
directory is empty for now, I would like to see how to set up the
connection with it.


> Cassandra spreads all your data out on all nodes with or without
> partitions.  A single partition does have it's data co-located though.
>

Now I see. The main advantage of using partitions is keeping the indexes
small enough. It has nothing to do with the nodes. Thanks!


> If you are at 100k(and the requests are rather small), you could embed all
> the requests in the user or go with Aaron's below suggestion of a
> UserRequestsCF.  If your requests are rather large, you probably don't want
> to embed them in the User.  Either way, it's one query or one row key
> lookup.
>

I see it now.


> Multiget ignores partitions…you feed it a LIST of keys and it gets them.
>  It just so happens that partitionId had to be part of your row key.
>

Do you mean I need to load all the keys in memory to do a multiget?


> I have used Hector and now use Astyanax, I don't worry much about that
> layer, but I feed astyanax 3 nodes and I believe it discovers some of the
> other ones.  I believe the latter is true but am not 100% sure as I have
> not looked at that code.
>

Why did you move? Hector is being considered for being the "official"
client for Cassandra, isn't it? I looked at the Astyanax api and it seemed
much more high level though


> As an analogy on the above, if you happen to have used PlayOrm, you would
> ONLY need one Requests table and you partition by user AND time(two views
> into the same data partitioned two different ways) and you can do exactly
> the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids
> in the key leaving it free to partition twice like in your case….and in a
> refactor, you have to map/reduce A LOT more rows because of rows having the
> FK of <partitionid><subrowkey> whereas if you don't have partition id in
> the key, you only map/reduce the partitioned table in a redesign/refactor.
>  That said, we will be adding support for CQL partitioning in addition to
> PlayOrm partitioning even though it can be a little less flexible sometimes.
>

I am not sure I understood this part. If I need to refactor, having the
partition id in the key would be a bad thing? What would be the
alternative? In my case, as I use userId : partitionId as row key, this
might be a problem, right?


> Also, CQL locates all the data on one node for a partition.  We have found
> it can be faster "sometimes" with the parallelized disks that the
> partitions are NOT all on one node so PlayOrm partitions are virtual only
> and do not relate to where the rows are stored.  An example on our 6 nodes
> was a join query on a partition with 1,000,000 rows took 60ms (of course I
> can't compare to CQL here since it doesn't do joins).  It really depends
> how much data is going to come back in the query though too?  There are
> tradeoff's between disk parallel nodes and having your data all on one node
> of course.


I guess I am still not ready for this level of info. :D
In the playORM readme, we have the following:

@NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId)
SELECT t FROM TABLE as t "+
"INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and
t.numShares < :shares"),

What would happen behind the scenes when I execute this query? You can only
use joins with partition keys, right?
In this case, is partId the row id of TABLE CF?


Thanks a lot for the answers

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
I am confused.  In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D" so let me answer both…

For latter, you make ONE query into the latest partition(ONE partition) of the GlobalRequestsCF which gives you the most recent requests ALONG with the user ids of those requests.  If you queried all partitions, you would most likely blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user

You mean too many rows, not a row too long, right? I am assuming each request will be a different row, not a new column. Is having billions of ROWS bad for performance in Cassandra? I know Cassandra allows up to 2 billion columns for a CF, but I am not aware of a limitation for rows…

Sorry, I was skipping some context.  A lot of the backing indexing is sometimes done as a long row, so in playOrm too many rows in a partition means too many columns in the indexing row for that partition.  I believe the same is true in cassandra for their indexing.

If I understood it correctly, if I don't specify partitions, Cassandra will store all my data in a single node?

Cassandra spreads all your data out on all nodes with or without partitions.  A single partition does have its data co-located though.

99.999% of my users will have less than 100k requests, would it make sense to partition by user?

If you are at 100k(and the requests are rather small), you could embed all the requests in the user or go with Aaron's below suggestion of a UserRequestsCF.  If your requests are rather large, you probably don't want to embed them in the User.  Either way, it's one query or one row key lookup.

That's cool! :D So if I need to query data split in 10 partitions, for instance, I can perform the query in parallel by using a multiget, right?

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It just so happens that partitionId had to be part of your row key.

Out of curiosity, if each get will occur on a different node, I would need to connect to each of the nodes? Or would I query 1 node and it would communicate to others?

I have used Hector and now use Astyanax, I don't worry much about that layer, but I feed astyanax 3 nodes and I believe it discovers some of the other ones.  I believe the latter is true but am not 100% sure as I have not looked at that code.

As an analogy on the above, if you happen to have used PlayOrm, you would ONLY need one Requests table and you partition by user AND time(two views into the same data partitioned two different ways) and you can do exactly the same thing as Aaron's example.  PlayOrm doesn't embed the partition ids in the key leaving it free to partition twice like in your case….and in a refactor, you have to map/reduce A LOT more rows because of rows having the FK of <partitionid><subrowkey> whereas if you don't have partition id in the key, you only map/reduce the partitioned table in a redesign/refactor.  That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning even though it can be a little less flexible sometimes.

Also, CQL locates all the data on one node for a partition.  We have found it can be faster "sometimes" with the parallelized disks that the partitions are NOT all on one node so PlayOrm partitions are virtual only and do not relate to where the rows are stored.  An example on our 6 nodes was a join query on a partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here since it doesn't do joins).  It really depends how much data is going to come back in the query though too?  There are tradeoff's between disk parallel nodes and having your data all on one node of course.

Later,
Dean



From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Monday, September 24, 2012 7:45 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Correct model



2012/9/23 Hiller, Dean <De...@nrel.gov>>
You need to split data among partitions or your query won't scale as more and more data is added to the table.  Having the partition means you are querying a lot fewer rows.
This will happen in case I can query just one partition. But if I need to query things in multiple partitions, wouldn't it be slower?

He means determine the ONE partition key and query that partition.  Ie. If you want just latest user requests, figure out the partition key based on which month you are in and query it.  If you want the latest independent of user, query the correct single partition for GlobalRequests CF.

But in this case, I didn't understand Aaron's model then. My first query is to get  all requests for a user. If I did partitions by time, I will need to query all partitions to get the results, right? In his answer it was said I would query ONE partition...

If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"?
He designed it so the user requests table was completely scalable so he has partitions there.  If you don't have partitions, you could run into a row that is toooo long.  You don't need to design it this way if you know none of your users are going to go into the millions as far as number of requests.  In his design then, you need to pick the correct partition and query into that partition.
You mean too many rows, not a row too long, right? I am assuming each request will be a different row, not a new column. Is having billions of ROWS bad for performance in Cassandra? I know Cassandra allows up to 2 billion columns for a CF, but I am not aware of a limitation for rows...

I really didn't understand why to use partitions.
Partitions are a way if you know your rows will go into the trillions of breaking them up so each partition has 100k rows or so or even 1 million but maxes out in the millions most likely.  Without partitions, you hit a limit in the millions.  With partitions, you can keep scaling past that as you can have as many partitions as you want.

If I understood it correctly, if I don't specify partitions, Cassandra will store all my data in a single node? I thought Cassandra would automatically distribute my data among nodes as I insert rows into a CF. Of course if I use partitions I understand I could query just one partition (node) to get the data, if I have the partition field, but to the best of my knowledge, this is not what happens in my case, right? In the first query I would have to query all the partitions...
Or are you saying partitions have nothing to do with nodes? 99.999% of my users will have less than 100k requests, would it make sense to partition by user?

A multi-get is a query that finds IN PARALLEL all the rows with the matching keys you send to cassandra.  If you do 1000 gets (instead of a multi-get) with 1ms latency, you will find it takes 1 second+processing time.  If you do ONE multi-get, you only have 1 request and therefore 1ms latency.  The other solution is you could send 1000 "asynch" gets but I have a feeling that would be slower with all the marshalling/unmarshalling of the envelope…..really depends on the envelope size like if we were using http, you would get killed doing 1000 requests instead of 1 with 1000 keys in it.
That's cool! :D So if I need to query data split in 10 partitions, for instance, I can perform the query in parallel by using a multiget, right? Out of curiosity, if each get will occur on a different node, I would need to connect to each of the nodes? Or would I query 1 node and it would communicate to others?


Later,
Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>>" <us...@cassandra.apache.org>>>
Date: Sunday, September 23, 2012 10:23 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>>" <us...@cassandra.apache.org>>>
Subject: Re: Correct model


2012/9/20 aaron morton <aa...@thelastpickle.com>>>
I would consider:

# User CF
* row_key: user_id
* columns: user properties, key=value

# UserRequests CF
* row_key: <user_id : partition_start> where partition_start is the start of a time partition that makes sense in your domain. e.g. partition monthly. Generally want to avoid rows that grow forever, as a rule of thumb avoid rows more than a few 10's of MB.
* columns: two possible approaches:
1) If the requests are immutable and you generally want all of the data store the request in a single column using JSON or similar, with the column name a timestamp.
2) Otherwise use a composite column name of <timestamp : request_property> to store the request in many columns.
* In either case consider using Reversed comparators so the most recent columns are first  see http://thelastpickle.com/2011/10/03/Reverse-Comparators/

# GlobalRequests CF
* row_key: partition_start - time partition as above. It may be easier to use the same partition scheme.
* column name: <timestamp : user_id>
* column value: empty

Ok, I think I understood your suggestion... But the only advantage in this solution is to split data among partitions? I understood how it would work, but I didn't understand why it's better than the other solution, without the GlobalRequests CF

- Select all the requests for an user
Work out the current partition client side, get the first N columns. Then page.

What do you mean here by current partition? You mean I would perform a query for each partition? If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"? I might be missing something here, but in my understanding if I use hector to query a column family I can do that and Cassandra servers will automatically communicate to each other to get the data I need, right? Is it bad? I really didn't understand why to use partitions.


- Select all the users which has new requests, since date D
Work out the current partition client side, get the first N columns from GlobalRequests, make a multi get call to UserRequests
NOTE: Assuming the size of the global requests space is not huge.
Hope that helps.
 For sure it is helping a lot. However, I don't know what is a multiget... I saw the hector api reference and found this method, but not sure about what Cassandra would do internally if I do a multiget... Is this expensive in terms of performance and latency?

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
2012/9/23 Hiller, Dean <De...@nrel.gov>

> You need to split data among partitions or your query won't scale as more
> and more data is added to table.  Having the partition means you are
> querying a lot less rows.
>
This will happen in case I can query just one partition. But if I need to
query things in multiple partitions, wouldn't it be slower?


> He means determine the ONE partition key and query that partition.  Ie. If
> you want just latest user requests, figure out the partition key based on
> which month you are in and query it.  If you want the latest independent of
> user, query the correct single partition for GlobalRequests CF.
>

But in this case, I didn't understand Aaron's model then. My first query is
to get  all requests for a user. If I did partitions by time, I will need
to query all partitions to get the results, right? In his answer it was
said I would query ONE partition...


> If I want all the requests for the user, couldn't I just select all
> UserRequest records which start with "userId"?
> He designed it so the user requests table was completely scalable so he
> has partitions there.  If you don't have partitions, you could run into a
> row that is toooo long.  You don't need to design it this way if you know
> none of your users are going to go into the millions as far as number of
> requests.  In his design then, you need to pick the correct partition and
> query into that partition.
>
You mean too many rows, not a row too long, right? I am assuming each
request will be a different row, not a new column. Is having billions of
ROWS bad for performance in Cassandra? I know Cassandra allows up to
2 billion columns for a CF, but I am not aware of a limitation for rows...


> I really didn't understand why to use partitions.
> Partitions are a way if you know your rows will go into the trillions of
> breaking them up so each partition has 100k rows or so or even 1 million
> but maxes out in the millions most likely.  Without partitions, you hit a
> limit in the millions.  With partitions, you can keep scaling past that as
> you can have as many partitions as you want.
>

If I understood it correctly, if I don't specify partitions, Cassandra will
store all my data in a single node? I thought Cassandra would automatically
distribute my data among nodes as I insert rows into a CF. Of course if I
use partitions I understand I could query just one partition (node) to get
the data, if I have the partition field, but to the best of my knowledge,
this is not what happens in my case, right? In the first query I would have
to query all the partitions...
Or are you saying partitions have nothing to do with nodes? 99.999% of
my users will have less than 100k requests, would it make sense to
partition by user?


> A multi-get is a query that finds IN PARALLEL all the rows with the
> matching keys you send to cassandra.  If you do 1000 gets(instead of a
> multi-get) with 1ms latency, you will find, it takes 1 second+processing
> time.  If you do ONE multi-get, you only have 1 request and therefore 1ms
> latency.  The other solution is you could send 1000 "asycnh" gets but I
> have a feeling that would be slower with all the marshalling/unmarshalling
> of the envelope…..really depends on the envelope size like if we were using
> http, you would get killed doing 1000 requests instead of 1 with 1000 keys
> in it.
>
That's cool! :D So if I need to query data split in 10 partitions, for
instance, I can perform the query in parallel by using a multiget, right?
Out of curiosity, if each get will occur on a different node, I would need
to connect to each of the nodes? Or would I query 1 node and it would
communicate to others?


>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com<mailto:
> mvallebr@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Subject: Re: Correct model
>
>
> 2012/9/20 aaron morton <aaron@thelastpickle.com<mailto:
> aaron@thelastpickle.com>>
> I would consider:
>
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
>
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start
> of a time partition that makes sense in your domain. e.g. partition
> monthly. Generally want to avoid rows that grow forever, as a rule of thumb
> avoid rows more than a few 10's of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data
> store the request in a single column using JSON or similar, with the column
> name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property>
> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent
> columns are first  see
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to
> use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
>
> Ok, I think I understood your suggestion... But the only advantage in this
> solution is to split data among partitions? I understood how it would work,
> but I didn't understand why it's better than the other solution, without
> the GlobalRequests CF
>
> - Select all the requests for an user
> Work out the current partition client side, get the first N columns. Then
> page.
>
> What do you mean here by current partition? You mean I would perform a
> query for each particition? If I want all the requests for the user,
> couldn't I just select all UserRequest records which start with "userId"? I
> might be missing something here, but in my understanding if I use hector to
> query a column familly I can do that and Cassandra servers will
> automatically communicate to each other to get the data I need, right? Is
> it bad? I really didn't understand why to use partitions.
>
>
> - Select all the users which has new requests, since date D
> Work out the current partition client side, get the first N columns from
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
>  For sure it is helping a lot. However, I don't know what is a multiget...
> I saw the hector api reference and found this method, but not sure about
> what Cassandra would do internally if I do a multiget... Is this expensive
> in terms of performance and latency?
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by aaron morton <aa...@thelastpickle.com>.
Yup.

(Multi get is just a convenience method, it explodes into multiple gets on the server side. )

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2012, at 5:01 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

> But the only advantage in this solution is to split data among partitions?
> 
> You need to split data among partitions or your query won't scale as more and more data is added to table.  Having the partition means you are querying a lot less rows.
> 
> What do you mean here by current partition?
> 
> He means determine the ONE partition key and query that partition.  Ie. If you want just latest user requests, figure out the partition key based on which month you are in and query it.  If you want the latest independent of user, query the correct single partition for GlobalRequests CF.
> 
> If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"?
> 
> He designed it so the user requests table was completely scalable so he has partitions there.  If you don't have partitions, you could run into a row that is toooo long.  You don't need to design it this way if you know none of your users are going to go into the millions as far as number of requests.  In his design then, you need to pick the correct partition and query into that partition.
> 
> I really didn't understand why to use partitions.
> 
> Partitions are a way if you know your rows will go into the trillions of breaking them up so each partition has 100k rows or so or even 1 million but maxes out in the millions most likely.  Without partitions, you hit a limit in the millions.  With partitions, you can keep scaling past that as you can have as many partitions as you want.
> 
> A multi-get is a query that finds IN PARALLEL all the rows with the matching keys you send to cassandra.  If you do 1000 gets (instead of a multi-get) with 1ms latency, you will find it takes 1 second+processing time.  If you do ONE multi-get, you only have 1 request and therefore 1ms latency.  The other solution is you could send 1000 "asynch" gets but I have a feeling that would be slower with all the marshalling/unmarshalling of the envelope…..really depends on the envelope size like if we were using http, you would get killed doing 1000 requests instead of 1 with 1000 keys in it.
> 
> Later,
> Dean
> 
> From: Marcelo Elias Del Valle <mv...@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Date: Sunday, September 23, 2012 10:23 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
> Subject: Re: Correct model
> 
> 
> 2012/9/20 aaron morton <aa...@thelastpickle.com>>
> I would consider:
> 
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
> 
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start of a time partition that makes sense in your domain. e.g. partition monthly. Generally want to avoid rows that grow forever, as a rule of thumb avoid rows more than a few 10's of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data store the request in a single column using JSON or similar, with the column name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent columns are first  see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
> 
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
> 
> Ok, I think I understood your suggestion... But the only advantage in this solution is to split data among partitions? I understood how it would work, but I didn't understand why it's better than the other solution, without the GlobalRequests CF
> 
> - Select all the requests for an user
> Work out the current partition client side, get the first N columns. Then page.
> 
> What do you mean here by current partition? You mean I would perform a query for each particition? If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"? I might be missing something here, but in my understanding if I use hector to query a column familly I can do that and Cassandra servers will automatically communicate to each other to get the data I need, right? Is it bad? I really didn't understand why to use partitions.
> 
> 
> - Select all the users which has new requests, since date D
> Work out the current partition client side, get the first N columns from GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
> For sure it is helping a lot. However, I don't know what is a multiget... I saw the hector api reference and found this method, but not sure about what Cassandra would do internally if I do a multiget... Is this expensive in terms of performance and latency?
> 
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr


found major difference in CQL vs Scalable SQL(PlayOrm) and question

Posted by "Hiller, Dean" <De...@nrel.gov>.
I have been digging more and more into CQL vs. PlayOrm S-SQL and found a
major difference that is quite interesting (thought you might be interested,
plus I have a question).  CQL uses a composite row key with the prefix so
now any other tables that want to reference that entity have references to
that row "with" the partition key embedded in the row key.

Scalable-SQL does a similar form of partitioning but
1. You can partition a table 2 ways not just one way (ie. By account and
by user perhaps) for queries into either types of partition
2. If you decide to repartition your data a different way, with S-SQL you
don't have to map/reduce all those rows with foreign keys.  In CQL, you
have to map/reduce the partitioned table AND all the rows referencing all
those rows since the partition key is basically embedded everywhere.

I found that quite interesting, but that said, I need to add support for
PlayOrm on top of partitioned CF's so we are compatible with CQL as well.

1. Is there any meta information I can grab from the meta model on this?
2. Also, how can I query the indexes without involving CQL at all such
that I can translate the playOrm Scalable-SQL to re-use the existing
indices?  (ie. Is there an index column family and how to form the row key
to access the index?)

Thanks,
Dean


Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
But the only advantage in this solution is to split data among partitions?

You need to split data among partitions or your query won't scale as more and more data is added to the table.  Having the partition means you are querying a lot fewer rows.

What do you mean here by current partition?

He means determine the ONE partition key and query that partition.  Ie. If you want just latest user requests, figure out the partition key based on which month you are in and query it.  If you want the latest independent of user, query the correct single partition for GlobalRequests CF.

If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"?

He designed it so the user requests table was completely scalable so he has partitions there.  If you don't have partitions, you could run into a row that is toooo long.  You don't need to design it this way if you know none of your users are going to go into the millions as far as number of requests.  In his design then, you need to pick the correct partition and query into that partition.

I really didn't understand why to use partitions.

Partitions are a way if you know your rows will go into the trillions of breaking them up so each partition has 100k rows or so or even 1 million but maxes out in the millions most likely.  Without partitions, you hit a limit in the millions.  With partitions, you can keep scaling past that as you can have as many partitions as you want.

A multi-get is a query that finds IN PARALLEL all the rows with the matching keys you send to cassandra.  If you do 1000 gets (instead of a multi-get) with 1ms latency, you will find it takes 1 second+processing time.  If you do ONE multi-get, you only have 1 request and therefore 1ms latency.  The other solution is you could send 1000 "asynch" gets but I have a feeling that would be slower with all the marshalling/unmarshalling of the envelope…..really depends on the envelope size like if we were using http, you would get killed doing 1000 requests instead of 1 with 1000 keys in it.
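For example, with Hector a multiget over a handful of UserRequests rows might look roughly like this sketch (the keyspace name, row key layout and column count are illustrative; only the general Hector calls are intended to be real):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;

public class MultigetExample {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster); // keyspace name assumed

        // ONE round trip that fetches a column slice from every listed row in parallel.
        MultigetSliceQuery<String, String, String> q = HFactory.createMultigetSliceQuery(
                ks, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
        q.setColumnFamily("UserRequests");
        q.setKeys("user1:2012-09", "user2:2012-09");  // <user_id : partition_start> row keys
        q.setRange("", "", false, 100);               // first 100 columns of each row

        Rows<String, String, String> rows = q.execute().get();
        System.out.println("fetched " + rows.getCount() + " rows in one request");
    }
}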

Later,
Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Sunday, September 23, 2012 10:23 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Correct model


2012/9/20 aaron morton <aa...@thelastpickle.com>>
I would consider:

# User CF
* row_key: user_id
* columns: user properties, key=value

# UserRequests CF
* row_key: <user_id : partition_start> where partition_start is the start of a time partition that makes sense in your domain. e.g. partition monthly. Generally want to avoid rows that grow forever, as a rule of thumb avoid rows more than a few 10's of MB.
* columns: two possible approaches:
1) If the requests are immutable and you generally want all of the data store the request in a single column using JSON or similar, with the column name a timestamp.
2) Otherwise use a composite column name of <timestamp : request_property> to store the request in many columns.
* In either case consider using Reversed comparators so the most recent columns are first  see http://thelastpickle.com/2011/10/03/Reverse-Comparators/

# GlobalRequests CF
* row_key: partition_start - time partition as above. It may be easier to use the same partition scheme.
* column name: <timestamp : user_id>
* column value: empty

Ok, I think I understood your suggestion... But the only advantage in this solution is to split data among partitions? I understood how it would work, but I didn't understand why it's better than the other solution, without the GlobalRequests CF

- Select all the requests for an user
Work out the current partition client side, get the first N columns. Then page.

What do you mean here by current partition? You mean I would perform a query for each partition? If I want all the requests for the user, couldn't I just select all UserRequest records which start with "userId"? I might be missing something here, but in my understanding if I use hector to query a column family I can do that and Cassandra servers will automatically communicate to each other to get the data I need, right? Is it bad? I really didn't understand why to use partitions.


- Select all the users which has new requests, since date D
Work out the current partition client side, get the first N columns from GlobalRequests, make a multi get call to UserRequests
NOTE: Assuming the size of the global requests space is not huge.
Hope that helps.
 For sure it is helping a lot. However, I don't know what is a multiget... I saw the hector api reference and found this method, but not sure about what Cassandra would do internally if I do a multiget... Is this expensive in terms of performance and latency?

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
2012/9/20 aaron morton <aa...@thelastpickle.com>

> I would consider:
>
> # User CF
> * row_key: user_id
> * columns: user properties, key=value
>
> # UserRequests CF
> * row_key: <user_id : partition_start> where partition_start is the start
> of a time partition that makes sense in your domain. e.g. partition
> monthly. Generally want to avoid rows the grow forever, as a rule of thumb
> avoid rows more than a few 10's of MB.
> * columns: two possible approaches:
> 1) If the requests are immutable and you generally want all of the data
> store the request in a single column using JSON or similar, with the column
> name a timestamp.
> 2) Otherwise use a composite column name of <timestamp : request_property>
> to store the request in many columns.
> * In either case consider using Reversed comparators so the most recent
> columns are first  see
> http://thelastpickle.com/2011/10/03/Reverse-Comparators/
>
> # GlobalRequests CF
> * row_key: partition_start - time partition as above. It may be easier to
> use the same partition scheme.
> * column name: <timestamp : user_id>
> * column value: empty
>

Ok, I think I understood your suggestion... But the only advantage in this
solution is to split data among partitions? I understood how it would work,
but I didn't understand why it's better than the other solution, without
the GlobalRequests CF


> - Select all the requests for an user
>
> Work out the current partition client side, get the first N columns. Then
> page.
>

What do you mean here by current partition? You mean I would perform a
query for each particition? If I want all the requests for the user,
couldn't I just select all UserRequest records which start with "userId"? I
might be missing something here, but in my understanding if I use hector to
query a column familly I can do that and Cassandra servers will
automatically communicate to each other to get the data I need, right? Is
it bad? I really didn't understand why to use partitions.



> - Select all the users which has new requests, since date D
>
> Work out the current partition client side, get the first N columns from
> GlobalRequests, make a multi get call to UserRequests
> NOTE: Assuming the size of the global requests space is not huge.
> Hope that helps.
>
 For sure it is helping a lot. However, I don't know what is a multiget...
I saw the hector api reference and found this method, but not sure about
what Cassandra would do internally if I do a multiget... Is this expensive
in terms of performance and latency?

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by aaron morton <aa...@thelastpickle.com>.
> I created the following model: an UserCF, whose key is a userID generated by TimeUUID, and a RequestCF, whose key is composite: UserUUID + timestamp. For each user, I will store basic data and, for each request, I will insert a lot of columns.

I would consider:

# User CF
* row_key: user_id
* columns: user properties, key=value

# UserRequests CF
* row_key: <user_id : partition_start> where partition_start is the start of a time partition that makes sense in your domain. e.g. partition monthly. Generally want to avoid rows that grow forever, as a rule of thumb avoid rows more than a few 10's of MB.
* columns: two possible approaches:
	1) If the requests are immutable and you generally want all of the data store the request in a single column using JSON or similar, with the column name a timestamp. 
	2) Otherwise use a composite column name of <timestamp : request_property> to store the request in many columns. 
	* In either case consider using Reversed comparators so the most recent columns are first  see http://thelastpickle.com/2011/10/03/Reverse-Comparators/

# GlobalRequests CF
	* row_key: partition_start - time partition as above. It may be easier to use the same partition scheme. 
	* column name: <timestamp : user_id>
	* column value: empty 

> - Select all the requests for an user

Work out the current partition client side, get the first N columns. Then page. 
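To make "work out the current partition client side" concrete, here is a small sketch that derives a monthly partition_start and the two row keys from the model above; the plain string layout of the keys is a simplification (a real schema would typically use composite key/column types):

import java.time.YearMonth;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class PartitionKeys {

    // partition_start for a monthly partition, e.g. "2012-09"
    static String partitionStart(ZonedDateTime when) {
        return YearMonth.from(when).toString();
    }

    // UserRequests row key: <user_id : partition_start>
    static String userRequestsKey(String userId, ZonedDateTime when) {
        return userId + ":" + partitionStart(when);
    }

    // GlobalRequests row key: just partition_start
    static String globalRequestsKey(ZonedDateTime when) {
        return partitionStart(when);
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        System.out.println(userRequestsKey("some-user-uuid", now)); // e.g. some-user-uuid:2012-09
        System.out.println(globalRequestsKey(now));                 // e.g. 2012-09
    }
}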

> - Select all the users which has new requests, since date D
Work out the current partition client side, get the first N columns from GlobalRequests, make a multi get call to UserRequests

NOTE: Assuming the size of the global requests space is not huge.

Hope that helps. 
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/09/2012, at 11:19 AM, Marcelo Elias Del Valle <mv...@gmail.com> wrote:

> In your first email, you get a request and seem to shove it and a user in
> generating the ids which means that user never generates a request ever
> again???  If a user sends multiple requests in, how are you looking up his
> TimeUUID row key from your first email(I would do the same in my
> implementation)?
> 
> Actually, I don't get it from Cassandra. I am using Cassandra for the writes, but to find the userId I look on a pre-indexed structure, because I think the reads would be faster this way. I need to find the userId by some key fields, so I use an index like this:
> 
> user ID 5596 -> { name -> "john denver", phone -> "5555 5555", field3 -> "field 3 data"...., field 10 -> "field 10 data"}
>    
> The values are just examples. This part is not implemented yet and I am looking for alternatives. Currently we have some similar indexes in SOLR, but we are thinking in keeping the index in memory and replicating manually in the cluster, or using Voldemort, etc. 
> I might be wrong, but I think Cassandra is great for writes, but a solution like this would be better for reads.
> 
>  
> If you had an ldap unique username, I would just use that as the primary
> key meaning you NEVER have to do reads.  If you have a username and need
> to lookup a UUID, you would have to do that in both implementations…not a
> real big deal though…a quick lookup table does the trick there and
> in most cases is still fast enough(ie. Read before write here is ok in a
> lot of cases).
> 
> That X-ref table would simple be rowkey=username and value=users real
> primary key
> 
> Though again, we use ldap and know no one's username is really going to
> change so username is our primary key.
> 
> In my case, a single user can have thousands of requests. In my userCF, I will have just 1 user with uuid X, but I am not sure about what to have in my requestCF.
>  
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr


Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
>
> In your first email, you get a request and seem to shove it and a user in
> generating the ids which means that user never generates a request ever
> again???  If a user sends multiple requests in, how are you looking up his
> TimeUUID row key from your first email(I would do the same in my
> implementation)?
>

Actually, I don't get it from Cassandra. I am using Cassandra for the
writes, but to find the userId I look on a pre-indexed structure, because I
think the reads would be faster this way. I need to find the userId by some
key fields, so I use an index like this:

user ID 5596 -> { name -> "john denver", phone -> "5555 5555", field3 ->
"field 3 data"...., field 10 -> "field 10 data"}

The values are just examples. This part is not implemented yet and I am
looking for alternatives. Currently we have some similar indexes in SOLR,
but we are thinking in keeping the index in memory and replicating manually
in the cluster, or using Voldemort, etc.
I might be wrong, but I think Cassandra is great for writes, but a solution
like this would be better for reads.



> If you had an ldap unique username, I would just use that as the primary
> key meaning you NEVER have to do reads.  If you have a username and need
> to lookup a UUID, you would have to do that in both implementations…not a
> real big deal though…a quick lookup table does the trick there and
> in most cases is still fast enough(ie. Read before write here is ok in a
> lot of cases).
>
> That X-ref table would simple be rowkey=username and value=users real
> primary key
>
> Though again, we use ldap and know no one's username is really going to
> change so username is our primary key.
>

In my case, a single user can have thousands of requests. In my userCF, I
will have just 1 user with uuid X, but I am not sure about what to have in
my requestCF.

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Oh, quick correction, I was thinking your user row key was in the request
coming in from your first email.

In your first email, you get a request and seem to shove it and a user in
generating the ids which means that user never generates a request ever
again???  If a user sends multiple requests in, how are you looking up his
TimeUUID row key from your first email(I would do the same in my
implementation)?

If you had an ldap unique username, I would just use that as the primary
key meaning you NEVER have to do reads.  If you have a username and need
to lookup a UUID, you would have to do that in both implementations…not a
real big deal though…a quick lookup table does the trick there and
in most cases is still fast enough(ie. Read before write here is ok in a
lot of cases).

That X-ref table would simply be rowkey=username and value=user's real
primary key
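A minimal Hector sketch of such an X-ref row; the "UserByName" CF and the "user_id" column name are made up for illustration:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class UserLookup {

    // Write rowkey=username -> column "user_id"=<real primary key> so a later
    // read can resolve a username to the row key of the User CF.
    static void index(Keyspace ks, String username, String userId) {
        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        m.insert(username, "UserByName",
                 HFactory.createStringColumn("user_id", userId));
    }
}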

Though again, we use ldap and know no one's username is really going to
change so username is our primary key.

Later,
Dean


On 9/19/12 2:33 PM, "Hiller, Dean" <De...@nrel.gov> wrote:

>Uhm, unless I am mistaken, a NEW request implies a new UUID so you can
>just write it to both the index to the request row and to the user that
>request was for all in one shot with no need to read, right?
>
>(Also, read before write is not necessarily badŠit really depends on your
>situation but in this case, I don't think you need read before write).
>
>For your structured data comment….
>Actually playOrm stores structured and unstructured data.  It follows the
>pattern cassandra is adopting more and more of "partial" schemas and
>plans to hold to that path.  It is a complete break from JPA due to noSQL
>being so different.
>
>and each request would have its own id, right
>
>Yes, in my design, I choose each request with it's own id.
>
>Wouldn't it be faster to have a composite key in the requestCF itself?
>
>In CQL, don't you have to have an == in the first part of the clause
>meaning you would have to select the user id, BUT you wanted requests >
>date no matter which user so the indices I gave you have that information
>with a simple column slice of the data.  The indices I gave you look like
>this(composite column names)…. <time1>.<req1>.<user1>,
><time2>.<req2>.<user1>, <time3>.<req3>.<user2>  NOTE that each is a UUID
>there in the <> so are unique.
>
>Maybe there is a way, but I am not sure on how to get all the latest
>request > data for every user….I guess you could always map/reduce but
>that is generally reserved for analytics or maybe updating new index
>tables you are creating for reading faster.
>
>Later,
>Dean
>
>From: Marcelo Elias Del Valle
><mv...@gmail.com>>
>Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
><us...@cassandra.apache.org>>
>Date: Wednesday, September 19, 2012 1:47 PM
>To: "user@cassandra.apache.org<ma...@cassandra.apache.org>"
><us...@cassandra.apache.org>>
>Subject: Re: Correct model
>
>2012/9/19 Hiller, Dean <De...@nrel.gov>>
>Thinking out loud and I think a bit towards playOrm's model though you
>don't need to use playOrm for this.
>
>1. I would probably have a User with the requests either embedded in or
>the Foreign keys to the requests…either is fine as long as when you get the
>user you get ALL FKs and make one request to get the requests for that user
>
>This was my first option. However, every time I have a new request I would
>need to read the column "request_ids", update its value, and then write
>the result. This would be a read-before-write, which is bad in Cassandra,
>right? Or you were talking about other kinds of FKs?
>
>2. I would create rows for index and index each month of data OR maybe
>index each day of data(depends on your system).  Then, I can just query
>into the index for that one month.  With playOrm S-SQL, this is a simple
>PARTITIONS r(:thismonthParititonId) SELECT r FROM Request r where r.date
>> :date OR you just do a column range query doing the same thing into
>>your index.  The index is basically the wide row pattern ;) with
>>composite keys of <date>.<rowkey of request>
>
>I would consider playOrm in a later step in my project, as my
>understanding now is it is good to store relational data, structured
>data. I cannot predict which columns I am going to store in requestCF.
>But regardless, even in Cassandra, you would still use a composite key,
>but it seems you would create an indexCf using the wide row pattern, and
>each request would have its own id, right? But why? Wouldn't it be faster
>to have a composite key in the requestCF itself?
>
>
>From: Marcelo Elias Del Valle
><mv...@gmail.com><mailto:mvallebr@gmail.com<m
>ailto:mvallebr@gmail.com>>>
>Reply-To: 
>"user@cassandra.apache.org<ma...@cassandra.apache.org><mailto:user@c
>assandra.apache.org<ma...@cassandra.apache.org>>"
><us...@cassandra.apache.org><mailto:user@c
>assandra.apache.org<ma...@cassandra.apache.org>>>
>Date: Wednesday, September 19, 2012 1:02 PM
>To: 
>"user@cassandra.apache.org<ma...@cassandra.apache.org><mailto:user@c
>assandra.apache.org<ma...@cassandra.apache.org>>"
><us...@cassandra.apache.org><mailto:user@c
>assandra.apache.org<ma...@cassandra.apache.org>>>
>Subject: Correct model
>
>I am new to Cassandra and NoSQL at all.
>I built my first model and any comments would be of great help. I am
>describing my thoughts bellow.
>
>It's a very simple model. I will need to store several users and, for
>each user, I will need to store several requests. It request has it's
>insertion time. As the query comes first, here are the only queries I
>will need to run against this model:
>- Select all the requests for an user
>- Select all the users which has new requests, since date D
>
>I created the following model: an UserCF, whose key is a userID generated
>by TimeUUID, and a RequestCF, whose key is composite: UserUUID +
>timestamp. For each user, I will store basic data and, for each request,
>I will insert a lot of columns.
>
>My questions:
>- Is the strategy of using a composite key good for this case? I thought
>in other solutions, but this one seemed to be the best. Another solution
>would be have a non-composite key of type UUID for the requests, and have
>another CF to relate user and request.
>- To perform the second query, instead of selecting if each user has a
>request inserted after date D, I thought in storing the last request
>insertion date into the userCF, everytime I have a new insert for the
>user. It would be a data replication, but I would have no
>read-before-write and I am guessing the second query would perform faster.
>
>Any thoughts?
>
>--
>Marcelo Elias Del Valle
>http://mvalle.com - @mvallebr
>
>
>
>--
>Marcelo Elias Del Valle
>http://mvalle.com - @mvallebr


Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Uhm, unless I am mistaken, a NEW request implies a new UUID, so you can just write it to both the index row for the request and to the user that request was for, all in one shot with no need to read, right?
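As a sketch of that "one shot" write, a single Hector mutation batch could add the new request to a time-index row and to the user's row without any prior read; the CF names and the dot-separated column names here are illustrative, not the poster's actual schema:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class WriteRequest {

    // One batch, no read-before-write: the new request id lands in the monthly
    // index row and in the user's own row in a single execute().
    static void insertRequest(Keyspace ks, String monthPartition,
                              String userId, String requestId, String timestamp) {
        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        m.addInsertion(monthPartition, "RequestIndex",
                HFactory.createStringColumn(timestamp + "." + requestId + "." + userId, ""));
        m.addInsertion(userId, "UserRequests",
                HFactory.createStringColumn(timestamp + "." + requestId, ""));
        m.execute();
    }
}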

(Also, read before write is not necessarily bad…it really depends on your situation but in this case, I don't think you need read before write).

For your structured data comment….
Actually playOrm stores structured and unstructured data.  It follows the pattern cassandra is adopting more and more of "partial" schemas and plans to hold to that path.  It is a complete break from JPA due to noSQL being so different.

and each request would have its own id, right

Yes, in my design, I chose to give each request its own id.

Wouldn't it be faster to have a composite key in the requestCF itself?

In CQL, don't you have to have an == on the first part of the clause, meaning you would have to select the user id? But you wanted requests > date no matter which user, so the indices I gave you have that information with a simple column slice of the data. The indices I gave you look like this (composite column names)… <time1>.<req1>.<user1>, <time2>.<req2>.<user1>, <time3>.<req3>.<user2>  NOTE that each is a UUID there in the <>, so they are unique.
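A minimal sketch of such an index in CQL3 terms, with invented names: the clustering key reproduces the <time>.<req> composite column name, and the user id rides along as the value:

    CREATE TABLE requests_by_time (
        bucket     text,       -- in practice one row per month or day, as suggested elsewhere in the thread
        req_time   timestamp,  -- the <timeN> part of the composite column name
        request_id timeuuid,   -- the <reqN> part
        user_id    timeuuid,   -- the <userN> part
        PRIMARY KEY (bucket, req_time, request_id)
    );

    -- "requests newer than D, no matter which user" is a single column slice on the row:
    SELECT req_time, request_id, user_id
      FROM requests_by_time
     WHERE bucket = :bucket AND req_time > :date;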

Maybe there is a way, but I am not sure how to get all the latest requests > date for every user… I guess you could always map/reduce, but that is generally reserved for analytics, or maybe for updating new index tables you are creating for faster reads.

Later,
Dean


Re: Correct model

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.
2012/9/19 Hiller, Dean <De...@nrel.gov>

> Thinking out loud, and I think a bit towards playOrm's model, though you
> don't need to use playOrm for this.
>
> 1. I would probably have a User with the requests either embedded in it or
> the Foreign keys to the requests… either is fine, as long as when you get the
> user you get ALL FKs and make one request to get the requests for that user.
>

This was my first option. However, every time I have a new request I would
need to read the column "request_ids", update its value, and then write the
result. This would be a read-before-write, which is bad in Cassandra,
right? Or were you talking about other kinds of FKs?
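The reply above points at sidestepping that read-modify-write entirely; a sketch with made-up names: model the user's requests as one clustering row per request instead of a single packed "request_ids" column, so recording a request is a pure insert:

    CREATE TABLE user_requests (
        user_id    timeuuid,
        request_id timeuuid,
        PRIMARY KEY (user_id, request_id)   -- one column per request inside the user's row
    );

    -- adding a request never reads the existing list
    INSERT INTO user_requests (user_id, request_id) VALUES (:userId, :requestId);

    -- "all requests for a user" is still a single query
    SELECT request_id FROM user_requests WHERE user_id = :userId;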


> 2. I would create index rows and index each month of data, OR maybe
> index each day of data (depends on your system).  Then I can just query
> into the index for that one month.  With playOrm S-SQL, this is a simple
> PARTITIONS r(:thismonthParititonId) SELECT r FROM Request r where r.date >
> :date, OR you just do a column range query doing the same thing into your
> index.  The index is basically the wide row pattern ;) with composite keys
> of <date>.<rowkey of request>
>

I would consider playOrm in a later step in my project, as my understanding
now is that it is good for storing relational, structured data, and I cannot
predict which columns I am going to store in requestCF. But regardless,
even in plain Cassandra, you would still use a composite key; it seems you
would create an indexCF using the wide row pattern, and each request would
have its own id, right? But why? Wouldn't it be faster to have a composite
key in the requestCF itself?
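For comparison, a sketch of that composite-key RequestCF (names invented here): it serves query 1 directly, but query 2 still needs a time-ordered row somewhere, because this table's partition key (the user) has to be fixed before any range on the timestamp:

    CREATE TABLE requests (
        user_id    timeuuid,
        req_time   timestamp,
        request_id timeuuid,
        payload    text,        -- stand-in for the request's many, unpredictable columns
        PRIMARY KEY (user_id, req_time, request_id)
    );

    -- query 1: all requests for one user
    SELECT * FROM requests WHERE user_id = :userId;

    -- query 2 ("new requests since D, across all users") cannot be answered from this
    -- table without naming a user_id first, which is what the separate index row is for.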





-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Correct model

Posted by "Hiller, Dean" <De...@nrel.gov>.
Thinking out loud, and I think a bit towards playOrm's model, though you don't need to use playOrm for this.

1. I would probably have a User with the requests either embedded in it or the Foreign keys to the requests… either is fine, as long as when you get the user you get ALL FKs and make one request to get the requests for that user.

2. I would create index rows and index each month of data, OR maybe index each day of data (depends on your system).  Then I can just query into the index for that one month.  With playOrm S-SQL, this is a simple PARTITIONS r(:thismonthParititonId) SELECT r FROM Request r where r.date > :date, OR you just do a column range query doing the same thing into your index.  The index is basically the wide row pattern ;) with composite keys of <date>.<rowkey of request>
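A sketch of the write and read side of that monthly index (illustrative names; the same idea works per day): every new request also drops one entry into the current month's index row, and the "since date D" query only touches that one row:

    CREATE TABLE requests_by_month (
        month      text,        -- e.g. '2012-09', one index row per month
        req_time   timestamp,
        request_id timeuuid,
        user_id    timeuuid,
        PRIMARY KEY (month, req_time, request_id)   -- composite = <date>.<rowkey of request>
    );

    -- written together with each new request
    INSERT INTO requests_by_month (month, req_time, request_id, user_id)
         VALUES ('2012-09', :now, :requestId, :userId);

    -- column range query into just this month's row
    SELECT req_time, request_id, user_id
      FROM requests_by_month
     WHERE month = '2012-09' AND req_time > :date;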

Later,
Dean
