Posted to user@cassandra.apache.org by Ram N <yr...@gmail.com> on 2014/09/14 00:49:34 UTC

C 2.1

Team,

I am pretty new to Cassandra (just 2 weeks of playing around with it on
and off) and am planning a fresh deployment with the 2.1 release. The
data model is pretty simple for my use case. The questions I have in mind:

Is 2.1 a production-ready release?
Driver selection?
    I played around with Hector, Astyanax and the Java driver.
     I don't see much activity happening on Hector.
     For Astyanax - Love the fluent style of writing code and the
abstractions, recipes, pooling etc.
     Datastax Java driver - I get too confused with CQL and the underlying
storage model. I am also not clear on the indexing structure of columns.
Do CQL indexes create a separate CF for the index table? How is that
different from maintaining an inverted index? Are both the same internally?
Does the CQL statement that creates an index create a separate CF, with an
atomic way of updating/managing it? Which one scales better (something like
stargate-core or the ones done by Usergrid, or the CQL approach)?

On a separate note, I'm curious: if I have 1000s of columns in a given row
and a fixed set of indexed columns (say 30 - 50), which approach should I
be taking? Will Cassandra scale with this many indexed columns? Are there
any limits? How much of an impact do CQL indexes have on the system? I am
also not sure whether these use cases are the right fit for Cassandra, but
I would really appreciate any response on these. Thanks.

-R

Re: C 2.1

Posted by James Briggs <ja...@yahoo.com>.

Ram,

Secondary indexes are not recommended because a query on an indexed
column can't use the partition key, so the values have to be fetched from
all nodes. That means higher latency and, likely, timeouts.
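
To make the fan-out concrete, here is a toy Python model, not Cassandra
code. It assumes a hypothetical 6-node cluster, a made-up users table
keyed by user_id, and a per-node local index on city; NODE_COUNT, node_for
and ToyCluster are illustrative names only. A lookup by partition key
touches one node, while a lookup by the indexed column has to ask every
node.

```python
import hashlib

NODE_COUNT = 6  # hypothetical cluster size, chosen only for illustration


def node_for(partition_key):
    """Toy partitioner: hash the partition key to pick the owning node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODE_COUNT


class ToyCluster:
    """Each node stores its own rows plus a local index over those rows only."""

    def __init__(self):
        self.rows = [{} for _ in range(NODE_COUNT)]
        self.city_index = [{} for _ in range(NODE_COUNT)]

    def insert(self, user_id, city):
        node = node_for(user_id)
        self.rows[node][user_id] = {"user_id": user_id, "city": city}
        self.city_index[node].setdefault(city, set()).add(user_id)

    def get_by_user_id(self, user_id):
        """Partition-key lookup: the hash says which single node to ask."""
        node = node_for(user_id)
        return self.rows[node].get(user_id), 1

    def get_by_city(self, city):
        """Indexed-column lookup: no key to hash, so every node is asked."""
        matches, contacted = [], 0
        for node in range(NODE_COUNT):
            contacted += 1
            for user_id in sorted(self.city_index[node].get(city, ())):
                matches.append(self.rows[node][user_id])
        return matches, contacted


cluster = ToyCluster()
for i in range(100):
    cluster.insert("user%d" % i, "San Jose" if i % 10 == 0 else "Elsewhere")

row, nodes_for_key = cluster.get_by_user_id("user42")
rows, nodes_for_index = cluster.get_by_city("San Jose")
print(nodes_for_key, nodes_for_index, len(rows))  # prints 1 6 10
```

The model leaves out replication, consistency levels and timeouts; it
only counts how many nodes each kind of query has to involve.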

The C* solutions are:

a) use a denormalized ("materialized") table

b) use a clustered index if all the data related to the row key is
in the same partition (read my blog link from this thread for more)
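
Option (b) could look roughly like the toy Python sketch below, reading
"clustered index" as clustering columns (that reading is an assumption).
It is not driver code: Partition, the sensor ids and the timestamps are
made up. Rows inside one partition stay sorted by the clustering key, so
a range query is a slice of a single partition.

```python
import bisect


class Partition:
    """Toy partition: rows stay sorted by their clustering key."""

    def __init__(self):
        self._keys = []   # clustering keys, kept sorted
        self._rows = []   # rows, in the same order as _keys

    def insert(self, clustering_key, row):
        i = bisect.bisect_left(self._keys, clustering_key)
        if i < len(self._keys) and self._keys[i] == clustering_key:
            self._rows[i] = row            # same key: overwrite (upsert)
        else:
            self._keys.insert(i, clustering_key)
            self._rows.insert(i, row)

    def slice(self, start, end):
        """Rows with start <= clustering key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return self._rows[lo:hi]


# Hypothetical table: partition key = sensor id, clustering key = timestamp.
table = {}
readings = [("s1", 30, 21.5), ("s1", 10, 20.0), ("s2", 15, 18.2), ("s1", 20, 20.7)]
for sensor, ts, temp in readings:
    table.setdefault(sensor, Partition()).insert(ts, {"ts": ts, "temp": temp})

# One partition answers the range query; no other partition is touched.
window = table["s1"].slice(10, 30)
print([r["ts"] for r in window])  # prints [10, 20]
```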

That's the price of using distributed systems.

Oh, and then there's the need to rewrite the data access layer
of your entire existing app. :)

AOL and Blizzard talked about porting a couple apps to Cassandra
at the conference last week, but they sounded like trivial user-db
("UDB") apps, and even then Patrick was usually credited with the
data modelling.

I haven't heard of anybody porting a 100+ table Oracle or MySQL
app to C* yet. I'm sure it's been done, but most of the
apps written for C* are greenfield or v2.0 rewrites.

Thanks, James Briggs
--
Cassandra/MySQL DBA. Available in San Jose area or remote.

________________________________
 From: Ram N <yr...@gmail.com>
To: user <us...@cassandra.apache.org> 
Sent: Monday, September 15, 2014 1:34 PM
Subject: Re: C 2.1

Jack, 

Using Solr or an external search/indexing service is an option but increases the complexity of managing different systems. I am curious to understand the impact of having wide-rows on a separate CF for inverted index purpose which if I understand correctly is what Rob's response, having a separate CF for index is better than using the default Secondary index option. 

Would be great to understand the design decision to go with present implementation on Secondary Index when the alternative is better? Looking at JIRAs is still confusing to come up with the why :) 

--R 

On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky <ja...@basetechnology.com> wrote:

>If you’re indexing and querying on that many columns (dozens, or more than
>a handful), consider DSE/Solr, especially if you need to query on multiple
>columns in the same query.
>
>-- Jack Krupansky
> 
>From: Robert Coli 
>Sent: Monday, September 15, 2014 11:07 AM
>To: user@cassandra.apache.org 
>Subject: Re: C 2.1
> 
>On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:
>
>Is 2.1 a production ready release? 
> 
>https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
> 
>     Datastax Java driver - I get too confused with  CQL and the underlying storage model. I am also not clear on the indexing  structure of columns. Does CQL indexes create a separate CF for the index  table? How is it different from maintaining inverted index? Internally both  are the same? Does cql stmt to create index, creates a separate CF and has an  atomic way of updating/managing them? Which one is better to scale? (something  like stargate-core or the ones done by usergrid? or the CQL  approach?)
> 
>New projects should use CQL. Access to underlying storage via Thrift is
>likely to eventually be removed from Cassandra.
> 
>On a separate note just curious if I have 1000's of columns in a given  row and a fixed set of indexed column  (say 30 - 50 columns) which  approach should I be taking? Will cassandra scale with these many indexed  column? Are there any limits? How much of an impact do CQL indexes create on  the system? I am also not sure if these use cases are the right choice for  cassandra but would really appreciate any response on these.  Thanks.
> 
>Use of the "Secondary Indexes" feature is generally an anti-pattern in
>Cassandra. 30-50 indexed columns in a row sounds insane to me. However 30-50
>column families into which one manually denormalized does not sound too insane
>to me...
> 
>=Rob
>http://twitter.com/rcolidba

Re: C 2.1

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Sep 15, 2014 at 1:34 PM, Ram N <yr...@gmail.com> wrote:

> Would be great to understand the design decision to go with present
> implementation on Secondary Index when the alternative is better? Looking
> at JIRAs is still confusing to come up with the why :)
>

http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/201405.mbox/%3CCAEDUwd1i2BwJ-PAFE1qhjQFZ=qz2vA_vXWo_jDYCmS8EvkBSLQ@mail.gmail.com%3E

(I really should formalize this guy into a blog post, recursive mailing
list links are kinda lol...)

=Rob

Re: C 2.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

Stratio and Stargate are at the Lucene level – DSE/Solr is at the Solr level. DSE/Solr supports both inserts and queries from either Cassandra or Solr – a Solr server is running on each Cassandra node that indexes and queries the data on that node.

DSE/Solr does have CQL SELECT integration as well, but supports Solr query syntax rather than needing to pass a structured JSON format.

SELECT * FROM persons WHERE solr_query='name:jo* age:[20 TO 40]';

And your app can use SolrJ or raw HTTP requests to talk to Solr within DSE as well.

-- Jack Krupansky

From: Ram N 
Sent: Wednesday, September 17, 2014 5:25 PM
To: user 
Subject: Re: C 2.1


Thanks Rob for pointing me to that link. I haven't gone through all the JIRAs but I guess it talks about adv & disadv of Secondary Index in Cassandra which I understand by now but doesn't really talk about why the default implementation of Secondary Index didn't take the DSE/Solr approach?

Hi Jack,

Thats good to know but any pointers on how is this any different than https://github.com/Stratio/stratio-cassandra or http://stargate-core.readthedocs.org/en/latest/intro.html ? 

--Ram


On Tue, Sep 16, 2014 at 10:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:

  DSE/Solr is tightly integrated, so there is no “external” system to manage – insert data in CQL and within a few seconds it is available for query from Solr running in the same JVM as Cassandra. DSE/Solr indexes the data on each Cassandra node, and uses Cassandra’s cluster management for distributing queries across the cluster. And... Lucene (underneath Solr) is optimal for queries that span multiple fields. DSE/Solr supports CQL3 wide rows (clustering columns.)

  -- Jack Krupansky

  From: Ram N 
  Sent: Monday, September 15, 2014 4:34 PM
  To: user 
  Subject: Re: C 2.1


  Jack, 

  Using Solr or an external search/indexing service is an option but increases the complexity of managing different systems. I am curious to understand the impact of having wide-rows on a separate CF for inverted index purpose which if I understand correctly is what Rob's response, having a separate CF for index is better than using the default Secondary index option. 

  Would be great to understand the design decision to go with present implementation on Secondary Index when the alternative is better? Looking at JIRAs is still confusing to come up with the why :) 

  --R 





  On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky <ja...@basetechnology.com> wrote:

    If you’re indexing and querying on that many columns (dozens, or more than a handful), consider DSE/Solr, especially if you need to query on multiple columns in the same query.

    -- Jack Krupansky

    From: Robert Coli 
    Sent: Monday, September 15, 2014 11:07 AM
    To: user@cassandra.apache.org 
    Subject: Re: C 2.1

    On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:

      Is 2.1 a production ready release? 

    https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


           Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does CQL indexes create a separate CF for the index table? How is it different from maintaining inverted index? Internally both are the same? Does cql stmt to create index, creates a separate CF and has an atomic way of updating/managing them? Which one is better to scale? (something like stargate-core or the ones done by usergrid? or the CQL approach?)

    New projects should use CQL. Access to underlying storage via Thrift is likely to eventually be removed from Cassandra.

      On a separate note just curious if I have 1000's of columns in a given row and a fixed set of indexed column  (say 30 - 50 columns) which approach should I be taking? Will cassandra scale with these many indexed column? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for cassandra but would really appreciate any response on these. Thanks.

    Use of the "Secondary Indexes" feature is generally an anti-pattern in Cassandra. 30-50 indexed columns in a row sounds insane to me. However 30-50 column families into which one manually denormalized does not sound too insane to me...

    =Rob
    http://twitter.com/rcolidba

Re: C 2.1

Posted by Ram N <yr...@gmail.com>.

Thanks, Rob, for pointing me to that link. I haven't gone through all the
JIRAs, but I gather it covers the advantages and disadvantages of secondary
indexes in Cassandra, which I understand by now. It doesn't really explain
why the default secondary index implementation didn't take the DSE/Solr
approach, though.

Hi Jack,

That's good to know, but are there any pointers on how this differs from
https://github.com/Stratio/stratio-cassandra or
http://stargate-core.readthedocs.org/en/latest/intro.html ?

--Ram


On Tue, Sep 16, 2014 at 10:32 PM, Jack Krupansky <ja...@basetechnology.com>
wrote:

>   DSE/Solr is tightly integrated, so there is no “external” system to
> manage – insert data in CQL and within a few seconds it is available for
> query from Solr running in the same JVM as Cassandra. DSE/Solr indexes the
> data on each Cassandra node, and uses Cassandra’s cluster management for
> distributing queries across the cluster. And... Lucene (underneath Solr) is
> optimal for queries that span multiple fields. DSE/Solr supports CQL3 wide
> rows (clustering columns.)
>
> -- Jack Krupansky
>
>  *From:* Ram N <yr...@gmail.com>
> *Sent:* Monday, September 15, 2014 4:34 PM
> *To:* user <us...@cassandra.apache.org>
> *Subject:* Re: C 2.1
>
>
> Jack,
>
> Using Solr or an external search/indexing service is an option but
> increases the complexity of managing different systems. I am curious to
> understand the impact of having wide-rows on a separate CF for inverted
> index purpose which if I understand correctly is what Rob's response,
> having a separate CF for index is better than using the default Secondary
> index option.
>
> Would be great to understand the design decision to go with present
> implementation on Secondary Index when the alternative is better? Looking
> at JIRAs is still confusing to come up with the why :)
>
> --R
>
>
>
>
>
> On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky <ja...@basetechnology.com>
> wrote:
>
>>   If you’re indexing and querying on that many columns (dozens, or more
>> than a handful), consider DSE/Solr, especially if you need to query on
>> multiple columns in the same query.
>>
>> -- Jack Krupansky
>>
>>  *From:* Robert Coli <rc...@eventbrite.com>
>> *Sent:* Monday, September 15, 2014 11:07 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: C 2.1
>>
>>    On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:
>>
>>>  Is 2.1 a production ready release?
>>>
>>
>> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>>
>>
>>>       Datastax Java driver - I get too confused with CQL and the
>>> underlying storage model. I am also not clear on the indexing structure of
>>> columns. Does CQL indexes create a separate CF for the index table? How is
>>> it different from maintaining inverted index? Internally both are the same?
>>> Does cql stmt to create index, creates a separate CF and has an atomic way
>>> of updating/managing them? Which one is better to scale? (something like
>>> stargate-core or the ones done by usergrid? or the CQL approach?)
>>>
>>
>> New projects should use CQL. Access to underlying storage via Thrift is
>> likely to eventually be removed from Cassandra.
>>
>>
>>>  On a separate note just curious if I have 1000's of columns in a given
>>> row and a fixed set of indexed column  (say 30 - 50 columns) which approach
>>> should I be taking? Will cassandra scale with these many indexed column?
>>> Are there any limits? How much of an impact do CQL indexes create on the
>>> system? I am also not sure if these use cases are the right choice for
>>> cassandra but would really appreciate any response on these. Thanks.
>>>
>>
>> Use of the "Secondary Indexes" feature is generally an anti-pattern in
>> Cassandra. 30-50 indexed columns in a row sounds insane to me. However
>> 30-50 column families into which one manually denormalized does not sound
>> too insane to me...
>>
>> =Rob
>> http://twitter.com/rcolidba
>>
>
>

Re: C 2.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

DSE/Solr is tightly integrated, so there is no “external” system to manage – insert data in CQL and within a few seconds it is available for query from Solr running in the same JVM as Cassandra. DSE/Solr indexes the data on each Cassandra node, and uses Cassandra’s cluster management for distributing queries across the cluster. And... Lucene (underneath Solr) is optimal for queries that span multiple fields. DSE/Solr supports CQL3 wide rows (clustering columns).

-- Jack Krupansky

From: Ram N 
Sent: Monday, September 15, 2014 4:34 PM
To: user 
Subject: Re: C 2.1

Jack, 

Using Solr or an external search/indexing service is an option but increases the complexity of managing different systems. I am curious to understand the impact of having wide-rows on a separate CF for inverted index purpose which if I understand correctly is what Rob's response, having a separate CF for index is better than using the default Secondary index option. 

Would be great to understand the design decision to go with present implementation on Secondary Index when the alternative is better? Looking at JIRAs is still confusing to come up with the why :) 

--R 

On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky <ja...@basetechnology.com> wrote:

  If you’re indexing and querying on that many columns (dozens, or more than a handful), consider DSE/Solr, especially if you need to query on multiple columns in the same query.

  -- Jack Krupansky

  From: Robert Coli 
  Sent: Monday, September 15, 2014 11:07 AM
  To: user@cassandra.apache.org 
  Subject: Re: C 2.1

  On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:

    Is 2.1 a production ready release? 

  https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

         Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does CQL indexes create a separate CF for the index table? How is it different from maintaining inverted index? Internally both are the same? Does cql stmt to create index, creates a separate CF and has an atomic way of updating/managing them? Which one is better to scale? (something like stargate-core or the ones done by usergrid? or the CQL approach?)

  New projects should use CQL. Access to underlying storage via Thrift is likely to eventually be removed from Cassandra.

    On a separate note just curious if I have 1000's of columns in a given row and a fixed set of indexed column  (say 30 - 50 columns) which approach should I be taking? Will cassandra scale with these many indexed column? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for cassandra but would really appreciate any response on these. Thanks.

  Use of the "Secondary Indexes" feature is generally an anti-pattern in Cassandra. 30-50 indexed columns in a row sounds insane to me. However 30-50 column families into which one manually denormalized does not sound too insane to me...

  =Rob
  http://twitter.com/rcolidba

Re: C 2.1

Posted by Ram N <yr...@gmail.com>.

Jack,

Using Solr or an external search/indexing service is an option, but it
increases the complexity of managing different systems. I am curious to
understand the impact of keeping wide rows in a separate CF as an inverted
index. If I understand Rob's response correctly, a separate CF for the
index is better than using the default secondary index option.

It would be great to understand the design decision behind the present
secondary index implementation when the alternative is better. Looking at
the JIRAs, it is still hard to work out the why :)

--R





On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky <ja...@basetechnology.com>
wrote:

>   If you’re indexing and querying on that many columns (dozens, or more
> than a handful), consider DSE/Solr, especially if you need to query on
> multiple columns in the same query.
>
> -- Jack Krupansky
>
>  *From:* Robert Coli <rc...@eventbrite.com>
> *Sent:* Monday, September 15, 2014 11:07 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: C 2.1
>
>   On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:
>
>>  Is 2.1 a production ready release?
>>
>
> https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
>
>
>>       Datastax Java driver - I get too confused with CQL and the
>> underlying storage model. I am also not clear on the indexing structure of
>> columns. Does CQL indexes create a separate CF for the index table? How is
>> it different from maintaining inverted index? Internally both are the same?
>> Does cql stmt to create index, creates a separate CF and has an atomic way
>> of updating/managing them? Which one is better to scale? (something like
>> stargate-core or the ones done by usergrid? or the CQL approach?)
>>
>
> New projects should use CQL. Access to underlying storage via Thrift is
> likely to eventually be removed from Cassandra.
>
>
>>  On a separate note just curious if I have 1000's of columns in a given
>> row and a fixed set of indexed column  (say 30 - 50 columns) which approach
>> should I be taking? Will cassandra scale with these many indexed column?
>> Are there any limits? How much of an impact do CQL indexes create on the
>> system? I am also not sure if these use cases are the right choice for
>> cassandra but would really appreciate any response on these. Thanks.
>>
>
> Use of the "Secondary Indexes" feature is generally an anti-pattern in
> Cassandra. 30-50 indexed columns in a row sounds insane to me. However
> 30-50 column families into which one manually denormalized does not sound
> too insane to me...
>
> =Rob
> http://twitter.com/rcolidba
>

Re: C 2.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

If you’re indexing and querying on that many columns (dozens, or more than a handful), consider DSE/Solr, especially if you need to query on multiple columns in the same query.

-- Jack Krupansky

From: Robert Coli 
Sent: Monday, September 15, 2014 11:07 AM
To: user@cassandra.apache.org 
Subject: Re: C 2.1

On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:

  Is 2.1 a production ready release? 

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

       Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does CQL indexes create a separate CF for the index table? How is it different from maintaining inverted index? Internally both are the same? Does cql stmt to create index, creates a separate CF and has an atomic way of updating/managing them? Which one is better to scale? (something like stargate-core or the ones done by usergrid? or the CQL approach?)

New projects should use CQL. Access to underlying storage via Thrift is likely to eventually be removed from Cassandra.

  On a separate note just curious if I have 1000's of columns in a given row and a fixed set of indexed column  (say 30 - 50 columns) which approach should I be taking? Will cassandra scale with these many indexed column? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for cassandra but would really appreciate any response on these. Thanks.

Use of the "Secondary Indexes" feature is generally an anti-pattern in Cassandra. 30-50 indexed columns in a row sounds insane to me. However 30-50 column families into which one manually denormalized does not sound too insane to me...

=Rob
http://twitter.com/rcolidba

Re: C 2.1

Posted by Robert Coli <rc...@eventbrite.com>.

On Sat, Sep 13, 2014 at 3:49 PM, Ram N <yr...@gmail.com> wrote:

> Is 2.1 a production ready release?
>

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/


>      Datastax Java driver - I get too confused with CQL and the underlying
> storage model. I am also not clear on the indexing structure of columns.
> Does CQL indexes create a separate CF for the index table? How is it
> different from maintaining inverted index? Internally both are the same?
> Does cql stmt to create index, creates a separate CF and has an atomic way
> of updating/managing them? Which one is better to scale? (something like
> stargate-core or the ones done by usergrid? or the CQL approach?)
>

New projects should use CQL. Access to underlying storage via Thrift is
likely to eventually be removed from Cassandra.


> On a separate note just curious if I have 1000's of columns in a given row
> and a fixed set of indexed column  (say 30 - 50 columns) which approach
> should I be taking? Will cassandra scale with these many indexed column?
> Are there any limits? How much of an impact do CQL indexes create on the
> system? I am also not sure if these use cases are the right choice for
> cassandra but would really appreciate any response on these. Thanks.
>

Use of the "Secondary Indexes" feature is generally an anti-pattern in
Cassandra. 30-50 indexed columns in a row sounds insane to me. However,
30-50 column families into which one manually denormalizes does not sound
too insane to me...
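
A rough idea of what manual denormalization asks of the application, as a
toy Python sketch rather than driver code. DenormalizedStore, the users
rows and the city/plan columns are all hypothetical. Each write goes to
the base table and to one lookup table per queried column, and the
application clears stale lookup entries itself when a value changes.

```python
class DenormalizedStore:
    """Toy stand-in for a base table plus one lookup table per queried column."""

    def __init__(self, lookup_columns):
        self.base = {}  # user_id -> row
        # column name -> column value -> {user_id: row}
        self.lookups = {col: {} for col in lookup_columns}

    def write(self, row):
        user_id = row["user_id"]
        old = self.base.get(user_id)
        if old is not None:
            # The application must clear stale lookup entries itself.
            for col, table in self.lookups.items():
                table.get(old[col], {}).pop(user_id, None)
        self.base[user_id] = dict(row)
        for col, table in self.lookups.items():
            table.setdefault(row[col], {})[user_id] = dict(row)

    def find(self, col, value):
        """Read the lookup table for `col`; no index scan is involved."""
        return list(self.lookups[col].get(value, {}).values())


store = DenormalizedStore(["city", "plan"])
store.write({"user_id": "u1", "city": "San Jose", "plan": "free"})
store.write({"user_id": "u2", "city": "San Jose", "plan": "paid"})
store.write({"user_id": "u1", "city": "Oakland", "plan": "free"})  # u1 moves

print([r["user_id"] for r in store.find("city", "San Jose")])  # prints ['u2']
print([r["user_id"] for r in store.find("plan", "free")])      # prints ['u1']
```

In Cassandra itself, grouping such writes in a logged batch is one way to
keep the tables in step, but the stale-entry cleanup shown in write() is
still the application's job.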

=Rob
http://twitter.com/rcolidba

Re: C 2.1

Posted by James Briggs <ja...@yahoo.com>.

Hi Ram.

1) As an Operations DBA, I consider all versions of Cassandra to be alpha.

So whether you pick 2.0.10 or 2.1.0 doesn't really matter since you
will have to do your own acceptance testing.

2) Data modelling is everything when it comes to a distributed database
like Cassandra. You can read my blog post which is a quick way to get
up to speed with CQL:

Notes on “Getting Started with Time Series Data Modeling” in Cassandra
http://jbriggs.com/blog/2014/09/notes-on-getting-started-with-time-series-data-modeling-in-cassandra/
 
Thanks, James Briggs
--
Cassandra/MySQL DBA. Available in San Jose area or remote.



________________________________
 From: Ram N <yr...@gmail.com>
To: user@cassandra.apache.org 
Sent: Saturday, September 13, 2014 3:49 PM
Subject: C 2.1
 


Team,

I am pretty new to cassandra (with just 2 weeks of playing around with it on and off) and planning a fresh deployment with 2.1 release. The data-model is pretty simple for my use-case.  Questions I have in mind are

Is 2.1 a production ready release? 
Driver selection?
    I played around with Hector, Astyanax and Java driver? 
     I don't see much activity happening on Hector,
     For Astyanax - Love the Fluent style of writing code and abstractions, recipes, pooling etc
     Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Does CQL indexes create a separate CF for the index table? How is it different from maintaining inverted index? Internally both are the same? Does cql stmt to create index, creates a separate CF and has an atomic way of updating/managing them? Which one is better to scale? (something like stargate-core or the ones done by usergrid? or the CQL approach?)

On a separate note just curious if I have 1000's of columns in a given row and a fixed set of indexed column  (say 30 - 50 columns) which approach should I be taking? Will cassandra scale with these many indexed column? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for cassandra but would really appreciate any response on these. Thanks.

-R