Posted to user@cassandra.apache.org by Marcelo Elias Del Valle <mv...@gmail.com> on 2012/09/18 00:28:00 UTC

Is Cassandra right for me?

Hello,

     I am new to Cassandra and am unsure whether it is the right
technology for the architecture I am defining. Also, I saw a
presentation which said that if I don't have rows with more than a hundred
rows in Cassandra, either I am doing something wrong or I shouldn't be
using Cassandra. Therefore, it might be the case that I am doing something
wrong. If you could help me answer these questions by
giving any feedback, it would be highly appreciated.
     Here is my need and what I am thinking of using Cassandra for:

   - I need to support a high volume of writes per second. I might have a
   billion writes per hour.
   - I need to write unstructured data that will be processed later by
   Hadoop jobs to generate structured data from it. Later, I index the
   structured data using Solr or Solandra, so the data can be queried by my
   end-user application. Is Cassandra recommended for that, or should I be
   thinking of writing directly to HDFS files, for instance? What is the main
   advantage I get from storing data in a NoSQL service like Cassandra,
   compared to storing files in HDFS?
   - Usually I will write JSON data associated with an ID, and my Hadoop
   jobs will process this data and write the results to a database. I have two
   questions here:
      - If I don't need to perform complicated queries in Cassandra, should
      I store the JSON-like data just as a column value? I am afraid of doing
      something wrong here, as I would just need to store the JSON file and
      5 or 6 more fields to query the files later.
      - Does it make sense to you to use Hadoop to process data from
      Cassandra and store the results in a database like HBase? Once I have
      structured data, is there any reason I should use Cassandra instead of
      HBase?

     I am sorry if the questions are too basic. I have been watching a lot
of videos and reading a lot of documentation about Cassandra, but honestly,
the more I read, the more questions I have.

Thanks in advance.

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by "Hiller, Dean" <De...@nrel.gov>.

Cassandra is fully aware of all tables created with playOrm and you can still use DataStax Enterprise features to get real-time analytics.  playOrm is a layer on top of Cassandra and, as with any layer, it makes a developer more productive at a slight cost in performance, just like Hibernate on top of JDBC.  In some cases, though, we find that because someone uses S-SQL instead of reading in full rows themselves, it has actually sped up their application, which is unusual when putting a layer on top of the interface to Cassandra.

Also, playOrm is working on an ad-hoc query tool to view all indexes created by playOrm as well as query into all rows in partitions, so you can inspect your data ad hoc much more easily.  CQL can also be used as a complement to S-SQL (playOrm's SQL with partitions) in that you could analyze a full table, but CQL doesn't do joins, you have to use the equality operator, and there are other limitations.  S-SQL is limited to viewing into partitions, which is okay for many OLTP applications.  For analytics, one usually needs to break out of the partitions and look at the more global data set, i.e. map/reduce and CQL help there.

Later,
Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Tuesday, September 18, 2012 10:50 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Is Cassandra right for me?

I also saw that if I plan to use DataStax Enterprise to get real-time analytics, my data would need to be stored in Cassandra's usual format. It would be harder for me to use playOrm if I am planning to use advanced DataStax features, like Solr indexing data on Cassandra in real time without copying columns, wouldn't it? I don't know much about this Solr feature yet, but my understanding today is that it wouldn't be aware of the tables I create with playOrm, just of the column families this framework uses to store the data, right?

Re: Is Cassandra right for me?

Posted by "Hiller, Dean" <De...@nrel.gov>.

Oh, and yes, that is the correct link.

Dean

From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Tuesday, September 18, 2012 10:50 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Is Cassandra right for me?

You're talking about this project, right?  https://github.com/deanhiller/playorm
I will take a look. However, I don't think using Cassandra's model itself (with CFs / key-values) would be a problem; I just need to know where the advantage lies. From your answer, my guess is it lies in better performance and more control.

I also saw that if I plan to use DataStax Enterprise to get real-time analytics, my data would need to be stored in Cassandra's usual format. It would be harder for me to use playOrm if I am planning to use advanced DataStax features, like Solr indexing data on Cassandra in real time without copying columns, wouldn't it? I don't know much about this Solr feature yet, but my understanding today is that it wouldn't be aware of the tables I create with playOrm, just of the column families this framework uses to store the data, right?

2012/9/18 Hiller, Dean <De...@nrel.gov>>
Until Aaron replies, here are my thoughts on the relational piece…

           If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?

The playOrm project explores exactly this issue. A query on 1,000,000 rows in a single partition took only 60 ms, AND you can do joins with its S-SQL language.  The answer is a resounding YES, you can put relational data in Cassandra.  The writes are way faster than a DBMS, and joins and SQL can be just as fast and in many cases FASTER on NoSQL IF you partition your data properly.  An S-SQL statement on playOrm looks like this:

PARTITIONS t(:partitionId) SELECT t FROM Trades as t where t.numShares > 10

You can have as many partitions as you want and a single partition can have millions of rows though I would not exceed 10 million probably.

Later,
Dean

2012/9/18 aaron morton <aa...@thelastpickle.com>>>
Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, whether I am doing something wrong or I shouldn't be using Cassandra.
I do not agree with that statement. (I read that as rows with more than a hundred _columns_)

 *   I need to support a high volume of writes per second. I might have a billion writes per hour

That's about 280K/sec. Netflix did a benchmark that shows 1.1M/sec: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
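
As a quick check of that arithmetic, here is a minimal sketch in Python. The constant names are illustrative only; the two figures are the ones quoted above (one billion writes per hour, and the 1.1M writes/sec reported for the Netflix benchmark).

```python
# One billion writes per hour, expressed as writes per second.
WRITES_PER_HOUR = 1_000_000_000
SECONDS_PER_HOUR = 60 * 60

writes_per_second = WRITES_PER_HOUR / SECONDS_PER_HOUR
print(f"{writes_per_second:,.0f} writes/sec")

# Share of the 1.1M writes/sec figure quoted from the Netflix benchmark.
NETFLIX_BENCHMARK_WPS = 1_100_000
share_of_benchmark = writes_per_second / NETFLIX_BENCHMARK_WPS
print(f"{share_of_benchmark:.0%} of the benchmark rate")
```

This says nothing about the cluster size needed to sustain that rate; it only shows that the requirement is roughly a quarter of the benchmark figure.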

 *   I need to write non-structured data that will be processed later by hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be consulted by my end user application. Is Cassandra recommended for that, or should I be thinking of writing directly to HDFS files, for instance? What's the main advantage I get from storing data in a nosql service like Cassandra, when compared to storing files into HDFS?

You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.

 *   If I don't need to perform complicated queries in Cassandra, should I store the json-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the json file and some more 5 or 6 fields to query the files later.

Store the data in the way that best supports the read queries you want to make. If you always read all the fields, or it's a canonical record of events, storing it as JSON may be best. If you often get a few fields, and maybe they are updated, storing each field as a column value may be best.
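
To make the two layouts concrete, here is a minimal sketch in Python that models a row as a plain dict of column name to value. It does not use any Cassandra client, and the event, its field names and the helper functions are all hypothetical.

```python
import json

# Hypothetical event; the field names are illustrative only.
event = {"id": "evt-42", "user": "alice", "type": "click",
         "payload": {"x": 10, "y": 20}}

def as_json_blob(record):
    """Layout A: the whole record serialized into a single column value."""
    return {record["id"]: {"json": json.dumps(record, sort_keys=True)}}

def as_columns(record):
    """Layout B: one column per top-level field; nested values stay JSON-encoded."""
    columns = {}
    for name, value in record.items():
        if name == "id":
            continue  # the id is used as the row key, not as a column
        columns[name] = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
    return {record["id"]: columns}

blob_row = as_json_blob(event)
column_row = as_columns(event)

# Reading one field: layout A has to decode the whole blob,
# layout B reads (and could update) a single column.
user_from_blob = json.loads(blob_row["evt-42"]["json"])["user"]
user_from_columns = column_row["evt-42"]["user"]
```

Layout A suits records that are always read whole; layout B suits reads and updates that touch only a few fields.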

 *   Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?

It depends on how many moving parts you are comfortable with. Same for the questions about HDFS etc. Start with the smallest amount of infrastructure.

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/09/2012, at 10:28 AM, Marcelo Elias Del Valle <mv...@gmail.com>>> wrote:

Hello,

     I am new to Cassandra and I am in doubt if Cassandra is the right technology to use in the architecture I am defining. Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, whether I am doing something wrong or I shouldn't be using Cassandra. Therefore, it might be the case I am doing something wrong. If you could help me to find out the answer for these questions by giving any feedback, it would be highly appreciated.
     Here is my need and what I am thinking in using Cassandra for:

 *   I need to support a high volume of writes per second. I might have a billion writes per hour
 *   I need to write non-structured data that will be processed later by hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be consulted by my end user application. Is Cassandra recommended for that, or should I be thinking in writting directly to HDFS files, for instance? What's the main advantage I get from storing data in a nosql service like Cassandra, when compared to storing files into HDFS?
 *   Usually I will write json data associated to an ID and my hadoop processes will process this data to write data to a database. I have two doubts here:
    *   If I don't need to perform complicated queries in Cassandra, should I store the json-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the json file and some more 5 or 6 fields to query the files later.
    *   Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?

     I am sorry if the questions are too dummy, I have been watching a lot of videos and reading a lot of documentation about Cassandra, but honestly, more I read more I have questions.

Thanks in advance.

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.

You're talking about this project, right?
https://github.com/deanhiller/playorm
I will take a look. However, I don't think using Cassandra's model itself
(with CFs / key-values) would be a problem; I just need to know where the
advantage lies. From your answer, my guess is it lies in better
performance and more control.

I also saw that if I plan to use DataStax Enterprise to get real-time
analytics, my data would need to be stored in Cassandra's usual format. It
would be harder for me to use playOrm if I am planning to use advanced
DataStax features, like Solr indexing data on Cassandra in real time
without copying columns, wouldn't it? I don't know much about this Solr
feature yet, but my understanding today is that it wouldn't be aware of the
tables I create with playOrm, just of the column families this framework
uses to store the data, right?




2012/9/18 Hiller, Dean <De...@nrel.gov>

> Until Aaron replies, here are my thoughts on the relational piece…
>
>            If everything in my model fits into a relational database, if
> my data is structured, would it still be a good idea to use Cassandra? Why?
>
> The playOrm project explores exactly this issue……A query on 1,000,000 rows
> in a single partition only took 60ms AND you can do joins with its S-SQL
> language.  The answer is a resounding YES, you can put relational data in
> cassandra.  The writes are way faster than a DBMS and joins and SQL can be
> just as fast and in many cases FASTER on noSQL IF you partition your data
> properly.  A S-SQL statement looks like so on playOrm
>
> PARTITIONS t(:partitionId) SELECT t FROM Trades as t where t.numShares > 10
>
> You can have as many partitions as you want and a single partition can
> have millions of rows though I would not exceed 10 million probably.
>
> Later,
> Dean
>
> 2012/9/18 aaron morton <aaron@thelastpickle.com<mailto:
> aaron@thelastpickle.com>>
> Also, I saw a presentation which said that if I don't have rows with more
> than a hundred rows in Cassandra, whether I am doing something wrong or I
> shouldn't be using Cassandra.
> I do not agree with that statement. (I read that as rows with more than a
> hundred _columns_)
>
>
>  *   I need to support a high volume of writes per second. I might have a
> billion writes per hour
>
> That's about 280K/sec. Netflix did a benchmark that shows 1.1M/sec:
> http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
>
>
>  *   I need to write non-structured data that will be processed later by
> hadoop processes to generate structured data from it. Later, I index the
> structured data using SOLR or SOLANDRA, so the data can be consulted by my
> end user application. Is Cassandra recommended for that, or should I be
> thinking in writting directly to HDFS files, for instance? What's the main
> advantage I get from storing data in a nosql service like Cassandra, when
> compared to storing files into HDFS?
>
> You can query your data using Hadoop easily enough. You may want to take a
> look at DSE from  http://datastax.com/ it makes using Hadoop and Solr
> with cassandra easier.
>
>
>  *   If I don't need to perform complicated queries in Cassandra, should I
> store the json-like data just as a column value? I am afraid of doing
> something wrong here, as I would need just to store the json file and some
> more 5 or 6 fields to query the files later.
>
> Store the data in the way that best supports the read queries you want to
> make. If you always read all the fields, or it's a canonical record of
> events storing as JSON may be best. If you often get a few fields, and
> maybe they are updated, storing each field as a column value may be best.
>
>
>  *   Does it make sense to you to use hadoop to process data from
> Cassandra and store the results in a database, like HBase? Once I have
> structured data, is there any reason I should use Cassandra instead of
> HBase?
>
> It depends on how many moving parts you are comfortable with. Same for the
> questions about HDFS etc. Start with the smallest amount of infrastructure.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/09/2012, at 10:28 AM, Marcelo Elias Del Valle <mvallebr@gmail.com
> <ma...@gmail.com>> wrote:
>
> Hello,
>
>      I am new to Cassandra and I am in doubt if Cassandra is the right
> technology to use in the architecture I am defining. Also, I saw a
> presentation which said that if I don't have rows with more than a hundred
> rows in Cassandra, whether I am doing something wrong or I shouldn't be
> using Cassandra. Therefore, it might be the case I am doing something
> wrong. If you could help me to find out the answer for these questions by
> giving any feedback, it would be highly appreciated.
>      Here is my need and what I am thinking in using Cassandra for:
>
>  *   I need to support a high volume of writes per second. I might have a
> billion writes per hour
>  *   I need to write non-structured data that will be processed later by
> hadoop processes to generate structured data from it. Later, I index the
> structured data using SOLR or SOLANDRA, so the data can be consulted by my
> end user application. Is Cassandra recommended for that, or should I be
> thinking in writting directly to HDFS files, for instance? What's the main
> advantage I get from storing data in a nosql service like Cassandra, when
> compared to storing files into HDFS?
>  *   Usually I will write json data associated to an ID and my hadoop
> processes will process this data to write data to a database. I have two
> doubts here:
>     *   If I don't need to perform complicated queries in Cassandra,
> should I store the json-like data just as a column value? I am afraid of
> doing something wrong here, as I would need just to store the json file and
> some more 5 or 6 fields to query the files later.
>     *   Does it make sense to you to use hadoop to process data from
> Cassandra and store the results in a database, like HBase? Once I have
> structured data, is there any reason I should use Cassandra instead of
> HBase?
>
>      I am sorry if the questions are too dummy, I have been watching a lot
> of videos and reading a lot of documentation about Cassandra, but honestly,
> more I read more I have questions.
>
> Thanks in advance.
>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.

Thanks a lot! Things are much more clear now.

2012/9/21 Michael Kjellman <mk...@barracuda.com>

> Brisk is no longer actively developed by the original author or Datastax.
> It was left up for the community.
>
> https://github.com/steeve/brisk
>
> Has a fork that is supposedly compatible with 1.0 API
>
> You're more than welcome to fork that and make it work with 1.1 :)
>
> DSE != (Cassandra + Brisk)
>
> From: Marcelo Elias Del Valle <mvallebr@gmail.com<mailto:
> mvallebr@gmail.com>>
> Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Date: Friday, September 21, 2012 10:27 AM
> To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <
> user@cassandra.apache.org<ma...@cassandra.apache.org>>
> Subject: Re: Is Cassandra right for me?
>
>
>
> 2012/9/20 aaron morton <aaron@thelastpickle.com<mailto:
> aaron@thelastpickle.com>>
> Actually, if I use community edition for now, I wouldn't be able to use
> hadoop against data stored in CFS?
> AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able
> to use Hadoop against it, in the same way you can use hadoop against Apache
> Cassandra.
>
> You "can do" anything with computers if you have enough time and patience.
> DSE reduces the amount of time and patience needed to run Hadoop over
> Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store
> that run on Cassandra. This reduces the number of moving parts you need to
> provision.
>
> Can I use BRISK with Apache Cassandra, without changing Brisk or
> Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am
> afraid of writting hadoop process now and have to change them when I hire
> DSE support.
>
> I am not an expert in the Apache 2.0 license, but in my understanding Data
> Stax modified Apache Cassandra and included modifications to it in the
> version they sell. At the same time I am interested in hiring their
> support, I wanna keep compatibility with the open source version
> distributed in the mainstream, just in case I want to stop hiring their
> support at any time.
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
>
>
>
>
>


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by Michael Kjellman <mk...@barracuda.com>.

Brisk is no longer actively developed by the original author or Datastax. It was left up for the community.

There is a fork that is supposedly compatible with the 1.0 API:

https://github.com/steeve/brisk

You're more than welcome to fork that and make it work with 1.1 :)

DSE != (Cassandra + Brisk)

From: Marcelo Elias Del Valle <mv...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Friday, September 21, 2012 10:27 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Is Cassandra right for me?



2012/9/20 aaron morton <aa...@thelastpickle.com>>
Actually, if I use community edition for now, I wouldn't be able to use hadoop against data stored in CFS?
AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able to use Hadoop against it, in the same way you can use Hadoop against Apache Cassandra.

You "can do" anything with computers if you have enough time and patience. DSE reduces the amount of time and patience needed to run Hadoop over Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store that run on Cassandra. This reduces the number of moving parts you need to provision.

Can I use BRISK with Apache Cassandra, without changing Brisk or Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am afraid of writting hadoop process now and have to change them when I hire DSE support.

I am not an expert in the Apache 2.0 license, but in my understanding Data Stax modified Apache Cassandra and included modifications to it in the version they sell. At the same time I am interested in hiring their support, I wanna keep compatibility with the open source version distributed in the mainstream, just in case I want to stop hiring their support at any time.


--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: Is Cassandra right for me?

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.

2012/9/20 aaron morton <aa...@thelastpickle.com>

> Actually, if I use community edition for now, I wouldn't be able to use
> hadoop against data stored in CFS?
>
> AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able
> to use Hadoop against it, in the same way you can use hadoop against Apache
> Cassandra.
>
> You "can do" anything with computers if you have enough time and patience.
> DSE reduces the amount of time and patience needed to run Hadoop over
> Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store
> that run on Cassandra. This reduces the number of moving parts you need to
> provision.
>

Can I use Brisk with Apache Cassandra without changing Brisk's or
Cassandra's code? To the best of my knowledge, DSE uses Brisk, so I am
afraid of writing Hadoop jobs now and having to change them when I hire
DSE support.

I am not an expert in the Apache 2.0 license, but in my understanding
DataStax modified Apache Cassandra and included those modifications in the
version they sell. While I am interested in hiring their
support, I want to keep compatibility with the open source version
distributed in the mainstream, just in case I want to stop hiring their
support at any time.


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by aaron morton <aa...@thelastpickle.com>.

> Actually, if I use community edition for now, I wouldn't be able to use hadoop against data stored in CFS? 
AFAIK DSC is a packaged deployment of Apache Cassandra. You should be able to use Hadoop against it, in the same way you can use Hadoop against Apache Cassandra.

You "can do" anything with computers if you have enough time and patience. DSE reduces the amount of time and patience needed to run Hadoop over Cassandra. Specifically it helps by providing a HDFS and Hive Meta Store that run on Cassandra. This reduces the number of moving parts you need to provision. 

> Would writes on HDFS be so quick as in Cassandra?
Yes and no. 
HDFS uses a big block size, so while it may absorb writes quickly you may not be able to read them immediately.
Remember you may need a HDFS layer for intermediate results. 
 
> would I have advantages in using Cassandra instead of HBase?

Cassandra provides no single point of failure, great scalability, tuneable consistency, a flexible data model and very easy single package deployment. My HBase knowledge is limited, but I would check those points and go with whatever you feel comfortable with. 
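
As an illustration of what "tuneable consistency" means in practice, here is a minimal sketch in Python of the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names are mine and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas in a quorum: a strict majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

RF = 3  # assumed replication factor, for illustration only
q = quorum(RF)

strong = reads_see_latest_write(q, q, RF)   # quorum reads with quorum writes
eventual = reads_see_latest_write(1, 1, RF) # one replica each: faster, may be stale
```

Choosing lower levels trades that guarantee for latency and availability, which is the tuning knob referred to above.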

> If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?
It's reasonable to use cassandra for structured data. After a few iterations of development you may find that the current structure is not the best for a non-RDBMS. e.g. It's often easier to work with larger entities that violate Normal Form requirements.

There are lots of advantages to using Cassandra, just as there are benefits to using an RDBMS rather than custom flat files. If you feel your project will benefit from those advantages, and/or you are technically curious, I would recommend trying Cassandra.

Choose a small part of your product and create a Proof of Concept; it should only take a week or so. Make as many mistakes as you can as fast as you can and have fun.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 19/09/2012, at 1:51 AM, Marcelo Elias Del Valle <mv...@gmail.com> wrote:

> Aaron,
> 
>     Thank you very much for the answers! Helped me a lot!
>     I would like just a bit more clarification about the points below, if you allow me:
> 
> You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.
> Actually, if I use community edition for now, I wouldn't be able to use hadoop against data stored in CFS? We are considering the enterprise edition here, but the best scenario would be using it just when really needed. Would writes on HDFS be so quick as in Cassandra?
> 
> It depends on how many moving parts you are comfortable with. Same for the questions about HDFS etc. Start with the smallest amount of infrastructure.
> Sorry, I didn't really understand this part. I am not sure what you wanted to say, but the question was about using nosql instead a relational database in this case. If learning nosql is not a problem, would I have advantages in using Cassandra instead of HBase? If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?
> 
> 
> Thanks,
> Marcelo.
> 
> 2012/9/18 aaron morton <aa...@thelastpickle.com>
>> Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, whether I am doing something wrong or I shouldn't be using Cassandra. 
> I do not agree with that statement. (I read that as rows with more than a hundred _columns_)
> 
>> I need to support a high volume of writes per second. I might have a billion writes per hour
> That's about 280K/sec. Netflix did a benchmark that shows 1.1M/sec: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
> 
>> I need to write non-structured data that will be processed later by hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be consulted by my end user application. Is Cassandra recommended for that, or should I be thinking in writting directly to HDFS files, for instance? What's the main advantage I get from storing data in a nosql service like Cassandra, when compared to storing files into HDFS?
> You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.
> 
>> If I don't need to perform complicated queries in Cassandra, should I store the json-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the json file and some more 5 or 6 fields to query the files later.
> Store the data in the way that best supports the read queries you want to make. If you always read all the fields, or it's a canonical record of events storing as JSON may be best. If you often get a few fields, and maybe they are updated, storing each field as a column value may be best. 
> 
>> Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?
> It depends on how many moving parts you are comfortable with. Same for the questions about HDFS etc. Start with the smallest amount of infrastructure.
> 
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 18/09/2012, at 10:28 AM, Marcelo Elias Del Valle <mv...@gmail.com> wrote:
> 
>> Hello,
>> 
>>      I am new to Cassandra and I am in doubt if Cassandra is the right technology to use in the architecture I am defining. Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, whether I am doing something wrong or I shouldn't be using Cassandra. Therefore, it might be the case I am doing something wrong. If you could help me to find out the answer for these questions by giving any feedback, it would be highly appreciated. 
>>      Here is my need and what I am thinking in using Cassandra for:
>> I need to support a high volume of writes per second. I might have a billion writes per hour
>> I need to write non-structured data that will be processed later by hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be consulted by my end user application. Is Cassandra recommended for that, or should I be thinking in writting directly to HDFS files, for instance? What's the main advantage I get from storing data in a nosql service like Cassandra, when compared to storing files into HDFS?
>> Usually I will write json data associated to an ID and my hadoop processes will process this data to write data to a database. I have two doubts here:
>> If I don't need to perform complicated queries in Cassandra, should I store the json-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the json file and some more 5 or 6 fields to query the files later.
>> Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?
>>      I am sorry if the questions are too dummy, I have been watching a lot of videos and reading a lot of documentation about Cassandra, but honestly, more I read more I have questions. 
>> 
>> Thanks in advance.
>> 
>> Best regards,
>> -- 
>> Marcelo Elias Del Valle
>> http://mvalle.com - @mvallebr
> 
> 
> 
> 
> -- 
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by "Hiller, Dean" <De...@nrel.gov>.

Until Aaron replies, here are my thoughts on the relational piece…

If everything in my model fits into a relational database, if my data is structured, would it still be a good idea to use Cassandra? Why?

The playOrm project explores exactly this issue. A query on 1,000,000 rows in a single partition took only 60 ms, AND you can do joins with its S-SQL language. The answer is a resounding YES, you can put relational data in Cassandra. The writes are way faster than a DBMS, and joins and SQL can be just as fast, and in many cases FASTER, on noSQL IF you partition your data properly. An S-SQL statement in playOrm looks like this:

PARTITIONS t(:partitionId) SELECT t FROM Trades as t where t.numShares > 10

You can have as many partitions as you want, and a single partition can hold millions of rows, though I would probably not exceed 10 million.

Later,
Dean


Re: Is Cassandra right for me?

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.

Aaron,

    Thank you very much for the answers! They helped me a lot!
    I would just like a bit more clarification on the points below, if I
may:


   - You can query your data using Hadoop easily enough. You may want to
   take a look at DSE from http://datastax.com/; it makes using Hadoop and
   Solr with Cassandra easier.

Actually, if I use the community edition for now, would I be unable to run
Hadoop against data stored in CFS? We are considering the enterprise
edition here, but ideally we would use it only when really needed. Would
writes to HDFS be as fast as writes to Cassandra?


   - It depends on how many moving parts you are comfortable with. The same
   goes for the questions about HDFS etc. Start with the smallest amount of
   infrastructure.

Sorry, I didn't really understand this part. I am not sure what you meant,
but the question was about using NoSQL instead of a relational database in
this case. If learning NoSQL is not a problem, would I gain anything by
using Cassandra instead of HBase? If everything in my model
fits into a relational database, if my data is structured, would it still
be a good idea to use Cassandra? Why?


Thanks,
Marcelo.

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by aaron morton <aa...@thelastpickle.com>.

> Also, I saw a presentation which said that if I don't have rows with more than a hundred rows in Cassandra, whether I am doing something wrong or I shouldn't be using Cassandra. 
I do not agree with that statement. (I read that as rows with more than a hundred _columns_.)

> I need to support a high volume of writes per second. I might have a billion writes per hour
That's about 280K/sec. Netflix did a benchmark that shows 1.1M/sec: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
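
As a rough check of that arithmetic, here is a minimal Python sketch. The constant names are made up for illustration, the 1.1M/sec figure is simply the one quoted from the Netflix post above, and the comparison says nothing about the cluster size needed to reach either rate.

```python
# Convert the stated hourly write volume into a per-second rate.
WRITES_PER_HOUR = 1_000_000_000   # "a billion writes per hour" from the question
SECONDS_PER_HOUR = 60 * 60

writes_per_second = WRITES_PER_HOUR / SECONDS_PER_HOUR

# Figure quoted from the Netflix benchmark post linked above.
NETFLIX_BENCHMARK_PER_SECOND = 1_100_000

# How much of the benchmark rate the stated workload would use.
fraction_of_benchmark = writes_per_second / NETFLIX_BENCHMARK_PER_SECOND

print(f"{writes_per_second:,.0f} writes/sec")                  # 277,778 writes/sec
print(f"{fraction_of_benchmark:.0%} of the benchmark rate")    # 25% of the benchmark rate
```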

> I need to write non-structured data that will be processed later by hadoop processes to generate structured data from it. Later, I index the structured data using SOLR or SOLANDRA, so the data can be consulted by my end user application. Is Cassandra recommended for that, or should I be thinking in writting directly to HDFS files, for instance? What's the main advantage I get from storing data in a nosql service like Cassandra, when compared to storing files into HDFS?
You can query your data using Hadoop easily enough. You may want to take a look at DSE from http://datastax.com/; it makes using Hadoop and Solr with Cassandra easier.

> If I don't need to perform complicated queries in Cassandra, should I store the json-like data just as a column value? I am afraid of doing something wrong here, as I would need just to store the json file and some more 5 or 6 fields to query the files later.
Store the data in the way that best supports the read queries you want to make. If you always read all the fields, or it's a canonical record of events, storing it as JSON may be best. If you often read only a few fields, and they may be updated, storing each field as a column value may be best.
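
To make that trade-off concrete, here is a small Python sketch that models the two layouts with plain dictionaries. It does not use any Cassandra client; the column names ("body", "user", "type", "ts") and the sample values are assumptions invented for the example.

```python
import json

# Layout A: the whole event stored as one JSON string in a single column.
# The column name "body" is hypothetical.
blob_row = {"body": json.dumps({"user": "u42", "type": "click", "ts": 1347920880})}

# Layout B: each field stored as its own column.
column_row = {"user": "u42", "type": "click", "ts": 1347920880}


def read_field_from_blob(row, field):
    """Reading one field means fetching and parsing the whole document."""
    return json.loads(row["body"])[field]


def update_field_in_blob(row, field, value):
    """Updating one field is a read-modify-write of the whole document."""
    doc = json.loads(row["body"])
    doc[field] = value
    row["body"] = json.dumps(doc)


def update_field_in_columns(row, field, value):
    """Updating one field writes only that column."""
    row[field] = value


update_field_in_blob(blob_row, "type", "view")
update_field_in_columns(column_row, "type", "view")

print(read_field_from_blob(blob_row, "type"))
print(column_row["type"])
```

If the application always reads and writes the whole record, layout A keeps things simple; if it often touches single fields, layout B avoids the read-modify-write step.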

> Does it make sense to you to use hadoop to process data from Cassandra and store the results in a database, like HBase? Once I have structured data, is there any reason I should use Cassandra instead of HBase?
It depends on how many moving parts you are comfortable with. The same goes for the questions about HDFS etc. Start with the smallest amount of infrastructure.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

Re: Is Cassandra right for me?

Posted by Marcelo Elias Del Valle <mv...@gmail.com>.

I will have just 6 columns in my CF, but about a billion writes per hour.
From what you are saying, I think Cassandra fits this case.
This answer helped a lot too, thanks!

-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr

Re: Is Cassandra right for me?

Posted by "Hiller, Dean" <De...@nrel.gov>.

I wanted to clarify where that statement on wide rows comes from.

Realize some people make the claim that if you don't have 1000's of columns in "some" rows in Cassandra you are doing something wrong.  This is not true, BUT it comes from the fact that people are setting up indexes.  This is what leads to the very wide row effect.  playOrm is one such library using wide rows like this, BUT it is NOT necessary for all applications.

You can easily use map/reduce on a Cassandra cluster.  If you make a mistake and don't get the model right the first time, you can also map/reduce your dataset into a new model.  This wide row effect is used for indexing 80% of the time.  I draw on playOrm examples a lot, but one table may be partitioned by time so that each month of data is in a partition; you can then have indexes on each partition, allowing quick queries into partitions.
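
The idea of a per-partition index kept in a wide row can be sketched in a few lines of Python. This is only an in-memory stand-in under stated assumptions, not playOrm's API or Cassandra's storage format: the function names, the "numShares" column, and the month-based partition id are all hypothetical.

```python
from bisect import bisect_right

rows = {}    # row key -> row data
index = {}   # (partition, column) -> {"values": [...], "keys": [...]}, sorted by value


def month_partition(date):
    """Derive the partition id from an ISO date string, e.g. '2012-09'."""
    return date[:7]


def insert_trade(key, date, num_shares):
    """Store the row and add an entry to its partition's index row."""
    rows[key] = {"date": date, "numShares": num_shares}
    idx = index.setdefault((month_partition(date), "numShares"),
                           {"values": [], "keys": []})
    # Keep the index sorted, much like columns sorted by name in a wide row.
    pos = bisect_right(idx["values"], num_shares)
    idx["values"].insert(pos, num_shares)
    idx["keys"].insert(pos, key)


def trades_with_more_shares(partition, threshold):
    """Return row keys in one partition where numShares > threshold."""
    idx = index.get((partition, "numShares"))
    if idx is None:
        return []
    # Binary search past every value <= threshold, then slice the rest.
    start = bisect_right(idx["values"], threshold)
    return idx["keys"][start:]


insert_trade("t1", "2012-08-30", 50)
insert_trade("t2", "2012-09-03", 5)
insert_trade("t3", "2012-09-10", 25)
insert_trade("t4", "2012-09-17", 10)

print(trades_with_more_shares("2012-09", 10))   # ['t3']
```

Each index row grows with the number of rows in its partition, which is where the "very wide row" comes from; the query touches only the index of the partition it names.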

Later,
Dean
