Posted to user@cassandra.apache.org by Ralph Soika <ra...@imixs.com> on 2018/06/10 09:54:26 UTC

Size of a single Data Row?

Hi,
I have a general question concerning the Cassandra technology. I have 
already read two books, but I am more and more confused about whether 
Cassandra is the right technology. My goal is to store business data 
from a workflow engine in Cassandra. I want to use Cassandra as a kind 
of archive service because of its fault-tolerant and decentralized 
approach.

But two things confuse me. On the one hand, the project claims that a 
single column value can be 2 GB (1 MB is recommended). On the other 
hand, people explain that a partition should not be larger than 100 MB.

I plan only one single simple table:

     CREATE TABLE documents (
        created text,
        id text,
        data text,
        PRIMARY KEY (created,id)
     );

'created' is the partition key holding the date in ISO format 
(YYYY-MM-DD). The 'id' is a clustering key and is unique.

But my 'data' column holds an XML document with business data. This 
cell contains largely unstructured data and also media data. The data 
cell will usually be between 1 and 10 MB, BUT in some cases it can also 
hold more than 100 MB (and up to 2 GB).

Is Cassandra able to handle this kind of table? Or is Cassandra, in the 
end, not recommended for this kind of data?

For example, I would like to ask whether data for a specific date is 
available:

     SELECT created,id FROM documents WHERE created = '2018-06-10';

I select without the data column and just ask whether data exists. Is 
the performance automatically poor only because the data cell (not a 
primary key column) of some rows is greater than 100 MB? Or does 
Cassandra run out of heap space in any case? It is perfectly clear that 
it makes no sense to select multiple cells, each containing over 100 MB 
of data, in one single query. But this is a fundamental problem and has 
nothing to do with Cassandra. My Java application running in Wildfly 
would also not be able to handle a result with multiple GB of data. But 
I would expect that I can select a set of keys just to decide whether 
to load one single data cell.
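
The access pattern I have in mind would look roughly like this (only a 
sketch, assuming the DataStax Java driver; the class and method names 
are just illustrative):

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class ArchiveQuery {
        // List only the keys for one day; the big 'data' column is never touched.
        public static void listKeys(Session session, String created) {
            Statement stmt = new SimpleStatement(
                    "SELECT created, id FROM documents WHERE created = ?", created);
            stmt.setFetchSize(100); // page through results to keep the heap small
            for (Row row : session.execute(stmt)) {
                System.out.println(row.getString("id"));
            }
        }

        // Load one single data cell only when it is really needed.
        public static String loadData(Session session, String created, String id) {
            Row row = session.execute(
                    "SELECT data FROM documents WHERE created = ? AND id = ?",
                    created, id).one();
            return row == null ? null : row.getString("data");
        }
    }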

Cassandra seems like a great system. But many people seem to claim that 
it is only suitable for mapping a user status list à la Facebook? Is 
this true? Thanks for your comments in advance.




===
Ralph


Re: Size of a single Data Row?

Posted by Ralph Soika <ra...@imixs.com>.
Hi Jeff,

thanks for that answer. I understand the problem much better now. As you 
explain, the problem also exists in the JVM, and so also in the 'other' 
part of my application, which runs on JavaEE/JPA. In the end, the 100 MB 
byte arrays cause a heap-space problem there as well. So Cassandra is 
not the core problem in my consideration.

Your solution of splitting the blob into chunks is good, but I do not 
need it, because indeed the same problem exists on the Wildfly 
application server side. It was my mistake to say that I have no 
heap-size problem with files over 100 MB.

So I will simply create two tables:

     CREATE TABLE documents_meta (
        created text,
        document_id text,
        hash text,
        PRIMARY KEY (created,document_id)
     );

     CREATE TABLE documents_data (
        document_id text,
        data blob,
        PRIMARY KEY (document_id)
     );


The table 'documents_meta' is there to verify the data consistency of 
files stored in the JavaEE part. As I explained, Cassandra plays the 
role of a highly available backup cluster.
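
The consistency check against 'documents_meta' could look roughly like 
this (again only a sketch, assuming the DataStax Java driver and a 
SHA-256 hash; the class and method names are made up):

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    import java.security.MessageDigest;

    public class ArchiveVerifier {
        public static boolean verifyDocument(Session session, String created,
                                             String documentId, byte[] localFile)
                throws Exception {
            Row row = session.execute(
                    "SELECT hash FROM documents_meta WHERE created = ? AND document_id = ?",
                    created, documentId).one();
            if (row == null) {
                return false; // no archive entry for this document
            }
            // compare the stored hash with a freshly computed hash of the local file
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            String localHash = toHex(sha256.digest(localFile));
            return localHash.equals(row.getString("hash"));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }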

What I was not aware of is the "problem" with the partition size. Can 
you give me a link where I can read about the CQL partition issue? In 
the book "Cassandra: The Definitive Guide" I did not find this.

best regards

===
Ralph



Re: Size of a single Data Row?

Posted by Jeff Jirsa <jj...@gmail.com>.
Let's talk about what the real limitations are. There are two here that you
should care about:

1) Cassandra runs in the JVM. When you read and write to Cassandra, those
objects end up in the heap as byte arrays. If you're regularly reading and
writing 100MB byte arrays, it's easy to see situations where you'll have
some latency pains, especially if you have a lot of concurrent requests.
2) On the read path, we build up an index of CQL rows within a CQL
partition. You've been reading books, I suspect you know the difference (if
not, ask, and I'll re-explain). In all versions of Cassandra released so
far, the cost of that index scales with the width of the partition and is
paid ON READ (not on write like other databases). If you have a very wide
CQL partition and you query it quickly, you will create JVM GC pressure. It
sounds like this is a secondary concern here.

That doesn't mean it's not a good fit. There are workarounds to both of
these issues.


For example:
- On the write path, running with offheap memtables will get the cell value
into direct memory for the period of time between when it's written in the
commitlog and when it's flushed to disk. This is likely important for you.
- Instead of writing the 100MB document in a single cell, chunk it into 1MB
chunks

     CREATE TABLE documents (
        document_id text,
        chunk_order int,
        chunk_id text,
        PRIMARY KEY (document_id, chunk_order)
     );

     CREATE TABLE chunks (
        chunk_id text,
        chunk blob,
        PRIMARY KEY (chunk_id)
     );

Then when you go to write the document, you break it into 1MB blobs, 
take the hash (md5, sha1, sha256, whatever suits your needs based on 
pain of collisions), write each chunk into the chunks table, and the 
chunk_id into the documents table for the document (in the right 
order).
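
A minimal sketch of that write path (assuming the DataStax Java driver; 
the class and helper names are illustrative, not a fixed API):

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class ChunkedWriter {
        private static final int CHUNK_SIZE = 1024 * 1024; // 1MB, as suggested above

        public static void writeDocument(Session session, String documentId,
                                         byte[] document) throws Exception {
            PreparedStatement insertChunk = session.prepare(
                    "INSERT INTO chunks (chunk_id, chunk) VALUES (?, ?)");
            PreparedStatement insertRef = session.prepare(
                    "INSERT INTO documents (document_id, chunk_order, chunk_id) VALUES (?, ?, ?)");
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");

            for (int order = 0, offset = 0; offset < document.length;
                 order++, offset += CHUNK_SIZE) {
                byte[] chunk = Arrays.copyOfRange(document, offset,
                        Math.min(offset + CHUNK_SIZE, document.length));
                // the content hash is the chunk_id, which is what gives you dedup
                String chunkId = toHex(sha256.digest(chunk));
                session.execute(insertChunk.bind(chunkId, ByteBuffer.wrap(chunk)));
                session.execute(insertRef.bind(documentId, order, chunkId));
            }
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }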

This does a few things:
1) You can reassemble the document chunk by chunk by querying it in 
pieces (see the read-path sketch below). Each piece is small enough not 
to overwhelm the garbage collector (and you control that with paging).
2) The only partition here that can get large is document_id, and it'd 
be incredibly unlikely that you'll get 100MB per partition here based on 
your description, so you don't have to worry about the index pain on the 
read path.
3) You naturally dedup chunks, which you didn't ask for, but may care 
about.
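
And the read path, again only a sketch with illustrative names:

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    public class ChunkedReader {
        public static byte[] readDocument(Session session, String documentId)
                throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // chunk_order is the clustering key, so the refs come back in order
            for (Row ref : session.execute(
                    "SELECT chunk_id FROM documents WHERE document_id = ?", documentId)) {
                Row chunk = session.execute(
                        "SELECT chunk FROM chunks WHERE chunk_id = ?",
                        ref.getString("chunk_id")).one();
                ByteBuffer buf = chunk.getBytes("chunk");
                byte[] bytes = new byte[buf.remaining()];
                buf.get(bytes);
                out.write(bytes);
            }
            return out.toByteArray();
        }
    }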

Hope that helps,
- Jeff



Re: Size of a single Data Row?

Posted by Ralph Soika <ra...@imixs.com>.
Thanks for your answer. OK - I think I understand your points and the 
worries you have about my architecture.
To give some more background information: we are working on the Open 
Source project Imixs-Workflow <http://www.imixs.org>. This is a 
human-centric workflow engine based on Java EE. The engine runs on 
JPA/SQL databases. This is to have full transactional support. We also 
use Lucene search technology to find records in a very unstructured 
amount of business data. Everything runs stably and fast (for example 
with PostgreSQL), even if we have records containing 100 MB of 
attachments.

But we also need a stable archive strategy. Normal backups are not 
really an option, because databases grow over the years, so we are 
seeking a big-table solution. Cassandra seems much stronger in this 
area than traditional SQL solutions. And it seems to be easy to set up 
a cluster of 3 nodes. It is not easy to build the same with Hadoop.

Our Cassandra approach is not for live data access. It is for an 
asynchronous archive service with the goal of highly consistent, 
decentralized storage. And this is why I am not worried about 
performance. Only in case of a restore or a big-data analysis do we 
read data from Cassandra.

I can't change the fact that I have business transactions that contain 
files with more than 100 MB of data.
Do you really think Cassandra performs worse than PostgreSQL when 
writing/reading a 200 MB media file? In my first tests it did not. My 
concern is that some Internet discussions give the impression that 
Cassandra is worse than a traditional SQL solution. I thought Cassandra 
is basically a big-data solution??
If Cassandra is not suitable for storing records larger than 100 MB, I 
ask whether the only alternative would be HBase?

To put it more clearly: it is always a challenge to handle a record 
with more than 100 MB. But the question is: does Cassandra break on 
this kind of task?

So if we exclude the performance issue for a moment, would you agree 
with the solution or advise against it?

Thanks again for your help


===
Ralph




-- 

*Imixs Software Solutions GmbH*
*Web:* www.imixs.com *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Commercial register: Amtsgericht Muenchen, HRB 136045
Managing directors: Gaby Heinle and Ralph Soika

*Imixs* is an open source company, read more: www.imixs.org


Re: Size of a single Data Row?

Posted by daemeon reiydelle <da...@gmail.com>.
I'd like to split your question into two parts.

Part one is around recovery. If you lose a copy of the underlying data 
because a node fails, and let's assume you have three copies, how long 
can you tolerate the time to restore the third copy?

The second question is about the absolute length of a row. This 
question is more about the time to read a row: a single super-long row 
can only be read from one node, while a row split into multiple shorter 
rows can in most cases be read in parallel.

The sizes you're looking at are not in themselves an issue; it's more 
how you want to access and use the data.

I might argue that you might not want to use Cassandra if this is your 
only use case for it. I might suggest you look at something like ELK; 
whether or not you end up using Elasticsearch or Cassandra, it might 
get you thinking about your architecture for this particular business 
case. But of course, if you have multiple use cases, some with shorter 
columns and others not, then overall Cassandra would be an excellent 
choice.

But as is often the case, and I do hope I'm being helpful in this 
response, your overall family of business processes can drive 
compromises in one business process to facilitate a single storage 
solution and simplified administration.


Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198


Re: Size of a single Data Row?

Posted by Ralph Soika <ra...@imixs.com>.
Hi Eevee,

thanks for your response. Low latency is not an issue, because I read 
only in rare cases and also write only in rare cases. But for me it is 
important to have high data consistency over a decentralized cluster, 
and Cassandra fits that perfectly. Hadoop is much more complex to set 
up compared to Cassandra.

Extracting the XML is not an option, because it is a mostly 
unstructured set of field/value pairs.

But I still stumble over the sense of a clustering key. What if I shift 
the date column into a second table?

     CREATE TABLE documents (
        id text,
        data text,
        PRIMARY KEY (id)
     );

     CREATE TABLE documents_created (
        created text,
        id text,
        PRIMARY KEY (created,id)
     );

So my 'big table' holds only the uniqueID as the primary key. Is this 
table design more performant? I am trying to keep things simple.
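
The read pattern I imagine would then be a two-step lookup, roughly 
like this (again just a sketch assuming the DataStax Java driver; the 
names are illustrative):

    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class DocumentLookup {
        // Step 1: list the document ids for one day from the small lookup table.
        public static void printIdsForDay(Session session, String created) {
            for (Row row : session.execute(
                    "SELECT id FROM documents_created WHERE created = ?", created)) {
                System.out.println(row.getString("id"));
            }
        }

        // Step 2: load one single large document only when it is really needed.
        public static String loadDocument(Session session, String id) {
            Row row = session.execute(
                    "SELECT data FROM documents WHERE id = ?", id).one();
            return row == null ? null : row.getString("data");
        }
    }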


Best regards

Ralph








Re: Size of a single Data Row?

Posted by Evelyn Smith <u5...@gmail.com>.
Hi Ralph,

Yes, having partitions of 100MB will seriously hit your performance. But usually the issue here is for people handling large numbers of transactions and aiming for low latency. My understanding is that 2GB is the max for a column value; beyond that the system would start to fail, but well before that you are going to see a significant performance hit (for most use cases).

I think an important question for you is: are you going to be reading these files from Cassandra regularly? It sounds like something S3 or Hadoop might be more appropriate for.

The other option is, if your XML files have some structure, you could extract the data from them and store it that way.

One final point: I'm pretty sure a TEXT type won't hold a 10MB file, let alone a 1GB file; I think the max size is like 64K characters.

Regards,
Eevee.
