Posted to user@cassandra.apache.org by Gianluca Borello <gi...@sysdig.com> on 2016/02/14 23:22:20 UTC

Performance issues with "many" CQL columns

Hi

I've just painfully discovered a "little" detail in Cassandra: Cassandra
touches all columns on a CQL select (related issues
https://issues.apache.org/jira/browse/CASSANDRA-6586,
https://issues.apache.org/jira/browse/CASSANDRA-6588,
https://issues.apache.org/jira/browse/CASSANDRA-7085).

My data model is fairly simple: I have a bunch of "sensors", each periodically
reporting a blob of data (~10-100KB). When reading, 99% of the time I'm
interested in a subportion of that blob across an arbitrary period of time.
What I do is simply split those blobs into about 30 logical units and write
them to a CQL table such as:

create table data (
id bigint,
ts bigint,
column1 blob,
column2 blob,
column3 blob,
...
column29 blob,
column30 blob,
primary key (id, ts)
)

id is a combination of the sensor id and a time bucket, so that rows don't get
too wide. Essentially, I thought this was a perfectly legitimate data model
that keeps my application code very simple: I can work on a single table, I
can write a split sensor blob with a single CQL query, and I can read a subset
of the columns very efficiently with a single CQL query.
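
For example, a write of one split blob and a read of a subset of its columns
look like this (a minimal sketch; the literal values are invented for
illustration):

insert into data (id, ts, column1, column2, column30)
values (42, 1455490000, 0xab, 0xcd, 0xef)

select column1, column7 from data
where id = 42 and ts > 1455400000 and ts < 1455500000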

What I didn't realize is that Cassandra always seems to process all the
columns of the CQL row, even when my query asks for just one column, and this
has a dramatic effect on the performance of my reads.

I wrote a simple isolated test case that measures how long it takes to read
one *single* column from a CQL table composed of a growing number of columns
(at each iteration I add and populate 10 new columns), each filled with 1MB
blobs:

10 columns: 209 ms
20 columns: 339 ms
30 columns: 510 ms
40 columns: 670 ms
50 columns: 884 ms
60 columns: 1056 ms
70 columns: 1527 ms
80 columns: 1503 ms
90 columns: 1600 ms
100 columns: 1792 ms
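
Each iteration boils down to CQL along these lines (a sketch with assumed
names; the actual test adds ten columns per step, populates them, and then
times only the single-column SELECT):

alter table data add column31 blob
-- ...likewise for column32 through column40, then populate them...
select column1 from data where id = 42 and ts > 0 and ts < 9999999999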

In other words, even though the result set returned is exactly the same across
all these iterations, the response time increases linearly with the size of
the other columns, and this is really causing a lot of problems in my
application.

Reading the JIRA issues, it seems this is considered a very minor optimization
not worth the effort of fixing, so I'm asking: is my use case really so
anomalous that the horrible performance I'm experiencing is to be considered
"expected" and needs to be fixed with some painful column family splitting and
messy application code?

Thanks

Re: Performance issues with "many" CQL columns

Posted by Gianluca Borello <gi...@sysdig.com>.
Considering the (simplified) table that I wrote before:

create table data (
id bigint,
ts bigint,
column1 blob,
column2 blob,
column3 blob,
...
column29 blob,
column30 blob,
primary key (id, ts)
)

A user request (which varies every time) translates into a set of queries
asking for a subset of the columns (< 10) for a specific set of sensors
(< 100) over a specific time range (< 300):

SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_1 and ts > x and ts < y
SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_2 and ts > x and ts < y
...
SELECT column1, column7, column20, column25 FROM data where id = SENSOR_ID_N and ts > x and ts < y

To answer your question: each non-EQ predicate on the timestamp returns a few
hundred rows (it's essentially a time series).

If I put the column number in the clustering key along with the timestamp,
I'll have to further increase the number of queries and make my code more
complicated:

SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 1
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 7
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 20
SELECT value FROM data where id = SENSOR_ID_1 and ts > x and ts < y and column_number = 25
...

Again, not too terrible, and I'll definitely have to do something similar
because the performance penalty I'm paying now is very significant, but by all
means this seems to me a complication of the data model (and of my
application).

Thanks again

Re: Performance issues with "many" CQL columns

Posted by Jack Krupansky <ja...@gmail.com>.
What does your query actually look like today?

Is your non-EQ predicate on the timestamp selecting a single row, a few rows,
or many rows (dozens, hundreds, thousands)?


-- Jack Krupansky

Re: Performance issues with "many" CQL columns

Posted by Gianluca Borello <gi...@sysdig.com>.
Thanks again.

One clarification about "reading in a single SELECT": in my point 2, I
mentioned the need to read a variable subset of columns every time, usually
in the range of ~5 out of 30. I can't find a way to do that in a single
SELECT unless I use the IN operator (which I can't, as explained).

Is there any other method you were thinking of, or is your "reading in a
single SELECT" only applicable when I need to read the whole set of columns
(which is never my case, unfortunately)?

Thanks

Re: Performance issues with "many" CQL columns

Posted by Jack Krupansky <ja...@gmail.com>.
You can definitely read all of the columns in a single SELECT. And the N
INSERTs can be batched, and will insert fewer cells into the storage engine
than the previous approach.
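
For instance, with the column number as a clustering key, the per-column
writes might be grouped like this (a sketch with assumed names; every
statement targets the same partition, so this is a single-partition batch):

begin batch
insert into data (id, ts, column_number, value) values (42, 1455490000, 1, 0xab);
insert into data (id, ts, column_number, value) values (42, 1455490000, 2, 0xcd);
insert into data (id, ts, column_number, value) values (42, 1455490000, 30, 0xef);
apply batch;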

-- Jack Krupansky

Re: Performance issues with "many" CQL columns

Posted by Gianluca Borello <gi...@sysdig.com>.
Thank you for your reply.

Your advice is definitely sound, although it still seems suboptimal to me
because:

1) It requires N INSERT queries from the application code (where N is the
number of columns)

2) It requires N SELECT queries from my application code (where N is the
number of columns I need to read at any given time, which is determined at
runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
2, 3, ...)) because I am already using a non-EQ relation on the timestamp
key, and Cassandra restricts me to only one non-EQ relation; see the sketch
below.
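
For reference, this is the shape of the query I'd like to issue but can't (a
sketch: the range on ts already uses the single allowed non-EQ restriction,
so the IN on the following clustering column is rejected):

select value from data
where id = SENSOR_ID_1 and ts > x and ts < y and column_number in (1, 7, 20, 25)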

In summary, I can (and will) adapt my code to use a similar approach despite
everything, but the goal of my message was mainly to understand why the JIRA
issues I linked above are not full of dozens of "+1" comments.

To me this really feels like a terrible performance issue that should be
fixed by default (or, in the very worst case, clearly documented), even after
understanding the motivation for reading all the columns in the CQL row.

Thanks

Re: Performance issues with "many" CQL columns

Posted by Jack Krupansky <ja...@gmail.com>.
You could add the column number as an additional clustering key, and then you
can actually use COMPACT STORAGE for even more efficient storage and access
(assuming there is only a single non-PK data column, the blob value). You can
then read or write an individual column/blob or a slice of them.
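
A minimal sketch of that alternative model, with assumed names mirroring the
original table:

create table data (
id bigint,
ts bigint,
column_number int,
value blob,
primary key (id, ts, column_number)
) with compact storage

-- read one logical column for one timestamp, or a slice of them
select value from data where id = 42 and ts = 1455490000 and column_number = 7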

-- Jack Krupansky