Posted to user@cassandra.apache.org by Gil Ganz <gi...@gmail.com> on 2023/04/11 06:13:49 UTC

Optimization for partitions with high number of rows

Hey
I have a 4.0.4 cluster where reads of partitions that are a bit on the
bigger side are taking longer than I would expect.
Reading an entire partition that has ~70000 rows, with a total partition
size of 4MB, takes 120ms; I would expect it to take less.

This is after a major compaction, so there is only one sstable. LOCAL_ONE
consistency level, no tombstones, and the entire partition is read in one
fetch.
The cluster is not doing much else at the time, and the data is on an NVMe
disk. I can see most of the time is spent on getting the data from the
sstable.

Is there any specific optimization one can do to speed up cases like this?
I would expect fetching 4MB to take less time; I assume that if this were
one 4MB blob, that would be the case.

Table definition:

CREATE TABLE ks1.item_to_item (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    y1 bigint,
    y2 bigint,
    y3 bigint,
    y4 bigint,
    y5 bigint,
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp,
    PRIMARY KEY ((x1, x2, x3, x4), y1, y2, y3, y4, y5)
) WITH CLUSTERING ORDER BY (y1 ASC, y2 ASC, y3 ASC, y4 ASC, y5 ASC)



cqlsh> select count(0) from  ks1.item_to_item where x1=4 and x2=70000 and
x4=0 and x3=1;

 count
-------
 70000

(1 rows)

Tracing session: 6356d290-d785-11ed-aba5-ab86979f2f58

 activity                                                                                 | timestamp                  | source     | source_elapsed | client
-------------------------------------------------------------------------------------------+----------------------------+------------+----------------+-----------
 Execute CQL3 query                                                                        | 2023-04-10 09:52:21.561000 | 172.25.0.4 |              0 | 127.0.0.1
 Parsing XXXX [Native-Transport-Requests-1]                                                | 2023-04-10 09:52:21.561000 | 172.25.0.4 |            428 | 127.0.0.1
 Preparing statement [Native-Transport-Requests-1]                                         | 2023-04-10 09:52:21.562000 | 172.25.0.4 |            973 | 127.0.0.1
 Acquiring sstable references [ReadStage-2]                                                | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2255 | 127.0.0.1
 Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2]   | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2524 | 127.0.0.1
 Key cache hit for sstable 9 [ReadStage-2]                                                 | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2692 | 127.0.0.1
 Merged data from memtables and 1 sstables [ReadStage-2]                                   | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90405 | 127.0.0.1
 Read 70000 live rows and 0 tombstone cells [ReadStage-2]                                  | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90726 | 127.0.0.1
 Request complete                                                                          | 2023-04-10 09:52:21.682603 | 172.25.0.4 |         121603 | 127.0.0.1


gil

Re: Optimization for partitions with high number of rows

Posted by Gil Ganz <gi...@gmail.com>.
Thank you, I will look into that option.

On Mon, Apr 17, 2023 at 3:29 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Using a frozen UDT for all the non-key columns is a good starting point.
> You can go a step further and use frozen UDTs for the partition keys and
> clustering keys too if appropriate. This alone will dramatically reduce the
> number of cells per row from 13 to 3, and save 77% of the deserialisation
> work for Cassandra.
>
> If the performance is still suboptimal after you've done the above, you
> should consider creating a batch process that reads the smaller rows from
> this table, combines them into bigger rows, and stores the new rows in
> another table which has the same partition key, where each row is a frozen
> list that contains many original rows. If you combine all rows from each
> partition of the old table into a single row in the new table, the read
> speed should be much faster. Keep in mind that this may not work if the
> partition size of the original table is too large (approximately >16MB),
> as the mutation size is limited to half of the commitlog segment size.

Re: Optimization for partitions with high number of rows

Posted by Bowen Song via user <us...@cassandra.apache.org>.
Using a frozen UDT for all the non-key columns is a good starting point.
You can go a step further and use frozen UDTs for the partition keys and
clustering keys too if appropriate. This alone will dramatically reduce
the number of cells per row from 13 to 3, and save 77% of the
deserialisation work for Cassandra.
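
Roughly, and only as a sketch (the type, table, and column names below are
made up for illustration, not taken from your schema requirements), the
non-key part could look like this:

CREATE TYPE ks1.item_metadata (
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp
);

CREATE TABLE ks1.item_to_item_v2 (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    y1 bigint,
    y2 bigint,
    y3 bigint,
    y4 bigint,
    y5 bigint,
    data frozen<item_metadata>,  -- all non-key columns stored as a single cell
    PRIMARY KEY ((x1, x2, x3, x4), y1, y2, y3, y4, y5)
) WITH CLUSTERING ORDER BY (y1 ASC, y2 ASC, y3 ASC, y4 ASC, y5 ASC);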

If the performance is still suboptimal after you've done the above, you
should consider creating a batch process that reads the smaller rows
from this table, combines them into bigger rows, and stores the new
rows in another table which has the same partition key, where each row
is a frozen list that contains many original rows. If you combine all
rows from each partition of the old table into a single row in the new
table, the read speed should be much faster. Keep in mind that this may
not work if the partition size of the original table is too large
(approximately >16MB), as the mutation size is limited to half of the
commitlog segment size.
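
A rough sketch of what that second table might look like (again, the type
and table names are illustrative only):

CREATE TYPE ks1.item_row (
    y1 bigint,
    y2 bigint,
    y3 bigint,
    y4 bigint,
    y5 bigint,
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp
);

CREATE TABLE ks1.item_to_item_combined (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    items frozen<list<frozen<item_row>>>,  -- all rows of one original partition in one cell
    PRIMARY KEY ((x1, x2, x3, x4))
);

The whole partition is then read back as a single cell, at the cost of
having to rewrite the entire list whenever the partition changes.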


On 12/04/2023 06:14, Gil Ganz wrote:
> Is there something I can do to speed up the deserialisation?
> In this example I did a count query, but in reality I need the actual
> data.
> The write pattern in this table is such that all the data for a given row
> is written at the same time, so I know I could use a frozen UDT instead,
> making it faster, but I wonder if there is another way.

Re: Optimization for partitions with high number of rows

Posted by Gil Ganz <gi...@gmail.com>.
Is there something I can do to speed up the deserialisation?
In this example I did a count query, but in reality I need the actual data.
The write pattern in this table is such that all the data for a given row is
written at the same time, so I know I could use a frozen UDT instead, making
it faster, but I wonder if there is another way.

On Tue, Apr 11, 2023 at 9:06 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Reading 4MB from 70k rows and 13 columns (0.91 million cells) from disk in
> 120ms doesn't sound bad. That's a lot of deserialisation to do. If you
> want it to be faster, you can store the number of rows elsewhere if that's
> the only thing you need.

Re: Optimization for partitions with high number of rows

Posted by Bowen Song via user <us...@cassandra.apache.org>.
Reading 4MB from 70k rows and 13 columns (0.91 million cells) from disk 
in 120ms doesn't sound bad. That's a lot of deserialisation to do. If
you want it to be faster, you can store the number of rows elsewhere if 
that's the only thing you need.
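
For example, a small counter table keyed by the same partition key could
hold a pre-computed row count (the table and column names here are only a
sketch, not something you must use):

CREATE TABLE ks1.item_to_item_count (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    row_count counter,  -- incremented once per row written to the main table
    PRIMARY KEY ((x1, x2, x3, x4))
);

-- executed alongside every insert into ks1.item_to_item:
UPDATE ks1.item_to_item_count
SET row_count = row_count + 1
WHERE x1 = 4 AND x2 = 70000 AND x3 = 1 AND x4 = 0;

Reading the count then touches a single small cell instead of scanning the
whole wide partition.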
