Posted to commits@cassandra.apache.org by "Caleb Rackliffe (Jira)" <ji...@apache.org> on 2021/07/06 20:35:00 UTC
[jira] [Updated] (CASSANDRA-16776) modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built

     [ https://issues.apache.org/jira/browse/CASSANDRA-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Caleb Rackliffe updated CASSANDRA-16776:
----------------------------------------
    Test and Documentation Plan: The correctness of this improvement relies on the existing 2i tests. Evidence that it's actually an improvement is provided by two new tests in {{CompactionAllocationTest}}.
                         Status: Patch Available  (was: In Progress)

[trunk|https://github.com/apache/cassandra/pull/1098]
 [CircleCI J8|https://app.circleci.com/pipelines/github/maedhroz/cassandra/284/workflows/d095f8c2-d17d-4f9f-b5dd-0ab50f98901f]
 [CircleCI J11|https://app.circleci.com/pipelines/github/maedhroz/cassandra/284/workflows/4664cf14-7e0c-4680-8154-0a4fd340770a]

To see how the patch reduces allocations, enable compaction profiling in {{CompactionAllocationTest}} and run the test {{widePartitionsSingleIndexedColumn}}. (This test indexes only one of 4 normal columns.) You should get about a 13% improvement in bytes allocated for index builds.

For example, without the patch:
{noformat}
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:16:01,720 CompactionAllocationTest.java:467 - 463337000 bytes, 13099437 objects, 2145078 /partition, 2145 /row, 0 cpu
{noformat}
...then with the patch...
{noformat}
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:466 - *** widePartitionsSingleIndexedColumn compaction summary
INFO  [main] 2021-07-06 15:11:51,958 CompactionAllocationTest.java:467 - 402830648 bytes, 11802336 objects, 1864956 /partition, 1864 /row, 0 cpu
{noformat}
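
As a quick sanity check on the "about 13%" figure, the relative reduction can be recomputed from the two summary lines above. This is only an illustrative sketch: the class and method names ({{AllocationDelta}}, {{percentReduction}}) are made up for this example and are not part of the patch or of {{CompactionAllocationTest}}.

```java
public class AllocationDelta {
    // Percentage by which `after` is lower than `before`.
    public static double percentReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Values copied from the two compaction summaries logged above.
        long bytesBefore = 463_337_000L;
        long bytesAfter = 402_830_648L;
        long objectsBefore = 13_099_437L;
        long objectsAfter = 11_802_336L;

        System.out.printf("bytes allocated: %.2f%% lower%n",
                          percentReduction(bytesBefore, bytesAfter));
        System.out.printf("objects allocated: %.2f%% lower%n",
                          percentReduction(objectsBefore, objectsAfter));
    }
}
```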

> modify SecondaryIndexManager#indexPartition() to retrieve only columns for which indexes are actually being built
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16776
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16776
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/2i Index
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>         Attachments: index1.png, index2.png
>
>
> Secondary indexes are (for the moment) built as special compaction tasks via {{SecondaryIndexBuilder}}. From a profiling perspective, the fun begins in {{SecondaryIndexManager.indexPartition()}}. The work above it in {{SecondaryIndexBuilder}} is just key iteration.
>  !index1.png! 
> Two basic things happen in {{indexPartition()}}. First, we read a single partition in its entirety, and then we send individual rows to the {{Indexer}}. When we read these partitions, we use {{ColumnFilter.all()}}, which ends up materializing full rows, even when the indexes participating in the build cover only a single column (or at least fewer columns than the row contains). If we narrowed this to fetch only the necessary columns, we might be able to create less garbage in {{AbstractBTreePartition#searchIterator()}} when we create a copy of the underlying full row from disk.
> In some initial testing, I’ve been using a simple schema with fairly narrow rows.
> {noformat}
> CREATE TABLE tlp_stress.allow_filtering (
>     partition_id text,
>     row_id int,
>     payload text,
>     value int,
>     PRIMARY KEY (partition_id, row_id)
> ) WITH CLUSTERING ORDER BY (row_id ASC)
> {noformat}
> The price of deserializing these rows is still visible, however, in the results of some basic sampling profiling.
>  !index2.png! 
> The possible optimization above to avoid unnecessary copying of a row’s columns would also narrow cell deserialization only to indexed cells, which would probably be very beneficial for index builds with very wide rows. One minor wrinkle in all of this is that since 3.0, it has been possible to create indexes on entire rows, rather than single columns, so we’d have to keep that case in mind.
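
The column-narrowing idea described above, including the whole-row wrinkle, can be sketched as a toy model. This is not the Cassandra implementation and does not use Cassandra's {{ColumnFilter}} API: {{ToyIndex}}, {{columnsToFetch}}, and the index names are hypothetical stand-ins, and columns are modeled as plain strings taken from the example schema. The assumption is simply that the build fetches the union of the columns its indexes need, and falls back to every column as soon as one participating index covers entire rows.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class IndexColumnSelection {
    // Hypothetical stand-in for an index definition (not a Cassandra class).
    public static final class ToyIndex {
        final String name;
        final Set<String> columns;     // columns this index reads
        final boolean indexesWholeRow; // true for a row-level index

        public ToyIndex(String name, Set<String> columns, boolean indexesWholeRow) {
            this.name = name;
            this.columns = columns;
            this.indexesWholeRow = indexesWholeRow;
        }
    }

    // Union of the columns needed by the indexes being built. If any index
    // covers entire rows, every column must be fetched.
    public static Set<String> columnsToFetch(Set<String> allColumns,
                                             Collection<ToyIndex> building) {
        Set<String> needed = new TreeSet<>();
        for (ToyIndex index : building) {
            if (index.indexesWholeRow) {
                return new TreeSet<>(allColumns);
            }
            needed.addAll(index.columns);
        }
        return needed;
    }

    public static void main(String[] args) {
        // Regular columns of the example table in the description.
        Set<String> regular = new TreeSet<>(Arrays.asList("payload", "value"));

        ToyIndex onValue = new ToyIndex("value_idx",
                                        Collections.singleton("value"), false);
        ToyIndex wholeRow = new ToyIndex("row_idx",
                                         Collections.<String>emptySet(), true);

        // Only the indexed column is fetched for a single-column build.
        System.out.println(columnsToFetch(regular, Arrays.asList(onValue)));
        // A row-level index in the build forces a fetch of all columns.
        System.out.println(columnsToFetch(regular, Arrays.asList(onValue, wholeRow)));
    }
}
```

In this model the narrowed set is computed once per build rather than per partition, since the set of participating indexes does not change while the build runs.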


