Posted to commits@cassandra.apache.org by "Paulo Motta (JIRA)" <ji...@apache.org> on 2017/08/15 04:39:00 UTC

[jira] [Commented] (CASSANDRA-12245) initial view build can be parallel

    [ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126791#comment-16126791 ] 

Paulo Motta commented on CASSANDRA-12245:
-----------------------------------------

Thanks for the patch and sorry for the delay! I had an initial look; overall it looks good. I have the following comments/questions/remarks:

bq. The newly created ViewBuilderController reads the local token ranges, splits them to satisfy a concurrency factor, and runs a ViewBuilder for each of them.

It would be nice to reuse the {{Splitter}} methods if possible, so we can reuse their tests; if that's not straightforward, maybe move the new methods into {{Splitter}} and add some tests to make sure they work correctly.
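Something along these lines, as a standalone sketch with simplified types (the {{Range}} record and {{splitRanges}} helper are made up for illustration and assume non-wrapping ranges; the real patch would work with {{Range<Token>}} and the partitioner's token arithmetic):

{code:java}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public final class RangeSplitter
{
    // Simplified stand-in for Cassandra's token range (non-wrapping).
    record Range(BigInteger left, BigInteger right) {}

    /**
     * Evenly split each local range into enough parts so the total number of
     * parts is at least the requested concurrency factor. Ranges are never
     * merged, matching the behaviour described in the patch.
     */
    static List<Range> splitRanges(List<Range> localRanges, int concurrencyFactor)
    {
        int partsPerRange = Math.max(1, (int) Math.ceil((double) concurrencyFactor / localRanges.size()));
        List<Range> result = new ArrayList<>();
        for (Range range : localRanges)
        {
            BigInteger width = range.right().subtract(range.left());
            BigInteger step = width.divide(BigInteger.valueOf(partsPerRange));
            if (step.signum() <= 0)
            {
                result.add(range); // range too narrow to split further
                continue;
            }
            BigInteger left = range.left();
            for (int i = 1; i < partsPerRange; i++)
            {
                BigInteger right = left.add(step);
                result.add(new Range(left, right));
                left = right;
            }
            result.add(new Range(left, range.right())); // last part absorbs rounding remainder
        }
        return result;
    }
}
{code}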

bq.  When ViewBuilderController receives the finalization signal of the last ViewBuilder, it double-checks if there are new local ranges that weren't considered at the beginning of the build. If there are new ranges, new {{ViewBuilder}}s are created for them.

This will not work if the range movement that created the new local range finishes after the view has finished building. This problem already exists and is unrelated to the view build process itself; it's more related to range movement completion, which should ensure the views are properly built before the operation finishes, so I created CASSANDRA-13762 to handle this properly.

bq. Given that we have a ViewBuilder per local range, the key of the table system.views_builds_in_progress is modified to include the bounds of the token range. So, we will have an entry in the table for each ViewBuilder. The number of covered keys per range is also recorded in the table.

We can probably remove the generation field from the builds-in-progress table and [remove this comment|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L124].

bq. I have updated the patch to use a new separate table, system.views_builds_in_progress_v2

{{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call it {{system.view_builds_in_progress}} (remove the s) and also add a NOTICE entry informing that the previous table was replaced and its data files can be removed.

bq. The downside is that pending view builds will be restarted during an upgrade to 4.x, which seems reasonable to me.

Sounds reasonable to me too.

bq. ViewBuilder and ViewBuilderController are probably not the best names. Maybe we could rename ViewBuilder to something like ViewBuilderTask or ViewBuilderForRange, and rename ViewBuilderController to ViewBuilder.

{{ViewBuilder}} and {{ViewBuilderTask}} LGTM

bq. The concurrency factor is based on conf.concurrent_compactors because the views are built on the CompactionManager, but we may be interested in a different value.

I'm a bit concerned about starving the compaction executor for a long period during the view build of large base tables, so we should probably have another option like {{concurrent_view_builders}} with a conservative default, and perhaps control the concurrency at the {{ViewBuilderController}}. WDYT?
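For illustration, the concurrency cap could look roughly like this (just a sketch; {{concurrent_view_builders}} is the option name proposed above, not an existing setting, and the class and system property are hypothetical):

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ViewBuildExecutor
{
    // Conservative default, overridable by the proposed
    // concurrent_view_builders option (hypothetical system property here).
    private static final int CONCURRENT_VIEW_BUILDERS =
        Integer.getInteger("cassandra.concurrent_view_builders", 2);

    // A dedicated pool keeps long-running view builds from starving the
    // shared compaction executor.
    private final ExecutorService executor =
        Executors.newFixedThreadPool(CONCURRENT_VIEW_BUILDERS);

    /** Submit one task per token range; only N of them run at a time. */
    public void submitAll(List<Runnable> rangeBuildTasks)
    {
        rangeBuildTasks.forEach(executor::submit);
    }
}
{code}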

bq. The patch tries to evenly split the token ranges in the minimum number of parts to satisfy the concurrency factor, and it never merges ranges. So, with the default 256 virtual nodes (and a smaller concurrency factor) we create 256 build tasks. We might be interested in a different planning. If we want the number of tasks to be smaller than the number of local ranges we should modify the ViewBuilder task to be responsible for several ranges, although it will complicate the status tracking.

I think this is good to start with; we can improve the planning later if necessary. I don't think there is much gain from merging ranges to have smaller tasks.

bq. Probably there is a better way of implementing ViewBuilder.getCompactionInfo. The patch uses keysBuilt/ColumnFamilyStore.estimatedKeysForRange to estimate the completion, which could lead to a task completion status over 100%, depending on the estimation.

How about using {{prevToken.size(range.right)}} (introduced by CASSANDRA-7032)? This will not be available for BytesToken (used by ByteOrderedPartitioner), but since that partitioner is rarely used we could fall back to the current imprecise calculation in that case.
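To make the idea concrete, the progress would be the fraction of the range already covered, something like the following (a simplified sketch using raw {{BigInteger}} arithmetic and assuming a non-wrapping range; the real patch would use {{Token.size()}} from CASSANDRA-7032):

{code:java}
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public final class BuildProgress
{
    /**
     * Fraction of the range (left, right] already covered, given the last
     * token processed. Unlike keysBuilt / estimatedKeysForRange, this is
     * bounded by 1.0 because prevToken never passes right.
     * Assumes left < prevToken <= right (non-wrapping range).
     */
    static double progress(BigInteger left, BigInteger right, BigInteger prevToken)
    {
        BigDecimal covered = new BigDecimal(prevToken.subtract(left));
        BigDecimal total = new BigDecimal(right.subtract(left));
        return covered.divide(total, 6, RoundingMode.HALF_UP).doubleValue();
    }
}
{code}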

Other comments:

* Avoid submitting the view builder when the view is already built, instead of checking on the ViewBuilder ([here|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L109])
* ViewBuilder seems to be reimplementing some of the logic of {{PartitionRangeReadCommand}}, so I wonder if we should take this chance to simplify and use that instead of manually constructing the commands via ReducingKeyIterator and multiple {{SinglePartitionReadCommand}}s? We can totally do this in another ticket if you prefer.
* Perform view marking on ViewBuilderController [instead of ViewBuilder|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L177]
* Updating the view built status [at every key|https://github.com/adelapena/cassandra/blob/94f3d0d02bb5f849e4d54857d0b33531a5650643/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L163] is a bit inefficient and unnecessary, so we should probably update it every 1000 keys or so (see the sketch after this list).
* Would be nice to update {{interrupt_build_process_test}} to stop halfway through the build (instead of at the start of the build) and verify that it's being resumed correctly with the new changes.
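For the status-update throttling mentioned above, a minimal sketch (the class, constant, and callback are all hypothetical):

{code:java}
public final class ThrottledProgressTracker
{
    private static final int KEYS_BETWEEN_UPDATES = 1000;

    private long keysBuilt = 0;

    /** Call once per key processed; persists the status only every 1000 keys. */
    void onKeyBuilt(Runnable updateBuiltStatus)
    {
        if (++keysBuilt % KEYS_BETWEEN_UPDATES == 0)
            updateBuiltStatus.run();
    }
}
{code}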

> initial view build can be parallel
> ----------------------------------
>
>                 Key: CASSANDRA-12245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Assignee: Andrés de la Peña
>             Fix For: 4.x
>
>
> On a node with lots of data (~3TB), building a materialized view takes several weeks, which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized :
>  * do vnodes in parallel, instead of going through the entire range in one thread
>  * just iterate through sstables, not worrying about duplicates, and include the timestamp of the original write in the MV mutation. Since this doesn't exclude duplicates it does increase the amount of work and could temporarily surface ghost rows (yikes), but I guess that's why they call it eventual consistency. Doing it this way can avoid holding references to all tables on disk, allows parallelization, and removes the need to check other sstables for existing data. This is essentially the 'do a full repair' path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org