Posted to commits@cassandra.apache.org by "Serban Teodorescu (Jira)" <ji...@apache.org> on 2020/12/14 15:42:00 UTC

[jira] [Commented] (CASSANDRA-12416) sstableloader to stream sstables in a sorted order

    [ https://issues.apache.org/jira/browse/CASSANDRA-12416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249059#comment-17249059 ] 

Serban Teodorescu commented on CASSANDRA-12416:
-----------------------------------------------

I think this can be solved with a separate tool that merges multiple SSTables into one, after which SSTableLoader is run on the result. Something like https://github.com/tolbertam/sstable-tools#compact. It's debatable whether such a tool belongs in Cassandra; if it does, it should get its own ticket anyway.
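
As a rough illustration of the merge step (not how sstable-tools or Cassandra actually implements it), a k-way merge over inputs that are each already sorted could look like the sketch below. The row layout `(token, key, timestamp, value)` and the name `merge_sorted_runs` are assumptions made for this example; real SSTables carry far more structure (clustering columns, tombstones, per-cell timestamps).

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical row layout for illustration: (token, key, timestamp, value).
Row = Tuple[int, str, int, str]


def merge_sorted_runs(runs: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily merge runs that are each sorted by (token, key).

    When the same (token, key) appears in several runs, only the row with
    the highest timestamp is kept (a simplified last-write-wins rule).
    """
    merged = heapq.merge(*runs, key=lambda row: (row[0], row[1]))
    current = None
    for row in merged:
        if current is not None and (row[0], row[1]) == (current[0], current[1]):
            # Same (token, key) as the pending row: keep the newer write.
            if row[2] > current[2]:
                current = row
        else:
            if current is not None:
                yield current
            current = row
    if current is not None:
        yield current


# Two small sorted runs standing in for two SSTables.
run_a = [(10, "a", 1, "old"), (30, "c", 1, "x")]
run_b = [(10, "a", 2, "new"), (20, "b", 1, "y")]
merged_rows = list(merge_sorted_runs([run_a, run_b]))
```

Because the function is a generator, the merged rows can be consumed one at a time rather than materialized first.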

Theoretically, it would also be possible to merge the SSTables and stream the result directly instead of first writing it to disk as a new SSTable. But this would require refactoring SSTableLoader: as it stands, it relies on some table metadata to prepare the streaming, and that metadata won't be available until the merge is done.

Another, partially related point: in Cassandra 4 it is more efficient to stream SSTables that each belong to a single token range (see https://cassandra.apache.org/blog/2019/04/09/benchmarking_streaming.html). So a combination of merging and splitting by token range would be the most efficient (or you could implement the split at the source, in the code that uses CQLSSTableWriter).
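
To make the split idea concrete, here is a minimal sketch that groups rows by the token range that owns them. It assumes a simplified ring in which each range is identified by its inclusive upper bound and tokens above the last bound wrap around to the first range; the name `split_by_token_range`, the `(token, key)` row layout, and the example bounds are all hypothetical and not taken from Cassandra's code.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout for illustration: (token, key).
Row = Tuple[int, str]


def split_by_token_range(
    rows: Iterable[Row], range_ends: List[int]
) -> Dict[int, List[Row]]:
    """Group rows by owning range.

    range_ends must be sorted ascending; each entry is the inclusive upper
    bound of one range. Tokens greater than the last bound wrap around to
    the first range, as on a ring.
    """
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        # First range whose upper bound is >= the row's token.
        idx = bisect_left(range_ends, row[0])
        if idx == len(range_ends):
            idx = 0  # wrap around the ring
        buckets[range_ends[idx]].append(row)
    return dict(buckets)


ring = [0, 100, 200]  # assumed range upper bounds
rows = [(-50, "a"), (100, "b"), (101, "c"), (250, "d")]
by_range = split_by_token_range(rows, ring)
```

Each bucket could then be written out as its own SSTable, so that no output file spans more than one range.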

 

> sstableloader to stream sstables in a sorted order
> --------------------------------------------------
>
>                 Key: CASSANDRA-12416
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12416
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Tools
>            Reporter: Zhaojun Zhang
>            Priority: Normal
>
> Within each sstable, the data is sorted. However, this is not true across multiple sstables. We have a workflow that creates a read-only cluster by bulk loading data from sstables (written by CQLSSTableWriter) into a Cassandra cluster. We don't want to trigger compaction, and the best way to avoid it is to write data in sorted order, which requires a global sort across all data sources using an external sort algorithm. If we could use sstableloader to load data into the cluster in order, we wouldn't need this global sort, which would dramatically simplify our implementation and reduce code redundancy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org