Posted to commits@cassandra.apache.org by "Stefania (JIRA)" <ji...@apache.org> on 2016/02/02 07:49:40 UTC

[jira] [Updated] (CASSANDRA-11053) COPY FROM on large datasets: fix progress report and debug performance

     [ https://issues.apache.org/jira/browse/CASSANDRA-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefania updated CASSANDRA-11053:
---------------------------------
    Attachment: parent_profile.txt
                worker_profiles.txt

I've repeated the test in the exact same conditions as described above, with {{cProfile}} profiling all processes. I am attaching the full profile results (_worker_profiles.txt_ and _parent_profile.txt_).

The total test time was approx 15 minutes (900 seconds), of which 15 seconds were an artificial sleep in the parent to allow workers to dump their profile results.
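
For reference, the per-process profiling can be wired up roughly as follows (a minimal sketch with illustrative names, not the actual cqlsh code): each worker runs under {{cProfile}} and dumps its stats on exit, and the parent inspects them afterwards.

{code:python}
import cProfile
import multiprocessing
import pstats


def do_import_work(worker_id):
    # Placeholder for the real import loop (read, parse, batch, send).
    sum(i * i for i in range(10 ** 6))


def worker_main(worker_id):
    # Profile everything this worker does and dump the stats on exit,
    # one .prof file per process.
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        do_import_work(worker_id)
    finally:
        profiler.disable()
        profiler.dump_stats('worker_%d.prof' % worker_id)


def run(num_workers):
    procs = [multiprocessing.Process(target=worker_main, args=(i,))
             for i in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Print the hottest functions from one worker's profile.
    pstats.Stats('worker_0.prof').sort_stats('cumulative').print_stats(20)


if __name__ == '__main__':
    run(4)
{code}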

It is clear that with these large datasets we can no longer afford to read all data in the parent process and dish out rows to the workers, as has been the approach so far: we spend over 600 seconds in {{read_rows}} alone, and the worker processes spend significant time (30 seconds) just receiving that data. Distributing file names to workers and letting them do all the work is easy to implement and would solve both issues (a rough sketch follows the list below). However, it comes with some consequences:

* We would end up with one process per file unless we somehow split large files, but splitting would take time and users can prepare their data themselves. Further, COPY TO can now export to multiple files, so I think we should keep things simple and adapt our bulk tests to export to multiple files.
* Either we change the meaning of the *max ingest rate* and make it per worker process, or we need a global lock, which could become a bottleneck. I would prefer changing the meaning of max ingest rate, since users can always specify a rate equal to {{max_rate / num_processes}} if they really need a global cap.
* To keep things simple, retries are best handled by the worker processes, which means that if one process fails the import fails, at least partially; I think we can live with this.
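
To make the file-distribution idea concrete, here is a rough sketch (all names are illustrative, not a patch): the parent only pushes file names onto a queue, and each worker reads, parses and sends its own files, throttling itself to the per-process rate discussed above.

{code:python}
import csv
import multiprocessing
import time


def send_row(row):
    pass  # placeholder for batching and driver execution


def worker(file_queue, rows_per_sec):
    # Each worker pulls whole files and does all the work itself: reading,
    # parsing and sending. No rows flow through the parent.
    min_interval = 1.0 / rows_per_sec if rows_per_sec else 0.0
    while True:
        file_name = file_queue.get()
        if file_name is None:  # sentinel: no more files
            break
        with open(file_name) as f:
            for row in csv.reader(f):
                send_row(row)
                if min_interval:
                    time.sleep(min_interval)  # naive per-worker throttle


def import_files(file_names, num_workers, max_rate):
    # The "max ingest rate" becomes a per-process limit of
    # max_rate / num_workers, as proposed above.
    queue = multiprocessing.Queue()
    per_worker_rate = max_rate / float(num_workers)
    workers = [multiprocessing.Process(target=worker,
                                       args=(queue, per_worker_rate))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for name in file_names:
        queue.put(name)
    for _ in workers:
        queue.put(None)
    for w in workers:
        w.join()
{code}

A per-row {{time.sleep}} is of course a naive throttle; a real implementation would presumably reuse the existing rate-metering logic.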

There is room for improvement in the worker processes too, but it is not as straightforward. One interesting option would be to use a cythonized version of the driver, but this would not work out of the box due to the formatting hooks we inject into the driver. We spend a lot of time batching records, getting the replicas, binding parameters and hashing ({{_murmur3}}).
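
To illustrate where that time goes, the per-row hot path looks roughly like this (a hedged sketch; {{hash_key}}, {{get_replica}} and {{bind_params}} stand in for the driver's murmur3 hash, token-map lookup and statement binding, which this sketch does not reproduce):

{code:python}
from collections import defaultdict

MAX_BATCH_SIZE = 20  # illustrative


def make_batches(rows, hash_key, get_replica, bind_params):
    # For every row: hash the partition key, look up its replica, bind the
    # parameters, and append to a per-replica batch. Each step runs once per
    # row, which is where the profiles show the time going.
    batches = defaultdict(list)
    for row in rows:
        token = hash_key(row[0])          # murmur3 hash of the partition key
        replica = get_replica(token)      # token-map lookup in the driver
        batches[replica].append(bind_params(row))
        if len(batches[replica]) >= MAX_BATCH_SIZE:
            yield replica, batches.pop(replica)
    for replica, batch in batches.items():
        yield replica, batch
{code}

Even small constant costs in those callables dominate at 20M rows, which is why moving them into compiled code would help.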

WDYT [~pauloricardomg] and [~thobbs]?

> COPY FROM on large datasets: fix progress report and debug performance
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11053
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11053
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: copy_from_large_benchmark.txt, parent_profile.txt, worker_profiles.txt
>
>
> Running COPY FROM on a large dataset (20 GB split across 20M records) revealed two issues:
> * The progress report is incorrect: it advances very slowly until almost the end of the test, at which point it catches up extremely quickly.
> * The performance in rows per second is similar to that of smaller tests run locally with a smaller cluster (approx. 35,000 rows per second). As a comparison, cassandra-stress manages 50,000 rows per second under the same set-up, i.e. 1.5 times faster.
> See attached file _copy_from_large_benchmark.txt_ for the benchmark details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)