Posted to commits@cassandra.apache.org by "xiangdong Huang (JIRA)" <ji...@apache.org> on 2017/04/13 13:50:41 UTC

[jira] [Updated] (CASSANDRA-13446) CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB

     [ https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xiangdong Huang updated CASSANDRA-13446:
----------------------------------------
    Description: 
I want to use CQLSSTableWriter to bulk-load a large amount of data as SSTables; however, the CPU cost is high and the write speed is poor.
```java
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory(new File("output" + j))
        .forTable(SCHEMA)
        // FIXME: a size of 64 works; 128 or larger hangs and eventually crashes
        .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
        .using(INSERT_STMT)
        .withPartitioner(new Murmur3Partitioner())
        .build();
```
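For context, the write loop that feeds this writer looks roughly like the sketch below. This is a simplified illustration, not the exact attached code: test.csv matches the attached data file, but the column layout and parsing are placeholders (see the attached csv2sstable.java for the real program).

```java
import java.io.BufferedReader;
import java.io.FileReader;

// Simplified sketch of the load loop; the column layout and parsing are
// placeholders, not the exact attached code.
try (BufferedReader reader = new BufferedReader(new FileReader("test.csv")))
{
    String line;
    while ((line = reader.readLine()) != null)
    {
        String[] fields = line.split(",");
        // addRow() buffers the row in memory; once about buffer_size_in_mb
        // of data has accumulated, the writer flushes it out as an SSTable.
        writer.addRow(Long.parseLong(fields[0]), fields[1]);
    }
}
writer.close();
```

The buffer size is switched between runs with the JVM flag -Dbuffer_size_in_mb, e.g. -Dbuffer_size_in_mb=64 for the working case and -Dbuffer_size_in_mb=128 for the failing one.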
If `buffer_size_in_mb` is 64MB or smaller on my PC, everything is fine: CPU utilization is about 60% and memory usage is about 3GB (why 3GB? Luckily, I can live with that). The process writes SSTables one by one, each about 24MB on disk (I assume because SSTables compress their data).

However, if `buffer_size_in_mb` is larger, e.g. 128MB on my PC, CPU utilization rises to about 70% and memory stays around 3GB.
Once CQLSSTableWriter has received 128MB of data, it begins to flush it as an SSTable, and that is when the trouble starts:
CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. Windows Task Manager shows 0.0 MB/s of disk I/O for the process, and no file appears in the output folder (occasionally a _zero-KB mc-1-big-Data.db_ and a _zero-KB mc-1-big-Index.db_ show up, and a transaction log file appears and disappears). Meanwhile the process consumes 99% CPU and memory grows slightly beyond 3GB.
A long time later, the process crashes with "GC overhead limit exceeded", and still no SSTable file has been built.

When I use JProfiler 10 to check where the CPU time goes, it reports that CQLSSTableWriter.addRow() accounts for about 99% of it.

I have no idea how to optimize this myself, because Cassandra's SSTable writing path is so complex.

The important point is that a 64MB buffer is too small for production use: it produces many 24MB SSTables, whereas for a batch load we want one large SSTable that holds all the data.
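A possible workaround (an illustrative sketch, not the attached code: the chunks list is hypothetical) is to keep every writer at the working 64MB buffer and split the input across several writers by hand, but that yields exactly the pile of small SSTables we want to avoid:

```java
// Illustrative workaround sketch: one writer per input chunk, each kept at
// the 64MB buffer size that still works. The chunks list is hypothetical.
for (int j = 0; j < chunks.size(); j++)
{
    CQLSSTableWriter w = CQLSSTableWriter.builder()
            .inDirectory(new File("output" + j))
            .forTable(SCHEMA)
            .withBufferSizeInMB(64)  // the largest size that does not hang
            .using(INSERT_STMT)
            .withPartitioner(new Murmur3Partitioner())
            .build();

    for (Object[] row : chunks.get(j))
        w.addRow(row);

    w.close();
}
```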

Now I wonder whether Spark and MapReduce really work well with Cassandra, because a glance at their source code shows that they also use CQLSSTableWriter to save output data.

The Cassandra version is 3.10. The DataStax driver (used for its type codecs) is 3.2.0.

The attachments are my test program and the CSV data.
A complete test project can be found at: https://bitbucket.org/jixuan1989/csv2sstable


 

> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB 
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>         Environment: Windows 10, 8GB memory, i7 CPU
>            Reporter: xiangdong Huang
>         Attachments: csv2sstable.java, pom.xml, test.csv
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)