Posted to user@cassandra.apache.org by Bhuvan Rawal <bh...@gmail.com> on 2016/04/23 00:34:40 UTC

High Memory consumption with Copy command

Hi,

I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra cluster
with 32 GB of memory per node, sufficient disk, RF=1, and durable writes
disabled. The machine I'm loading from is external to the cluster, shares a
1 Gbps line, and has 16 GB of RAM. (We chose this setup hoping to reduce
CPU and IO usage on the cluster.)

I'm using the COPY command to feed in the data. It kicks off well, launches
a set of worker processes, and does about 50,000 rows per second. But I can
see the parent process accumulating memory almost equal to the size of the
data it has processed, and after a point the processes just hang. The parent
process was consuming 95% of system memory when it had processed around 60%
of the data.

I had earlier fed in data from multiple smaller files (less than 4 GB each)
and it worked as expected.

Is this a valid scenario?

Regards,
Bhuvan

Re: High Memory consumption with Copy command

Posted by Stefania Alborghetti <st...@datastax.com>.
That's really excellent! Thank you so much for sharing the results.

Regarding sstableloader, I am not familiar with its performance, so I cannot
make any recommendation; I've never compared it with COPY FROM.

I have, however, compared COPY FROM with another bulk import tool,
cassandra-loader <https://github.com/brianmhess/cassandra-loader/releases>,
during the tests for CASSANDRA-11053. COPY FROM should now be as efficient
as this tool, if not better (depending on the data set and test environment).

There is also this presentation
<http://www.slideshare.net/BrianHess4/bulk-loading-into-cassandra> from
Cassandra Summit 2015, which compares sstableloader, cassandra-loader and
the "old" COPY FROM. According to the results on slide 18, sstableloader is
slightly better than cassandra-loader for small records, but its
performance decreases as the record size increases.

So my guess is that sstableloader may or may not be better, depending on
the record size; if it is better, I would expect the difference to be
minimal. Sorry this is not very precise, but it's the best I have.


-- 

Stefania Alborghetti

Apache Cassandra Software Engineer

+852 6114 9265 | stefania.alborghetti@datastax.com

Re: High Memory consumption with Copy command

Posted by Bhuvan Rawal <bh...@gmail.com>.
I built Cython and disabled the bundled driver, and the performance has
been impressive. The memory issue is resolved, and I'm currently getting
around 100,000 rows per second; it's stressing both the client CPU and the
Cassandra nodes. That's the fastest I have ever seen it perform, with 60
million rows already transferred in ~5 minutes.

Just a final question before we close this thread: at this performance
level, would you recommend sstableloader or the COPY command?


Re: High Memory consumption with Copy command

Posted by Bhuvan Rawal <bh...@gmail.com>.
Thanks Stefania for the informative answer. The next blog post was pretty
useful as well:
http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from
I'll upgrade to 3.0.5, test with C extensions enabled, and report back on
this thread.


Re: High Memory consumption with Copy command

Posted by Stefania Alborghetti <st...@datastax.com>.
Hi Bhuvan

Support for large datasets in COPY FROM was added by CASSANDRA-11053
<https://issues.apache.org/jira/browse/CASSANDRA-11053>, which is available
in 2.1.14, 2.2.6, 3.0.5 and 3.5. Your scenario is valid with this patch
applied.

The 3.0.x and 3.x releases are already available, whilst the other two
releases are due in the next few days. You only need to install an
up-to-date release on the machine where COPY FROM is running.

You may find the setup instructions in this blog
<http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>
interesting. Specifically, for large datasets, I would highly recommend
installing the Python driver with C extensions, as it will speed things up
considerably. Again, this is only possible with the 11053 patch. Please
ignore the suggestion to also compile the cqlsh copy module itself with C
extensions (Cython), as you may hit CASSANDRA-11574
<https://issues.apache.org/jira/browse/CASSANDRA-11574> in the 3.0.5 and
3.5 releases.
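
As a rough sketch of that setup, assuming pip is available (the
CQLSH_NO_BUNDLED variable tells cqlsh to use the installed driver instead
of the bundled pure-Python one):

  pip install cython            # lets the driver build its C extensions
  pip install cassandra-driver  # Python driver, compiled with C extensions
  export CQLSH_NO_BUNDLED=true  # make cqlsh pick up the installed driver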

Before CASSANDRA-11053, the parent process was a bottleneck. This is
explained further in this blog
<http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from>,
second paragraph of the "worker processes" section. As a workaround, if you
are unable to upgrade, you may try reducing the INGESTRATE and introducing
a few extra worker processes via NUMPROCESSES. Also, the overloaded parent
process cannot report progress correctly, so a frozen progress report does
not mean the COPY operation has stopped making progress.
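
For example, a throttled run might look something like this (the values and
names are purely illustrative and should be tuned to your environment):

  cqlsh <node_ip> -e "COPY myks.mytable FROM '/path/to/data.csv' WITH INGESTRATE = 25000 AND NUMPROCESSES = 8"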

Do let us know if you still have problems, as this is new functionality.

With best regards,
Stefania



-- 

Stefania Alborghetti

Apache Cassandra Software Engineer

+852 6114 9265 | stefania.alborghetti@datastax.com