Posted to user@cassandra.apache.org by "Steinmaurer, Thomas" <th...@dynatrace.com> on 2018/04/26 12:02:19 UTC

Repair of 5GB data vs. disk throughput does not make sense

Hello,

yet another question/issue with repair.

Cassandra 2.1.18, 3 nodes, RF=3, vnodes=256, data volume ~ 5G per node only. A repair (nodetool repair -par) issued on a single node at this data volume takes around 36 min with an average of ~ 15 MByte/s disk throughput (read+write) for the entire time-frame, thus processing ~ 32 GByte from a disk perspective, i.e. ~ 6 times the actual data volume reported by nodetool status. Does this make any sense? This is with 4 compaction threads and a compaction throughput setting of 64 MB/s. We see similar results when repeating this test, even though most/all inconsistent data should already have been sorted out by previous runs.
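
As a rough sanity check of those numbers, here is a quick back-of-the-envelope calculation (Python, approximate figures from above only):

# Back-of-the-envelope check of the figures above (approximate values only).
AVG_DISK_MB_PER_S = 15   # observed read+write disk throughput during repair
REPAIR_MINUTES = 36      # observed repair wall-clock time
DATA_GB_PER_NODE = 5     # data volume reported by nodetool status

bytes_processed_gb = AVG_DISK_MB_PER_S * REPAIR_MINUTES * 60 / 1024.0
amplification = bytes_processed_gb / DATA_GB_PER_NODE
print("disk bytes processed: ~%.0f GB" % bytes_processed_gb)    # ~32 GB
print("amplification vs. data size: ~%.1fx" % amplification)    # ~6.3x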

I know there are tools like Reaper, but the above is a simple use case: a single failed node recovering after being down longer than the 3h hinted handoff window. How should this ever finish in a timely manner for > 500G on a recovering node?
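
Extrapolating the same figures to the recovery case (a rough sketch only; it assumes the ~ 6x disk amplification and ~ 15 MB/s throughput seen at 5 GB would also hold at 500 GB, which is by no means certain):

# Hypothetical extrapolation, NOT a measurement: repair time for a recovering
# node, assuming the amplification and throughput observed at 5 GB also held
# at 500 GB.
DATA_GB = 500
AMPLIFICATION = 6        # assumed, from the small-scale observation above
AVG_DISK_MB_PER_S = 15   # assumed, from the small-scale observation above

hours = DATA_GB * 1024 * AMPLIFICATION / AVG_DISK_MB_PER_S / 3600.0
print("estimated repair time: ~%.0f hours" % hours)    # ~57 hours, i.e. days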

I have to admit this is with NFS as storage. I know NFS might not be the best idea, but in the above test at ~ 5GB data volume we see an IOPS rate of ~ 700 at a disk latency of ~ 15ms, so I wouldn't treat it as that bad. Cassandra runs on-premise here (at the customer, so not hosted by us), so while we can make storage recommendations (of course preferring local disks), it may and will happen that NFS is in use.
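
For what it's worth, those IOPS/latency figures suggest only a moderate number of requests in flight on the storage side (a rough Little's law estimate, not a measurement):

# Rough Little's law estimate: average I/O requests in flight = IOPS * latency.
IOPS = 700
AVG_LATENCY_S = 0.015    # ~ 15 ms

in_flight = IOPS * AVG_LATENCY_S
print("avg outstanding I/O requests: ~%.0f" % in_flight)    # ~10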

Why we are using -par in combination with NFS is a different story and related to this issue: https://issues.apache.org/jira/browse/CASSANDRA-8743. Without switching from sequential to parallel repair, we basically kill Cassandra.

Throughput-wise, I also don't think it is related to NFS, because we see similar repair throughput values with AWS EBS (gp2, SSD based) when running regular repairs on small-sized CFs.

Thanks for any input.
Thomas
The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. If you received it in error please notify us immediately and then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a company registered in Linz whose registered office is at 4040 Linz, Austria, Freistädterstraße 313

Re: Repair of 5GB data vs. disk throughput does not make sense

Posted by horschi <ho...@gmail.com>.
Hi Thomas,

I don't think I have ever seen compaction being faster.

For me, tables with small values usually run at around 5 MB/s with a single
compaction. With larger blobs (a few KB per blob) I have seen 16 MB/s. Both
with "nodetool setcompactionthroughput 0".

I don't think it's disk related either. I think parsing the data simply
saturates the CPU, or perhaps the issue is GC related? But I have never dug
into it; I just observed low IO-wait percentages in top.

regards,
Christian




On Thu, Apr 26, 2018 at 7:39 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> I can't say for sure, because I haven't measured it, but I've seen a
> combination of readahead + large chunk size with compression cause serious
> issues with read amplification, although I'm not sure if or how it would
> apply here.  Likely depends on the size of your partitions and the
> fragmentation of the sstables, although at only 5GB I'm really surprised to
> hear 32GB read in, that seems a bit absurd.
>
> Definitely something to dig deeper into.
>

Re: Repair of 5GB data vs. disk throughput does not make sense

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
I can't say for sure, because I haven't measured it, but I've seen a
combination of readahead + large chunk size with compression cause serious
issues with read amplification, although I'm not sure if or how it would
apply here. It likely depends on the size of your partitions and the
fragmentation of the sstables, although at only 5GB I'm really surprised to
hear of 32GB read in; that seems a bit absurd.

Definitely something to dig deeper into.
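
To illustrate the kind of read amplification I mean (a sketch with assumed values: 64 KB compression chunks, the chunk_length_kb default, and 128 KB OS readahead; the real numbers depend on the table schema and the system):

# Sketch of read amplification from compression chunk size + OS readahead.
# All values are assumptions for illustration only.
CHUNK_KB = 64        # compression chunk size (chunk_length_kb default, assumed)
READAHEAD_KB = 128   # typical Linux readahead (assumed)
ROW_KB = 0.5         # average row read per seek (assumed)

# A small read still decompresses at least one whole chunk, and readahead can
# pull even more off the disk than that.
bytes_touched_kb = max(CHUNK_KB, READAHEAD_KB)
print("~%.0fx read amplification per small random read" % (bytes_touched_kb / ROW_KB))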
