Posted to user@cassandra.apache.org by Mattias Larsson <ml...@yahoo-inc.com> on 2012/10/25 00:26:36 UTC

Hinted Handoff storage inflation

I'm testing various scenarios in a multi-data-center configuration. The setup is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic random data generator that I can run, and each run adds roughly 1GiB of data to each node,

DC          Rack        Status State   Load            Effective-Ownership
                                                                          
DC1         RAC1        Up     Normal  1010.71 MB      60.00%             
DC2         RAC1        Up     Normal  1009.08 MB      60.00%             
DC1         RAC1        Up     Normal  1.01 GB         60.00%             
DC2         RAC1        Up     Normal  1 GB            60.00%             
DC1         RAC1        Up     Normal  1.01 GB         60.00%             
DC2         RAC1        Up     Normal  1014.45 MB      60.00%             
DC1         RAC1        Up     Normal  1.01 GB         60.00%             
DC2         RAC1        Up     Normal  1.01 GB         60.00%             
DC1         RAC1        Up     Normal  1.01 GB         60.00%             
DC2         RAC1        Up     Normal  1.01 GB         60.00%             
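
For reference, the replication layout described above corresponds to a keyspace defined roughly like this (a sketch in cassandra-cli; the keyspace name is made up, and the exact cli syntax varies a little between versions):

create keyspace TestKS
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {DC1 : 3, DC2 : 3};

With RF 3 per data center, a LOCAL_QUORUM write only has to be acknowledged by 2 of the 3 replicas in the coordinator's own DC.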

Now, if I kill all the nodes in DC2 and run the data generator again, I would expect roughly 2GiB to be added to each node in DC1 (local replicas + hints for the other data center); instead I get this:

DC          Rack        Status State   Load            Effective-Ownership
                                                                          
DC1         RAC1        Up     Normal  17.56 GB        60.00%             
DC2         RAC1        Down   Normal  1009.08 MB      60.00%             
DC1         RAC1        Up     Normal  17.47 GB        60.00%             
DC2         RAC1        Down   Normal  1 GB            60.00%             
DC1         RAC1        Up     Normal  17.22 GB        60.00%             
DC2         RAC1        Down   Normal  1014.45 MB      60.00%             
DC1         RAC1        Up     Normal  16.94 GB        60.00%             
DC2         RAC1        Down   Normal  1.01 GB         60.00%             
DC1         RAC1        Up     Normal  17.26 GB        60.00%             
DC2         RAC1        Down   Normal  1.01 GB         60.00%             

Checking the sstables on a node reveals this,

-bash-3.2$ du -hs HintsColumnFamily/
16G	HintsColumnFamily/
-bash-3.2$

So it seems that what I would have expected to be 1GiB of hints is much larger in reality, a 15x-16x inflation. This has a huge impact on write performance as well.

If I bring DC2 up again, eventually the load will drop down and even out to 2GiB across the entire cluster.

I'm wondering if this inflation is intended or if it is possibly a bug or something I'm doing wrong? Assuming this inflation is correct, what is the best way to deal with temporary connectivity issues with a second data center? Write performance is paramount in my use case. A 2x-3x overhead is doable, but not 15x-16x.

Thanks,
/dml



Re: Hinted Handoff storage inflation

Posted by aaron morton <aa...@thelastpickle.com>.
> With both data centers functional, the test takes just a few minutes to run; with one data center down, it takes 15x as long.
Could you provide the numbers? It's easier to get a feel for how the throughput is dropping. Does the latency reported by nodetool cfstats change? 
I'm also interested to know how long hints were collected for. 

Each coordinator will be writing three hints (one for each of the down replicas in DC2), which will slow down the other writes it needs to do. 

> but I found that the storage overhead was the same regardless of the size of the batch mutation (i.e., 5 vs 25 mutations made no difference).
Batch size makes no difference. Each row mutation is treated as an individual command; the batch is simply a way to reduce network calls. 

> Each write is new data only (no overwrites). Each mutation adds a row to one column family with a column containing about 100 bytes of data and a new row to another column family with a SuperColumn containing 2x17KiB payloads.
I cannot remember anyone raising this sort of issue about HH before. It may be that no one has looked at how that level of hints is handled. 
Could you reproduce the problem with a smaller test case? 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 27/10/2012, at 7:56 AM, Mattias Larsson <ml...@yahoo-inc.com> wrote:

> 
> On Oct 24, 2012, at 6:05 PM, aaron morton wrote:
> 
>> Hints store the columns, row key, KS name and CF id(s) for each mutation to each node, whereas an executed mutation will store the most recent columns collated with others under the same row key. So depending on the type of mutation, hints will take up more space. 
>> 
>> The worst case would be lots of overwrites. After that, writing a small amount of data to many rows would result in a lot of the serialised space being devoted to row keys, KS name and CF id.
>> 
>> 16GB is a lot though. What was the write workload like?
> 
> Each write is new data only (no overwrites). Each mutation adds a row to one column family with a column containing about 100 bytes of data and a new row to another column family with a SuperColumn containing 2x17KiB payloads. These are sent in batches of several mutations, but I found that the storage overhead was the same regardless of the size of the batch mutation (i.e., 5 vs 25 mutations made no difference). A total of 1,000,000 mutations like these are sent over the duration of the test.
> 
> 
>> You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Also, some metrics in JMX will tell you how many hints are stored. 
>> 
>>> This has a huge impact on write performance as well.
>> Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed asynchronously to the mutation request, but they still take resources to store. 
>> 
>> You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file. 
>> 
>> How long did the test run for? 
>> 
> 
> With both data centers functional, the test takes just a few minutes to run; with one data center down, it takes 15x as long.
> 
> /dml
> 
> 


Re: Hinted Handoff storage inflation

Posted by Mattias Larsson <ml...@yahoo-inc.com>.
On Oct 24, 2012, at 6:05 PM, aaron morton wrote:

> Hints store the columns, row key, KS name and CF id(s) for each mutation to each node, whereas an executed mutation will store the most recent columns collated with others under the same row key. So depending on the type of mutation, hints will take up more space. 
> 
> The worst case would be lots of overwrites. After that, writing a small amount of data to many rows would result in a lot of the serialised space being devoted to row keys, KS name and CF id.
> 
> 16GB is a lot though. What was the write workload like?

Each write is new data only (no overwrites). Each mutation adds a row to one column family with a column containing about 100 bytes of data and a new row to another column family with a SuperColumn containing 2x17KiB payloads. These are sent in batches of several mutations, but I found that the storage overhead was the same regardless of the size of the batch mutation (i.e., 5 vs 25 mutations made no difference). A total of 1,000,000 mutations like these are sent over the duration of the test.


> You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Also, some metrics in JMX will tell you how many hints are stored. 
> 
>> This has a huge impact on write performance as well.
> Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed asynchronously to the mutation request, but they still take resources to store. 
> 
> You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file. 
> 
> How long did the test run for? 
> 

With both data centers functional, the test takes just a few minutes to run; with one data center down, it takes 15x as long.

/dml



Re: Hinted Handoff storage inflation

Posted by aaron morton <aa...@thelastpickle.com>.
Hints store the columns, row key, KS name and CF id(s) for each mutation to each node, whereas an executed mutation will store the most recent columns collated with others under the same row key. So depending on the type of mutation, hints will take up more space. 

The worst case would be lots of overwrites. After that, writing a small amount of data to many rows would result in a lot of the serialised space being devoted to row keys, KS name and CF id.

16GB is a lot though. What was the write workload like?
You can get an estimate of the number of keys in the Hints CF using nodetool cfstats. Also, some metrics in JMX will tell you how many hints are stored. 
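For example, something along these lines (HintsColumnFamily lives in the system keyspace; the exact cfstats field names may vary a bit by version):

-bash-3.2$ nodetool -h localhost cfstats | grep -A 6 'Column Family: HintsColumnFamily'

The "Number of Keys (estimate)" and "Space used (live)" lines in that section give a rough idea of how many hint rows there are and how much disk they are taking.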

> This has a huge impact on write performance as well.
Yup. Hints are added to the same Mutation thread pool as normal mutations. They are processed asynchronously to the mutation request, but they still take resources to store. 
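An easy way to see that pressure is to keep an eye on the mutation pool while hints are being written, e.g.

-bash-3.2$ nodetool -h localhost tpstats

If the MutationStage pending count keeps climbing while DC2 is down, normal writes are queuing behind hint storage.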

You can adjust how long hints are collected for with max_hint_window_in_ms in the yaml file. 
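For example, in conf/cassandra.yaml (the value below is only an illustration, one hour):

# maximum time to keep collecting hints for a dead or unreachable node
max_hint_window_in_ms: 3600000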

How long did the test run for? 


Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 11:26 AM, Mattias Larsson <ml...@yahoo-inc.com> wrote:

> 
> I'm testing various scenarios in a multi-data-center configuration. The setup is 10 Cassandra 1.1.5 nodes configured into two data centers, 5 nodes in each DC (RF DC1:3,DC2:3, write consistency LOCAL_QUORUM). I have a synthetic random data generator that I can run, and each run adds roughly 1GiB of data to each node,
> 
> DC          Rack        Status State   Load            Effective-Ownership
> 
> DC1         RAC1        Up     Normal  1010.71 MB      60.00%             
> DC2         RAC1        Up     Normal  1009.08 MB      60.00%             
> DC1         RAC1        Up     Normal  1.01 GB         60.00%             
> DC2         RAC1        Up     Normal  1 GB            60.00%             
> DC1         RAC1        Up     Normal  1.01 GB         60.00%             
> DC2         RAC1        Up     Normal  1014.45 MB      60.00%             
> DC1         RAC1        Up     Normal  1.01 GB         60.00%             
> DC2         RAC1        Up     Normal  1.01 GB         60.00%             
> DC1         RAC1        Up     Normal  1.01 GB         60.00%             
> DC2         RAC1        Up     Normal  1.01 GB         60.00%             
> 
> Now, if I kill all the nodes in DC2 and run the data generator again, I would expect roughly 2GiB to be added to each node in DC1 (local replicas + hints for the other data center); instead I get this:
> 
> DC          Rack        Status State   Load            Effective-Ownership
> 
> DC1         RAC1        Up     Normal  17.56 GB        60.00%             
> DC2         RAC1        Down   Normal  1009.08 MB      60.00%             
> DC1         RAC1        Up     Normal  17.47 GB        60.00%             
> DC2         RAC1        Down   Normal  1 GB            60.00%             
> DC1         RAC1        Up     Normal  17.22 GB        60.00%             
> DC2         RAC1        Down   Normal  1014.45 MB      60.00%             
> DC1         RAC1        Up     Normal  16.94 GB        60.00%             
> DC2         RAC1        Down   Normal  1.01 GB         60.00%             
> DC1         RAC1        Up     Normal  17.26 GB        60.00%             
> DC2         RAC1        Down   Normal  1.01 GB         60.00%             
> 
> Checking the sstables on a node reveals this,
> 
> -bash-3.2$ du -hs HintsColumnFamily/
> 16G	HintsColumnFamily/
> -bash-3.2$
> 
> So it seems that what I would have expected to be 1GiB of hints is much larger in reality, a 15x-16x inflation. This has a huge impact on write performance as well.
> 
> If I bring DC2 up again, eventually the load will drop down and even out to 2GiB across the entire cluster.
> 
> I'm wondering if this inflation is intended or if it is possibly a bug or something I'm doing wrong? Assuming this inflation is correct, what is the best way to deal with temporary connectivity issues with a second data center? Write performance is paramount in my use case. A 2x-3x overhead is doable, but not 15x-16x.
> 
> Thanks,
> /dml
> 
>