Posted to user@cassandra.apache.org by Jack Krupansky <ja...@gmail.com> on 2016/02/24 02:47:39 UTC

JBOD device space allocation?

I just wanted to confirm whether my understanding of how JBOD allocates
device space is correct or not...

Pre-3.2:
On each memtable flush, Cassandra will select the directory (device) which
has the most available space as a percentage of the total available space
on all of the listed directories/devices. A random weighted value is used
so it won't always pick the same directory/device with the most space; the
goal is to balance writes for performance.
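
To make the pre-3.2 description concrete, here is a minimal sketch of a
free-space-weighted random pick. It is an illustration of the idea as
described above, not Cassandra's actual code; the class name
WeightedDirectoryPicker and the method pick are hypothetical.

```java
import java.util.Random;

// Illustrative model only; not Cassandra's actual directory-selection code.
final class WeightedDirectoryPicker {
    // Returns the index of the chosen directory. Each directory's chance of
    // being picked is its free space divided by the total free space.
    static int pick(long[] freeBytes, Random random) {
        long total = 0;
        for (long free : freeBytes) {
            total += free;
        }
        if (total <= 0) {
            throw new IllegalStateException("no free space in any directory");
        }
        // Uniform point in [0, total); walk the directories until the
        // running sum of free space passes that point.
        long point = (long) (random.nextDouble() * total);
        long cumulative = 0;
        for (int i = 0; i < freeBytes.length; i++) {
            cumulative += freeBytes[i];
            if (point < cumulative) {
                return i;
            }
        }
        // Only reachable through floating-point rounding at the top edge.
        return freeBytes.length - 1;
    }

    public static void main(String[] args) {
        // Hypothetical devices with 100, 300 and 600 GiB free.
        long[] freeBytes = {100L << 30, 300L << 30, 600L << 30};
        int[] hits = new int[freeBytes.length];
        Random random = new Random(42);
        for (int flush = 0; flush < 10_000; flush++) {
            hits[pick(freeBytes, random)]++;
        }
        for (int i = 0; i < hits.length; i++) {
            System.out.println("directory " + i + ": " + hits[i] + " flushes");
        }
    }
}
```

Under this model the emptiest device receives the most flushes on average,
but every device with free space still receives some.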

As of 3.2:
The ranges of tokens stored on the local node will be evenly distributed
among the configured storage devices - even by token range, even if that
may be uneven by actual partition sizes. The code presumes that each of the
configured local storage devices has the same capacity.
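
As a rough sketch of the 3.2 model described above: assume, for
simplicity, that the node owns one range covering the whole signed 64-bit
token space, and cut it into equal-width slices, one per device. The real
boundaries would be derived from the node's actual local ranges; the class
and method names here (TokenRangeDiskSplit, upperBounds, diskFor) are
hypothetical.

```java
import java.math.BigInteger;

// Illustrative model only; real boundaries depend on the node's local ranges.
final class TokenRangeDiskSplit {
    static final BigInteger MIN_TOKEN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX_TOKEN = BigInteger.valueOf(Long.MAX_VALUE);

    // Inclusive upper bound of each device's slice when the token space
    // [MIN_TOKEN, MAX_TOKEN] is cut into diskCount equal-width slices.
    static long[] upperBounds(int diskCount) {
        if (diskCount < 1) {
            throw new IllegalArgumentException("need at least one device");
        }
        // BigInteger avoids overflow: the width of the space exceeds a long.
        BigInteger width = MAX_TOKEN.subtract(MIN_TOKEN);
        long[] bounds = new long[diskCount];
        for (int i = 0; i < diskCount; i++) {
            BigInteger offset = width
                    .multiply(BigInteger.valueOf(i + 1))
                    .divide(BigInteger.valueOf(diskCount));
            bounds[i] = MIN_TOKEN.add(offset).longValueExact();
        }
        return bounds;
    }

    // Index of the device whose slice contains the token.
    static int diskFor(long token, long[] bounds) {
        for (int i = 0; i < bounds.length; i++) {
            if (token <= bounds[i]) {
                return i;
            }
        }
        throw new IllegalArgumentException("token above the last boundary");
    }

    public static void main(String[] args) {
        long[] bounds = upperBounds(4);
        long[] sampleTokens = {Long.MIN_VALUE, -1L, 0L, Long.MAX_VALUE};
        for (long token : sampleTokens) {
            System.out.println("token " + token + " -> device " + diskFor(token, bounds));
        }
    }
}
```

The slices are equal in token width only. A device whose slice happens to
hold larger partitions fills faster, which is the unevenness mentioned
above.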

The relevant change in 3.2 appears to be:
Make sure tokens don't exist in several data directories (CASSANDRA-6696)

The code for the pre-3.2 model is still in 3.x - is there some other code
path which will cause the pre-3.2 behavior even when running 3.2 or later?

I see this code which seems to allow for at least some cases where the
pre-3.2 behavior would still be invoked, but I'm not sure what user-level
cases that might be:

if (!cfs.getPartitioner().splitter().isPresent() || localRanges.isEmpty())
    return Collections.singletonList(new FlushRunnable(lastReplayPosition.get(), txn));

return createFlushRunnables(localRanges, txn);

IOW, if the partitioner does not have a splitter present or the localRanges
for the node cannot be determined. But... what exactly would a user do to
cause that?

There is no doc for this stuff - can a committer (or adventurous user!)
confirm what is actually implemented, both pre and post 3.2? (I already
pinged docs on this.)

Or, if anybody is actually using JBOD, what behavior are you seeing for
device space utilization?

Thanks!

-- Jack Krupansky

Re: JBOD device space allocation?

Posted by Marcus Eriksson <kr...@gmail.com>.

On Wed, Feb 24, 2016 at 6:28 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> Thanks. I didn't pay enough attention to that statement on my initial
> reading of that post (which was where I became aware of the 3.2 behavior in
> the first place.)
>
> Considering that the doc explicitly recommends that the byte ordered
> partitioner not be used, that implies that the 3.2 JBOD behavior should be
> used for all recommended partitioner use cases.
>
> I'm still not clear on when exactly a node would not have "localRanges" -
> in terms of how the user would hit that scenario, or is that merely a
> defensive check for a scenario which cannot normally be encountered? I
> mean, it means that the endpoint is not responsible for any range of
> tokens, but how can that ever be true, or is that simply if the user
> configures the node to own zero tokens? But other than that, is there any
> normal way a user could end up with a node that has no "localRanges"?
>

IIRC it is only defensive now - before
https://issues.apache.org/jira/browse/CASSANDRA-9317 it could be empty
during startup.


>
> But even if the node owns no "local" ranges, can't it have replicated data
> from RF=k-1 other nodes? Or does empty localRanges mean that the RF=k-1
> nodes that might have replicated data for this node are all also configured
> to own zero tokens? Seems that way. But is there any reasonable scenario
> under which the user would hit this? I mean, why would the code care either
> way with respect to JBOD strategy for the case where no local data is
> stored?
>

Local ranges are all the ranges the node should store - if you have 256
vnode tokens and RF=3, you will have 768 local ranges.
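
The arithmetic in that example, taken at face value, is just the vnode
token count multiplied by the replication factor. A tiny sketch, with a
hypothetical helper name:

```java
// Illustrative arithmetic only; localRangeCount is a hypothetical helper.
final class LocalRangeCount {
    // Each vnode token contributes one range per replica the node stores.
    static int localRangeCount(int vnodeTokens, int replicationFactor) {
        return Math.multiplyExact(vnodeTokens, replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println(localRangeCount(256, 3)); // prints 768
    }
}
```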

/Marcus



Re: JBOD device space allocation?

Posted by Jack Krupansky <ja...@gmail.com>.

Thanks. I didn't pay enough attention to that statement on my initial
reading of that post (which was where I became aware of the 3.2 behavior in
the first place.)

Considering that the doc explicitly recommends that the byte ordered
partitioner not be used, that implies that the 3.2 JBOD behavior should be
used for all recommended partitioner use cases.

I'm still not clear on when exactly a node would not have "localRanges" -
in terms of how the user would hit that scenario, or is that merely a
defensive check for a scenario which cannot normally be encountered? I
mean, it means that the endpoint is not responsible for any range of
tokens, but how can that ever be true, or is that simply if the user
configures the node to own zero tokens? But other than that, is there any
normal way a user could end up with a node that has no "localRanges"?

But even if the node owns no "local" ranges, can't it have replicated data
from RF=k-1 other nodes? Or does empty localRanges mean that the RF=k-1
nodes that might have replicated data for this node are all also configured
to own zero tokens? Seems that way. But is there any reasonable scenario
under which the user would hit this? I mean, why would the code care either
way with respect to JBOD strategy for the case where no local data is
stored?

-- Jack Krupansky

On Wed, Feb 24, 2016 at 2:15 AM, Marcus Eriksson <kr...@gmail.com> wrote:

> It is mentioned here btw: http://www.datastax.com/dev/blog/improving-jbod
>
> On Wed, Feb 24, 2016 at 8:14 AM, Marcus Eriksson <kr...@gmail.com>
> wrote:
>
>> If you don't use RandomPartitioner/Murmur3Partitioner you will get the
>> old behavior.

Re: JBOD device space allocation?

Posted by Marcus Eriksson <kr...@gmail.com>.

It is mentioned here btw: http://www.datastax.com/dev/blog/improving-jbod

On Wed, Feb 24, 2016 at 8:14 AM, Marcus Eriksson <kr...@gmail.com> wrote:

> If you don't use RandomPartitioner/Murmur3Partitioner you will get the old
> behavior.

Re: JBOD device space allocation?

Posted by Marcus Eriksson <kr...@gmail.com>.

If you don't use RandomPartitioner/Murmur3Partitioner you will get the old
behavior.
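
A small model of the guard quoted in the original question may make this
easier to see. The types below are stand-ins, not Cassandra classes, and
FlushPathModel and FlushPath are hypothetical names; the only logic taken
from the thread is the condition itself.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Optional;

// Illustrative model of the guard quoted in this thread; stand-in types only.
final class FlushPathModel {
    enum FlushPath { SINGLE_WRITER_OLD_BEHAVIOR, PER_DISK_BY_TOKEN_RANGE }

    // splitter: empty when the partitioner provides no splitter.
    // localRanges: the token ranges this node should store.
    static FlushPath choose(Optional<?> splitter, Collection<?> localRanges) {
        if (!splitter.isPresent() || localRanges.isEmpty()) {
            return FlushPath.SINGLE_WRITER_OLD_BEHAVIOR;
        }
        return FlushPath.PER_DISK_BY_TOKEN_RANGE;
    }

    public static void main(String[] args) {
        Collection<String> oneRange = Collections.singletonList("range");
        // No splitter: the old behavior applies.
        System.out.println(choose(Optional.empty(), oneRange));
        // Splitter present and ranges known: flush is split by token range.
        System.out.println(choose(Optional.of("splitter"), oneRange));
    }
}
```

In this model a partitioner without a splitter always takes the first
branch, however many local ranges the node has.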
