Posted to user@cassandra.apache.org by Anubhav Kale <An...@microsoft.com> on 2016/03/17 18:24:33 UTC

DTCS bucketing Question

<Not sure if this is the right alias or Dev, so asking in both places>

Hello,

I am trying to concretely understand how DTCS makes buckets. I have been looking at the DateTieredCompactionStrategyTest.testGetBuckets method and playing with some of the parameters to the getBuckets call (Cassandra 2.1.12).

I don't think I fully understand something there. Let me try to explain.

Consider the second test there. I changed the pairs a bit for easier explanation and set base (initial window size) = 1000L and min_threshold = 2.

pairs = Lists.newArrayList(
        Pair.create("a", 200L),
        Pair.create("b", 2000L),
        Pair.create("c", 3600L),
        Pair.create("d", 3899L),
        Pair.create("e", 3900L),
        Pair.create("f", 3950L),
        Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 1000L, 2, 4050L, Long.MAX_VALUE);

In this case, the buckets should look like [0-4000] [4000-]. Is this correct? The buckets that I get back are different ("a" lives in its own bucket and everyone else in another). What am I missing here?
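For anyone who wants to reproduce this outside the Cassandra test harness, here is a minimal sketch of the bucketing walk. It is a simplified re-creation based on a reading of the 2.1-era Target logic, not the actual Cassandra source: the class names DtcsBucketsSketch and Entry are hypothetical, Entry stands in for Pair<String, Long>, and timestamps are assumed to be non-negative. The third argument to getBuckets is the value the post calls min_threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified re-creation of the DTCS bucketing walk; NOT the Cassandra source.
final class DtcsBucketsSketch {

    // A named file with its timestamp (hypothetical stand-in for Pair<String, Long>).
    static final class Entry {
        final String name;
        final long timestamp;
        Entry(String name, long timestamp) { this.name = name; this.timestamp = timestamp; }
    }

    // A window covering [divPosition * size, (divPosition + 1) * size).
    static final class Target {
        final long size;
        final long divPosition;
        Target(long size, long divPosition) { this.size = size; this.divPosition = divPosition; }

        // 0 if the timestamp is inside the window, < 0 if it is newer, > 0 if it is older.
        int compareToTimestamp(long timestamp) {
            return Long.compare(divPosition, timestamp / size);
        }

        // Step to the next older window; grow by a factor of base only when aligned.
        Target next(int base, long maxWindowSize) {
            boolean wouldExceedMax = size > maxWindowSize / base; // avoids overflow of size * base
            if (divPosition % base > 0 || wouldExceedMax)
                return new Target(size, divPosition - 1);
            return new Target(size * base, divPosition / base - 1);
        }
    }

    static List<List<String>> getBuckets(List<Entry> files, long timeUnit, int base,
                                         long now, long maxWindowSize) {
        // Newest first.
        List<Entry> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.timestamp).reversed());

        List<List<String>> buckets = new ArrayList<>();
        Target target = new Target(timeUnit, now / timeUnit); // window containing "now"
        int i = 0;
        while (i < sorted.size()) {
            int cmp = target.compareToTimestamp(sorted.get(i).timestamp);
            if (cmp < 0) { i++; continue; }                                       // too new: skip
            if (cmp > 0) { target = target.next(base, maxWindowSize); continue; } // too old: older window
            List<String> bucket = new ArrayList<>();
            while (i < sorted.size() && target.compareToTimestamp(sorted.get(i).timestamp) == 0)
                bucket.add(sorted.get(i++).name);
            buckets.add(bucket);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Entry> pairs = Arrays.asList(
                new Entry("a", 200L), new Entry("b", 2000L), new Entry("c", 3600L),
                new Entry("d", 3899L), new Entry("e", 3900L), new Entry("f", 3950L),
                new Entry("too new", 4125L));
        System.out.println(getBuckets(pairs, 1000L, 2, 4050L, Long.MAX_VALUE));
    }
}
```

Under these assumptions the sketch prints [[too new], [f, e, d, c, b], [a]]: the windows it walks are [4000-5000), [2000-4000) and [0-2000), so "a" ends up alone because the walk starts at the window containing "now" and grows toward older data, rather than starting at 0.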

Another case,

pairs = Lists.newArrayList(
        Pair.create("a", 200L),
        Pair.create("b", 2000L),
        Pair.create("c", 3600L),
        Pair.create("d", 3899L),
        Pair.create("e", 3900L),
        Pair.create("f", 3950L),
        Pair.create("too new", 4125L)
);
buckets = getBuckets(pairs, 50L, 4, 4050L, Long.MAX_VALUE);

Here, the buckets should be [0-3200] [3200-4000] [4000-4050] [4050-]. Is this correct? Again, the buckets that come back are quite different.
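To see which windows this second set of parameters would produce, the following sketch lists them from newest to oldest. It is a hypothetical helper (DtcsWindowsSketch is not a Cassandra class) that applies the same stepping rule as the Target class as I read it, and it assumes no maximum window size and non-negative timestamps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT Cassandra code: lists DTCS-style windows for given settings.
final class DtcsWindowsSketch {

    // Returns {start, end} pairs (end exclusive), newest window first, down to timestamp 0.
    static List<long[]> windows(long timeUnit, int base, long now) {
        List<long[]> out = new ArrayList<>();
        long size = timeUnit;
        long div = now / timeUnit;           // index of the window that contains "now"
        while (true) {
            out.add(new long[] { div * size, (div + 1) * size });
            if (div == 0)
                break;                       // the window starting at 0 has been emitted
            if (div % base > 0) {
                div -= 1;                    // same size, one step older
            } else {
                size *= base;                // aligned: grow the window by a factor of base
                div = div / base - 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] w : windows(50L, 4, 4050L))
            System.out.println("[" + w[0] + "-" + w[1] + ")");
    }
}
```

With timeUnit=50, base=4 and now=4050 the sketch yields ten windows: [4050-4100), [4000-4050), [3800-4000), [3600-3800), [3400-3600), [3200-3400), [2400-3200), [1600-2400), [800-1600) and [0-800). Under that layout "too new" (4125) lies beyond the newest window, f, e and d share [3800-4000), c sits in [3600-3800), b in [1600-2400) and a in [0-800).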

Note that if I keep the base at the original (100L) or increase it and play with min_threshold, the results are exactly what I would expect.

The way I think about DTCS is: try to make buckets of the maximum possible size starting from 0, and once you can't do that, make smaller buckets (similar to what the comment suggests). Is this mental model wrong? I am afraid that the math in the Target class is somewhat hard to follow, so I am thinking about it this way.

Thanks a lot in advance.

-Anubhav

RE: DTCS bucketing Question

Posted by Anubhav Kale <An...@microsoft.com>.
Comments inline.

From: Jeff Jirsa [mailto:jeff.jirsa@crowdstrike.com]
Sent: Thursday, March 17, 2016 11:01 AM
To: user@cassandra.apache.org
Subject: Re: DTCS bucketing Question

> I am trying to concretely understand how DTCS makes buckets and I am looking at the DateTieredCompactionStrategyTest.testGetBuckets method and played with some of the parameters to GetBuckets method call (Cassandra 2.1.12). I don’t think I fully understand something there.

Don’t feel bad, you’re not alone.

> In this case, the buckets should look like [0-4000] [4000-]. Is this correct ? The buckets that I get back are different (“a” lives in its bucket and everyone else in another). What I am missing here ?

The latest/newest window never gets combined, it’s ALWAYS the base size. Only subsequent windows get merged. First window will always be 0-1000. https://spotifylabscom.files.wordpress.com/2014/12/dtcs3.png
[Anubhav Kale] This doesn’t seem correct. In the original test (see its comments), the first window is pretty big, and that is true in many cases.

> Note, that if I keep the base to original (100L) or increase it and play with min_threshold the results are exactly what I would expect.

Because the original base is lower than the lowest timestamp, which means you’re never looking in the first window (0-base).

> I am afraid that the math in Target class is somewhat hard to follow so I am thinking about it this way.

The Target class is too clever for its own good. I couldn’t follow it. You’re having trouble following it.  Other smart people I’ve talked to couldn’t follow it. Last June I proposed an alternative (CASSANDRA-9666 / https://github.com/jeffjirsa/twcs ). It was never taken upstream, but it does get a fair bit of use by people with large time series clusters (we use it on one of our petabyte-scale clusters here). Significantly easier to reason about.

- Jeff
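As a rough illustration of why fixed windows are easier to reason about, here is a minimal sketch that groups the same files by fixed-size time windows. It only illustrates the general idea under an assumed window size of 1000; it is not code from the linked twcs repository, and FixedWindowSketch and bucketByWindow are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of fixed-size time windows; NOT code from the twcs repository.
final class FixedWindowSketch {

    // Groups file names by the start of the fixed window containing their timestamp.
    // Assumes non-negative timestamps.
    static Map<Long, List<String>> bucketByWindow(Map<String, Long> files, long windowSize) {
        Map<Long, List<String>> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> f : files.entrySet()) {
            long windowStart = (f.getValue() / windowSize) * windowSize;
            buckets.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(f.getKey());
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a", 200L);
        files.put("b", 2000L);
        files.put("c", 3600L);
        files.put("d", 3899L);
        files.put("e", 3900L);
        files.put("f", 3950L);
        files.put("too new", 4125L);
        System.out.println(bucketByWindow(files, 1000L));
    }
}
```

With a window size of 1000 this prints {0=[a], 2000=[b], 3000=[c, d, e, f], 4000=[too new]}. Each file's window depends only on its own timestamp and the window size, not on "now" or on a growth factor.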


Re: DTCS bucketing Question

Posted by Jeff Jirsa <je...@crowdstrike.com>.
> I am trying to concretely understand how DTCS makes buckets and I am looking at the DateTieredCompactionStrategyTest.testGetBuckets method and played with some of the parameters to GetBuckets method call (Cassandra 2.1.12). I don’t think I fully understand something there.

Don’t feel bad, you’re not alone.

> In this case, the buckets should look like [0-4000] [4000-]. Is this correct ? The buckets that I get back are different (“a” lives in its bucket and everyone else in another). What I am missing here ?

The latest/newest window never gets combined, it’s ALWAYS the base size. Only subsequent windows get merged. First window will always be 0-1000. https://spotifylabscom.files.wordpress.com/2014/12/dtcs3.png

> Note, that if I keep the base to original (100L) or increase it and play with min_threshold the results are exactly what I would expect.

Because the original base is lower than the lowest timestamp, which means you’re never looking in the first window (0-base).

> I am afraid that the math in Target class is somewhat hard to follow so I am thinking about it this way.

The Target class is too clever for its own good. I couldn’t follow it. You’re having trouble following it. Other smart people I’ve talked to couldn’t follow it. Last June I proposed an alternative (CASSANDRA-9666 / https://github.com/jeffjirsa/twcs ). It was never taken upstream, but it does get a fair bit of use by people with large time series clusters (we use it on one of our petabyte-scale clusters here). Significantly easier to reason about.

- Jeff