Posted to dev@cassandra.apache.org by Jeremy Hanna <je...@gmail.com> on 2023/06/12 21:59:48 UTC
Re: [DISCUSS] CEP-8 Drivers Donation - take 2
I'd like to close out this thread. As Benjamin notes, we'll have a single subproject for all of the drivers, with 3 PMC members overseeing it as outlined in the linked subproject governance procedures. However, we'll introduce the drivers to that subproject one by one, out of necessity.
I'll open up a vote thread shortly so that we can move forward on the CEP and subproject approach.
> On May 30, 2023, at 7:32 AM, Benjamin Lerer <b....@gmail.com> wrote:
>
> The idea was to have a single driver sub-project. Even if the code bases are different we believe that it is important to keep the drivers together to retain cohesive API semantics and make sure they have similar functionality and feature support.
> In this scenario we would need only 3 PMC members for the governance. I am willing to be one of them.
>
> For the committers, my understanding, based on the subproject governance procedures <https://lists.apache.org/thread/tgbqpq5x03b7ssoplccxompxj6d1gw90>, was that they should be proposed directly to the PMC members.
>
>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>
> The goal of the CEP is simply to ensure that the community is in favor of the donation. Nothing more.
> The plan is to introduce the drivers one by one. Each driver donation will need to be accepted first by the PMC members, as is the case for any donation. Therefore the PMC should have full control over the pace at which new drivers are accepted.
>
>
> On Tue, 30 May 2023 at 12:22, Josh McKenzie <jmckenzie@apache.org> wrote:
>>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>> Curious to hear on this as well. There are 2 implications from the CEP as written:
>>
>> 1. The Java and Python drivers hold special importance due to their language proximity and/or project's dependence upon them (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Scope)
>> 2. Datastax is explicitly offering all 7 drivers for donation (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Goals)
>>
>> This is the most complex contribution via CEP thus far from a governance perspective; I suggest we chart a bespoke path to navigate this. Having a top level indication of "the CEP is approved" logically separate from a per-language indication of "the project is ready to absorb this language driver now" makes sense to me. This could look like:
>>
>> * Vote on the CEP itself
>> * Per language (processing one at a time):
>>   * Identify 3 PMC members willing to take on the governance role for the language driver
>>   * Identify 2 contributors who are active on a given driver and stepping forward for a committer role on the driver
>>   * Vote on inclusion of that language driver in the project + commit bits
>>   * Integrate that driver into the project ecosystem (build, CI, docs, etc.)
>>
>> Not sure how else we could handle committers / contributors / PMC members other than on a per-driver basis.
>>
>> On Tue, May 30, 2023, at 5:36 AM, Mick Semb Wever wrote:
>>>
>>> Thank you so much Jeremy and Greg (+others) for all the hard work on this.
>>>
>>>
>>> At this point, we'd like to propose CEP-8 for consideration, starting the process to accept the DataStax Java driver as an official ASF project.
>>>
>>>
>>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>>>
>>> Are there key committers and contributors on each driver that want to be involved? Should they be listed before the vote?
>>> We also need three PMC members for the new subproject. Are we to assign these before the vote?
>>>
>>>
>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Brandon Williams <dr...@gmail.com>.
On Sun, Jul 16, 2023 at 11:47 PM Berenguer Blasi
<be...@gmail.com> wrote:
> one q that came up during the review: What should we do if we find a markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:
>
> A. That is a mistake/bug. It makes no sense when localDeletionTime can't already go any further than year 2106. We should reject/fail, maybe log, and add an upgrade note.
I think creation of doomstones is always a bug, but perhaps there is a
use case I cannot think of. One option that was discussed is setting
a default for the maximum_timestamp_fail_threshold which I think could
make sense, since it would provide protection but allow a way out.
> B. That was supported, regardless of how weird it may be. Cap it to the current max year 4254, maybe log and add an upgrade note.
I am not a fan of doing something other than what we were asked to do,
I think we should either reject it, or do it.
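For reference, the guardrail Brandon mentions lives in cassandra.yaml. A minimal sketch of what setting a default could look like — key names and the duration format are as I recall them from the 4.1-era guardrails, so verify against your version's cassandra.yaml before relying on them:

```yaml
# Hypothetical cassandra.yaml fragment: reject writes whose timestamp is more
# than the given duration ahead of the coordinator's clock. A sane default here
# would also catch absurd markForDeleteAt values (e.g. year 4254+) at write time.
maximum_timestamp_warn_threshold: 1d
maximum_timestamp_fail_threshold: 10d
```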
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi All,
one q that came up during the review: What should we do if we find a
markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:
A. That is a mistake/bug. It makes no sense when localDeletionTime can't
already go any further than year 2106. We should reject/fail, maybe log
and add an upgrade note.
B. That was supported, regardless of how weird it may be. Cap it to the
current max year 4254, maybe log and add an upgrade note.
Happy to hear your thoughts.
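Option A above amounts to a mask check on the most significant byte. A rough sketch, with illustrative names only (not the actual Cassandra code), and assuming sentinels such as LIVE are handled before this check:

```java
// Hypothetical sketch of option A: fail fast on a markForDeleteAt (mfda)
// whose most significant byte is non-zero, i.e. beyond what 7 bytes encode.
public class MfdaValidation
{
    // 2^56 - 1: the largest value representable in 7 bytes (~year 4254 in µs since epoch)
    static final long MAX_7_BYTE_MFDA = (1L << 56) - 1;

    static long validateMfda(long markedForDeleteAt)
    {
        // Any bit set in the top byte means the timestamp is beyond the
        // supported range; per option A we treat that as corruption.
        if ((markedForDeleteAt >>> 56) != 0)
            throw new IllegalStateException("markedForDeleteAt beyond supported range: " + markedForDeleteAt);
        return markedForDeleteAt;
    }

    public static void main(String[] args)
    {
        // A 2023-era timestamp in microseconds passes the check
        System.out.println(validateMfda(1_688_000_000_000_000L));
    }
}
```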
On 5/7/23 7:05, Berenguer Blasi wrote:
>
> Hi All,
>
> https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review
> and the PR is quite small
>
> Regards
>
> On 3/7/23 11:03, Berenguer Blasi wrote:
>>
>> Thanks for the comments Benedict. Given
>> DeletionTime.localDeletionTime is what caps everything to year 2106
>> (uint encoded now) I am ok with a DeletionTime.markForDeleteAt that
>> can go up to year 4254, personal opinion of course.
>>
>> And yes I hope once I read, doc and understand the sstable format
>> better I can look into your suggestion and anything else I come across.
>>
>> On 3/7/23 9:46, Benedict wrote:
>>> I checked and I’m pretty sure we do, but it doesn’t apply any
>>> liveness optimisation. I had misunderstood the optimisation you
>>> proposed. Ideally we would encode any non-live timestamp with the
>>> delta offset, but since that’s a distinct optimisation perhaps that
>>> can be left to another patch.
>>>
>>> Are we happy, though, that the two different deletion time
>>> serialisers can store different ranges of timestamp? Both are large
>>> ranges, but I am not 100% comfortable with them diverging.
>>>
>>>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> I can look into it. I don't have a deep knowledge of the sstable
>>>> format hence why I wanted to document it someday. But DeletionTime
>>>> is being serialized in other places as well iirc and I doubt
>>>> (finger in the air) we'll have that Epoch handy.
>>>>
>>>> On 29/6/23 17:22, Benedict wrote:
>>>>> So I’m just taking a quick peek at SerializationHeader and we
>>>>> already have a method for reading and writing a deletion time with
>>>>> offsets from EncodingStats.
>>>>>
>>>>> So perhaps we simply have a bug where we are using DeletionTime
>>>>> Serializer instead of SerializationHeader.writeLocalDeletionTime?
>>>>> It looks to me like this is already available at most (perhaps
>>>>> all) of the relevant call sites.
>>>>>
>>>>>
>>>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>>>
>>>>>>
>>>>>>> I would prefer we not plan on two distinct changes to this
>>>>>> I agree with this sentiment, /*and*/
>>>>>>
>>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>> People are going to use 5.0 for a while. Better to have an
>>>>>> improvement in their hands for that duration than no improvement
>>>>>> at all IMO. Justifies the cost of the double implementation and
>>>>>> transitions to me.
>>>>>>
>>>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>>>
>>>>>>> Just for completeness the change is a handful of LOC. The rest
>>>>>>> is added tests and we'd lose the sstable format change
>>>>>>> opportunity window.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>>>
>>>>>>> (If you have time for the other, or someone else does, then the
>>>>>>> technically superior approach should win)
>>>>>>>
>>>>>>>
>>>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi All,
https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review and
the PR is quite small
Regards
On 3/7/23 11:03, Berenguer Blasi wrote:
>
> Thanks for the comments Benedict. Given DeletionTime.localDeletionTime
> is what caps everything to year 2106 (uint encoded now) I am ok with
> a DeletionTime.markForDeleteAt that can go up to year 4254, personal
> opinion of course.
>
> And yes I hope once I read, doc and understand the sstable format
> better I can look into your suggestion and anything else I come across.
>
> On 3/7/23 9:46, Benedict wrote:
>> I checked and I’m pretty sure we do, but it doesn’t apply any
>> liveness optimisation. I had misunderstood the optimisation you
>> proposed. Ideally we would encode any non-live timestamp with the
>> delta offset, but since that’s a distinct optimisation perhaps that
>> can be left to another patch.
>>
>> Are we happy, though, that the two different deletion time
>> serialisers can store different ranges of timestamp? Both are large
>> ranges, but I am not 100% comfortable with them diverging.
>>
>>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> I can look into it. I don't have a deep knowledge of the sstable
>>> format hence why I wanted to document it someday. But DeletionTime
>>> is being serialized in other places as well iirc and I doubt (finger
>>> in the air) we'll have that Epoch handy.
>>>
>>> On 29/6/23 17:22, Benedict wrote:
>>>> So I’m just taking a quick peek at SerializationHeader and we
>>>> already have a method for reading and writing a deletion time with
>>>> offsets from EncodingStats.
>>>>
>>>> So perhaps we simply have a bug where we are using DeletionTime
>>>> Serializer instead of SerializationHeader.writeLocalDeletionTime?
>>>> It looks to me like this is already available at most (perhaps all)
>>>> of the relevant call sites.
>>>>
>>>>
>>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>>
>>>>>
>>>>>> I would prefer we not plan on two distinct changes to this
>>>>> I agree with this sentiment, /*and*/
>>>>>
>>>>>> +1, if you have time for this approach and no other in this window.
>>>>> People are going to use 5.0 for a while. Better to have an
>>>>> improvement in their hands for that duration than no improvement
>>>>> at all IMO. Justifies the cost of the double implementation and
>>>>> transitions to me.
>>>>>
>>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>>
>>>>>> Just for completeness the change is a handful of LOC. The rest
>>>>>> is added tests and we'd lose the sstable format change
>>>>>> opportunity window.
>>>>>>
>>>>>>
>>>>>>
>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>>
>>>>>> (If you have time for the other, or someone else does, then the
>>>>>> technically superior approach should win)
>>>>>>
>>>>>>
>>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Thanks for the comments Benedict. Given DeletionTime.localDeletionTime
is what caps everything to year 2106 (uint encoded now) I am ok with a
DeletionTime.markForDeleteAt that can go up to year 4254, personal
opinion of course.
And yes I hope once I read, doc and understand the sstable format better
I can look into your suggestion and anything else I come across.
On 3/7/23 9:46, Benedict wrote:
> I checked and I’m pretty sure we do, but it doesn’t apply any liveness
> optimisation. I had misunderstood the optimisation you proposed.
> Ideally we would encode any non-live timestamp with the delta offset,
> but since that’s a distinct optimisation perhaps that can be left to
> another patch.
>
> Are we happy, though, that the two different deletion time serialisers
> can store different ranges of timestamp? Both are large ranges, but I
> am not 100% comfortable with them diverging.
>
>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>> wrote:
>>
>>
>>
>> I can look into it. I don't have a deep knowledge of the sstable
>> format hence why I wanted to document it someday. But DeletionTime is
>> being serialized in other places as well iirc and I doubt (finger in
>> the air) we'll have that Epoch handy.
>>
>> On 29/6/23 17:22, Benedict wrote:
>>> So I’m just taking a quick peek at SerializationHeader and we
>>> already have a method for reading and writing a deletion time with
>>> offsets from EncodingStats.
>>>
>>> So perhaps we simply have a bug where we are using DeletionTime
>>> Serializer instead of SerializationHeader.writeLocalDeletionTime? It
>>> looks to me like this is already available at most (perhaps all) of
>>> the relevant call sites.
>>>
>>>
>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>
>>>>
>>>>> I would prefer we not plan on two distinct changes to this
>>>> I agree with this sentiment, /*and*/
>>>>
>>>>> +1, if you have time for this approach and no other in this window.
>>>> People are going to use 5.0 for a while. Better to have an
>>>> improvement in their hands for that duration than no improvement at
>>>> all IMO. Justifies the cost of the double implementation and
>>>> transitions to me.
>>>>
>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>
>>>>> Just for completeness the change is a handful of LOC. The rest is
>>>>> added tests and we'd lose the sstable format change
>>>>> opportunity window.
>>>>>
>>>>>
>>>>>
>>>>> +1, if you have time for this approach and no other in this window.
>>>>>
>>>>> (If you have time for the other, or someone else does, then the
>>>>> technically superior approach should win)
>>>>>
>>>>>
>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
I checked and I’m pretty sure we do, but it doesn’t apply any liveness
optimisation. I had misunderstood the optimisation you proposed. Ideally we
would encode any non-live timestamp with the delta offset, but since that’s a
distinct optimisation perhaps that can be left to another patch.
Are we happy, though, that the two different deletion time serialisers can
store different ranges of timestamp? Both are large ranges, but I am not 100%
comfortable with them diverging.
> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com> wrote:
>
> I can look into it. I don't have a deep knowledge of the sstable format
> hence why I wanted to document it someday. But DeletionTime is being
> serialized in other places as well iirc and I doubt (finger in the air)
> we'll have that Epoch handy.
>
> On 29/6/23 17:22, Benedict wrote:
>> So I’m just taking a quick peek at SerializationHeader and we already have
>> a method for reading and writing a deletion time with offsets from
>> EncodingStats.
>>
>> So perhaps we simply have a bug where we are using DeletionTime Serializer
>> instead of SerializationHeader.writeLocalDeletionTime? It looks to me like
>> this is already available at most (perhaps all) of the relevant call sites.
>>
>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>
>>>> I would prefer we not plan on two distinct changes to this
>>> I agree with this sentiment, *and*
>>>
>>>> +1, if you have time for this approach and no other in this window.
>>> People are going to use 5.0 for a while. Better to have an improvement in
>>> their hands for that duration than no improvement at all IMO. Justifies the
>>> cost of the double implementation and transitions to me.
>>>
>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>> Just for completeness the change is a handful of LOC. The rest is added
>>>>> tests and we'd lose the sstable format change opportunity window.
>>>>
>>>> +1, if you have time for this approach and no other in this window.
>>>>
>>>> (If you have time for the other, or someone else does, then the
>>>> technically superior approach should win)
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
I can look into it. I don't have a deep knowledge of the sstable format
hence why I wanted to document it someday. But DeletionTime is being
serialized in other places as well iirc and I doubt (finger in the air)
we'll have that Epoch handy.
On 29/6/23 17:22, Benedict wrote:
> So I’m just taking a quick peek at SerializationHeader and we already
> have a method for reading and writing a deletion time with offsets
> from EncodingStats.
>
> So perhaps we simply have a bug where we are using DeletionTime
> Serializer instead of SerializationHeader.writeLocalDeletionTime? It
> looks to me like this is already available at most (perhaps all) of
> the relevant call sites.
>
>
>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>
>>
>>> I would prefer we not plan on two distinct changes to this
>> I agree with this sentiment, /*and*/
>>
>>> +1, if you have time for this approach and no other in this window.
>> People are going to use 5.0 for a while. Better to have an improvement
>> in their hands for that duration than no improvement at all IMO.
>> Justifies the cost of the double implementation and transitions to me.
>>
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>
>>> Just for completeness the change is a handful of LOC. The rest is
>>> added tests and we'd lose the sstable format change opportunity
>>> window.
>>>
>>>
>>>
>>> +1, if you have time for this approach and no other in this window.
>>>
>>> (If you have time for the other, or someone else does, then the
>>> technically superior approach should win)
>>>
>>>
>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
So I’m just taking a quick peek at SerializationHeader and we already have a method for reading and writing a deletion time with offsets from EncodingStats.
So perhaps we simply have a bug where we are using DeletionTime Serializer instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this is already available at most (perhaps all) of the relevant call sites.
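The offset idea described here — serializing a deletion timestamp as a delta from a per-sstable minimum (the kind of minimum EncodingStats tracks) so typical values shrink to a byte or two — can be sketched roughly as follows. This is an illustration of the technique only, not Cassandra's actual VIntCoding/serializer API:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: delta-encode a timestamp against a per-sstable minimum
// and write the delta as an unsigned base-128 varint.
public class DeltaVIntSketch
{
    // Minimal unsigned varint: 7 data bits per byte, high bit marks continuation.
    static byte[] writeUnsignedVInt(long v)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0)
        {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    static byte[] writeDeltaTimestamp(long timestamp, long sstableMinTimestamp)
    {
        // Assumes timestamp >= sstableMinTimestamp, which a per-sstable
        // minimum of the kind EncodingStats keeps is meant to guarantee.
        return writeUnsignedVInt(timestamp - sstableMinTimestamp);
    }

    public static void main(String[] args)
    {
        long min = 1_688_000_000_000_000L; // hypothetical per-sstable minimum (µs)
        long ts  = min + 5_000_000L;       // a timestamp 5 seconds later
        System.out.println(writeDeltaTimestamp(ts, min).length); // 4 bytes instead of a fixed 8
    }
}
```

Timestamps close to the minimum (the common case within one sstable) encode in 1–2 bytes, which is where the "commonly compress to a single byte or so" expectation quoted later in the thread comes from.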
> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>
>
>>
>> I would prefer we not plan on two distinct changes to this
> I agree with this sentiment, and
>
>> +1, if you have time for this approach and no other in this window.
> People are going to use 5.0 for a while. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
>
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of LOC. The rest is added tests and we'd lose the sstable format change opportunity window.
>>
>>
>>
>> +1, if you have time for this approach and no other in this window.
>>
>> (If you have time for the other, or someone else does, then the technically superior approach should win)
>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Josh McKenzie <jm...@apache.org>.
> I would prefer we not plan on two distinct changes to this
I agree with this sentiment, **and**
> +1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of LOC. The rest is added tests and we'd lose the sstable format change opportunity window.
>>
>
>
> +1, if you have time for this approach and no other in this window.
>
> (If you have time for the other, or someone else does, then the technically superior approach should win)
>
>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Mick Semb Wever <mc...@apache.org>.
>
> Just for completeness the change is a handful of LOC. The rest is added tests
> and we'd lose the sstable format change opportunity window.
>
+1, if you have time for this approach and no other in this window.
(If you have time for the other, or someone else does, then the technically
superior approach should win)
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Just for completeness the change is a handful of LOC. The rest is added
tests and we'd lose the sstable format change opportunity window.
Thx again for the replies.
On 26/6/23 9:33, Benedict wrote:
> I would prefer we not plan on two distinct changes to this,
> particularly when neither change is particularly more complex than the
> other. There is a modest cost to maintenance from changing this
> multiple times.
>
> But if others feel strongly otherwise I won’t stand in the way.
>
>> On 26 Jun 2023, at 05:49, Berenguer Blasi <be...@gmail.com>
>> wrote:
>>
>>
>>
>> Thanks for the replies.
>>
>> I intend to javadoc the sstable format in detail someday and more
>> improvements might come up then, along with the vint encoding mentioned
>> here. But unless somebody volunteers to do that in 5.0, is anybody
>> against me trying to get the original proposal (1-byte flags for sentinel
>> values) in?
>>
>> Regards
>>
>>
>>> Distant future people will not be happy about this, I can already
>>> tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>> background thread.
>>
>> LOL
>>
>>
>>
>>
>> On 23/6/23 15:44, Josh McKenzie wrote:
>>>> If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to
>>>> a single byte or so.
>>> +1 to this approach.
>>>
>>>> Distant future people will not be happy about this, I can already
>>>> tell you now.
>>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>>> background thread.
>>>
>>> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>>>> It's a possibility. Though I haven't coded and benchmarked such an
>>>> approach and I don't think I would have the time before the freeze to
>>>> take advantage of the sstable format change opportunity.
>>>>
>>>> Still, it's something that can be explored later. If we can shave a few
>>>> extra
>>>> % then that would always be great imo.
>>>>
>>>> On 23/6/23 13:57, Benedict wrote:
>>>> > If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to
>>>> a single byte or so.
>>>> >
>>>> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com>
>>>> wrote:
>>>> >>
>>>> >> Distant future people will not be happy about this, I can
>>>> already tell you now.
>>>> >>
>>>> >> Sounds like a reasonable improvement to me however.
>>>> >>
>>>> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi
>>>> <be...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi all,
>>>> >>>
>>>> >>> DeletionTime.markedForDeleteAt is a long of microseconds since Unix
>>>> Epoch. But I noticed that with 7 bytes we can already encode ~2284
>>>> years. We can either shed the 8th byte, for reduced IO and disk, or
>>>> encode some sentinel values (such as LIVE) as flags there. That
>>>> would mean reading and writing 1 byte instead of 12 (8 mfda long +
>>>> 4 ldts int). Yes, we already avoid serializing DeletionTime (DT) in
>>>> sstables at _row_ level entirely, but not at _partition_ level, and
>>>> it is also serialized at index, metadata, etc.
>>>> >>>
>>>> >>> So here's a POC:
>>>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some
>>>> jmh (1) to evaluate the impact of the new alg (2). It's tested here
>>>> against 70% and 30% LIVE DT ratios to see how we perform:
>>>> >>>
>>>> >>> [java] Benchmark                               (liveDTPcParam) (sstableParam) Mode Cnt Score   Error Units
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>> >>>
>>>> >>> That was ByteBuffer backed to test the extra bit level
>>>> operations impact. But what would be the impact of an end to end
>>>> test against disk?
>>>> >>>
>>>> >>> [java] Benchmark                                  (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score       Error      Units
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive NC avgt 15  605236.515 ±  19929.058 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive OA avgt 15  586477.039 ±   7384.632 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive NC avgt 15  937580.311 ±  30669.647 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive OA avgt 15  914097.770 ±   9865.070 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ±  37879.012 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15  805256.345 ±  15471.587 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ±  50104.245 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ±  64342.510 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  70PcLive NC avgt 15  295711.217 ±   5432.507 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  70PcLive OA avgt 15  305282.827 ±   1906.841 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  30PcLive NC avgt 15  446029.899 ±   4038.938 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  30PcLive OA avgt 15  479085.875 ±  10032.804 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 70PcLive OA avgt 15  589752.861 ±  31676.265 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>>> >>>
>>>> >>> We can see big improvements when backed by disk, and
>>>> little impact from the new alg.
>>>> >>>
>>>> >>> Given we're already introducing a new sstable format (OA) in
>>>> 5.0 I would like to try to get this in before the freeze. The point
>>>> being that sstables with lots of small partitions would benefit
>>>> from a smaller DT at partition level. My tests show a 3%-4% size
>>>> reduction on disk.
>>>> >>>
>>>> >>> Before proceeding though, I'd like to bounce the idea off the
>>>> community: are there corner cases or scenarios I might have
>>>> missed where this could be a problem?
>>>> >>>
>>>> >>> Thx in advance!
>>>> >>>
>>>> >>>
>>>> >>> (1)
>>>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>>> >>>
>>>> >>> (2)
>>>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>>> >>>
>>>>
>>>
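As a sanity check of the 7-byte claim in the quoted proposal, and a sketch of the 1-byte LIVE flag idea: the constants below are illustrative (FLAG_LIVE is a hypothetical flag value, not Cassandra's actual encoding).

```java
// Back-of-the-envelope check: how far do 7 bytes of microseconds since the
// Unix Epoch reach? Values needing the 8th byte are beyond year ~4254.
public class SevenByteMfda
{
    static final long MAX_7_BYTE_MICROS = (1L << 56) - 1;           // largest 7-byte value
    static final long MICROS_PER_YEAR = 31_556_952L * 1_000_000L;   // avg Gregorian year in µs

    // Sketch of the sentinel idea: a LIVE DeletionTime could be written as a
    // single flag byte instead of 8 (mfda long) + 4 (ldts int) bytes.
    static final int FLAG_LIVE = 0x01; // hypothetical flag value

    public static void main(String[] args)
    {
        long maxYear = 1970 + MAX_7_BYTE_MICROS / MICROS_PER_YEAR;
        System.out.println(maxYear); // prints 4253, matching the ~2284-year claim
    }
}
```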
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
I would prefer we not plan on two distinct changes to this, particularly when
neither change is particularly more complex than the other. There is a modest
cost to maintenance from changing this multiple times.
But if others feel strongly otherwise I won’t stand in the way.
> On 26 Jun 2023, at 05:49, Berenguer Blasi <be...@gmail.com> wrote:
>
> Thanks for the replies.
>
> I intend to javadoc the sstable format in detail someday and more
> improvements might come up then, along with the vint encoding mentioned here.
> But unless somebody volunteers to do that in 5.0, is anybody against me
> trying to get the original proposal (1-byte flags for sentinel values) in?
>
> Regards
>
>> Distant future people will not be happy about this, I can already tell you now.
> Eh, they'll all be AI's anyway and will just rewrite the code in a
> background thread.
>
> LOL
>
> On 23/6/23 15:44, Josh McKenzie wrote:
>>> If we’re doing this, why don’t we delta encode a vint from some
>>> per-sstable minimum value? I’d expect that to commonly compress to a
>>> single byte or so.
>> +1 to this approach.
>>
>>> Distant future people will not be happy about this, I can already tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>> background thread.
>>
>> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>>> It's a possibility. Though I haven't coded and benchmarked such an
>>> approach and I don't think I would have the time before the freeze to
>>> take advantage of the sstable format change opportunity.
>>>
>>> Still, it's something that can be explored later. If we can shave a few
>>> extra % then that would always be great imo.
>>>
>>> On 23/6/23 13:57, Benedict wrote:
>>>> If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to a
>>>> single byte or so.
>>>>
>>>>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>>>>>
>>>>> Distant future people will not be happy about this, I can already tell you now.
>>>>>
>>>>> Sounds like a reasonable improvement to me however.
>>>>>
>>>>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> DeletionTime.markedForDeleteAt is a long of microseconds since Unix Epoch.
>>>>>> But I noticed that with 7 bytes we can already encode ~2284 years. We can
>>>>>> either shed the 8th byte, for reduced IO and disk, or encode some sentinel
>>>>>> values (such as LIVE) as flags there. That would mean reading and writing 1
>>>>>> byte instead of 12 (8 mfda long + 4 ldts int). Yes, we already avoid
>>>>>> serializing DeletionTime (DT) in sstables at _row_ level entirely, but not
>>>>>> at _partition_ level, and it is also serialized at index, metadata, etc.
>>>>>>
>>>>>> So here's a POC:
>>>>>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1)
>>>>>> to evaluate the impact of the new alg (2). It's tested here against 70% and
>>>>>> 30% LIVE DT ratios to see how we perform:
>>>>>>
>>>>>> [java] Benchmark                               (liveDTPcParam) (sstableParam) Mode Cnt Score   Error Units
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>>>>
>>>>>> That was ByteBuffer backed to test the extra bit level operations
>>>>>> impact. But what would be the impact of an end to end test against disk?
>>>>>>
>>>>>> [java] Benchmark                                  (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score      Error     Units
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive NC avgt 15  605236.515 ± 19929.058 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive OA avgt 15  586477.039 ±  7384.632 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive NC avgt 15  937580.311 ± 30669.647 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive OA avgt 15  914097.770 ±  9865.070 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/o
>
>>>
>>> >>>
>
>>>
>>> >>> We can see big improvements when backed with the disk and little
impact from the new alg.
>
>>>
>>> >>>
>
>>>
>>> >>> Given we're already introducing a new sstable format (OA) in 5.0 I
would like to try to get this in before the freeze. The point being that
sstables with lots of small partitions would benefit from a smaller DT at
partition level. My tests show a 3%-4% size reduction on disk.
>
>>>
>>> >>>
>
>>>
>>> >>> Before proceeding though I'd like to bounce the idea against the
community for all the corner cases and scenarios I might have missed where
this could be a problem?
>
>>>
>>> >>>
>
>>>
>>> >>> Thx in advance!
>
>>>
>>> >>>
>
>>>
>>> >>>
>
>>>
>>> >>> (1) <https://github.com/bereng/cassandra/blob/ldtdeser-
trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java>
>
>>>
>>> >>>
>
>>>
>>> >>> (2) <https://github.com/bereng/cassandra/blob/ldtdeser-
trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212>
>
>>>
>>> >>>
>
>>>
>>>
>
>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Thanks for the replies.
I intend to javadoc the sstable format in detail someday and more
improvements might come up then, along with the vint encoding mentioned here.
But unless somebody volunteers to do that in 5.0, does anybody object to my
trying to get the original proposal (1 byte of flags for sentinel values) in?
Regards
> Distant future people will not be happy about this, I can already tell
> you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a
background thread.
LOL
On 23/6/23 15:44, Josh McKenzie wrote:
>> If we’re doing this, why don’t we delta encode a vint from some
>> per-sstable minimum value? I’d expect that to commonly compress to a
>> single byte or so.
> +1 to this approach.
>
>> Distant future people will not be happy about this, I can already
>> tell you now.
> Eh, they'll all be AI's anyway and will just rewrite the code in a
> background thread.
>
> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>> It's a possibility. Though I haven't coded and benchmarked such an
>> approach and I don't think I would have the time before the freeze to
>> take advantage of the sstable format change opportunity.
>>
>> Still it's something that can be explored later. If we can shave a few extra
>> % then that would always be great imo.
>>
>> On 23/6/23 13:57, Benedict wrote:
>> > If we’re doing this, why don’t we delta encode a vint from some
>> per-sstable minimum value? I’d expect that to commonly compress to a
>> single byte or so.
>> >
>> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com>
>> wrote:
>> >>
>> >> Distant future people will not be happy about this, I can already
>> tell you now.
>> >>
>> >> Sounds like a reasonable improvement to me however.
>> >>
>> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi
>> <be...@gmail.com> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix
>> Epoch. But I noticed that with 7 bytes we can already encode ~2284
>> years. We can either shed the 8th byte, for reduced IO and disk, or
>> can encode some sentinel values (such as LIVE) as flags there. That
>> would mean reading and writing 1 byte instead of 12 (8 mfda long + 4
>> ldts int). Yes we already avoid serializing DeletionTime (DT) in
>> sstables at _row_ level entirely but not at _partition_ level and it
>> is also serialized at index, metadata, etc.
>> >>>
>> >>> So here's a POC:
>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some
>> jmh (1) to evaluate the impact of the new alg (2). It's tested here
>> against a 70% and a 30% LIVE DTs to see how we perform:
>> >>>
>> >>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt
>> Score Error Units
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC
>> avgt 15 0.331 ± 0.001 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA
>> avgt 15 0.335 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC
>> avgt 15 0.334 ± 0.002 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA
>> avgt 15 0.340 ± 0.008 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC
>> avgt 15 0.337 ± 0.006 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA
>> avgt 15 0.340 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC
>> avgt 15 0.339 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA
>> avgt 15 0.343 ± 0.016 ns/op
>> >>>
>> >>> That was ByteBuffer backed to test the extra bit level operations
>> impact. But what would be the impact of an end to end test against disk?
>> >>>
>> >>> [java] Benchmark (diskRAMParam) (liveDTPcParam)
>> (sstableParam) Mode Cnt Score Error Units
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
>> 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 70PcLive OA avgt 15 805256.345 ±
>> 15471.587 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 30PcLive NC avgt 15 1583239.011 ±
>> 50104.245 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 30PcLive OA avgt 15 1439605.006 ±
>> 64342.510 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT
>> RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT
>> Disk 70PcLive OA avgt 15 589752.861 ±
>> 31676.265 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>> >>>
>> >>> We can see big improvements when backed with the disk and little
>> impact from the new alg.
>> >>>
>> >>> Given we're already introducing a new sstable format (OA) in 5.0
>> I would like to try to get this in before the freeze. The point being
>> that sstables with lots of small partitions would benefit from a
>> smaller DT at partition level. My tests show a 3%-4% size reduction
>> on disk.
>> >>>
>> >>> Before proceeding though I'd like to bounce the idea against the
>> community for all the corner cases and scenarios I might have missed
>> where this could be a problem?
>> >>>
>> >>> Thx in advance!
>> >>>
>> >>>
>> >>> (1)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>> >>>
>> >>> (2)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>> >>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Josh McKenzie <jm...@apache.org>.
> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.
> Distant future people will not be happy about this, I can already tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.
On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an
> approach and I don't think I would have the time before the freeze to
> take advantage of the sstable format change opportunity.
>
> Still it's something that can be explored later. If we can shave a few extra
> % then that would always be great imo.
>
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
> >>>
> >>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
> >>>
> >>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
> >>>
> >>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
> >>>
> >>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
> >>>
> >>> We can see big improvements when backed with the disk and little impact from the new alg.
> >>>
> >>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
> >>>
> >>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
> >>>
> >>> Thx in advance!
> >>>
> >>>
> >>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
> >>>
> >>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
> >>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
It's a possibility. Though I haven't coded and benchmarked such an
approach and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.
Still it's something that can be explored later. If we can shave a few extra
% then that would always be great imo.
On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>>
>> Distant future people will not be happy about this, I can already tell you now.
>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>>>
>>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>>>
>>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>
>>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>>>
>>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>>
>>> We can see big improvements when backed with the disk and little impact from the new alg.
>>>
>>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>>>
>>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>>>
>>> Thx in advance!
>>>
>>>
>>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>>
>>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
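The delta idea above can be sketched roughly like this. This is an illustrative sketch only: the class and method names are hypothetical, and it uses a hand-rolled LEB128-style varint rather than Cassandra's actual VIntCoding API. The sstable metadata would record the minimum markedForDeleteAt seen, and each partition would store only an unsigned vint delta from that minimum, which for clustered timestamps commonly fits in one byte.

```java
// Rough sketch of delta + varint encoding (hypothetical; not Cassandra's VIntCoding).
import java.io.*;

public class DeltaVintSketch
{
    // Minimal LEB128-style unsigned varint writer: 7 data bits per byte,
    // high bit set on all bytes except the last.
    static void writeUnsignedVInt(long v, DataOutput out) throws IOException
    {
        while ((v & ~0x7FL) != 0)
        {
            out.writeByte((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.writeByte((int) v);
    }

    static long readUnsignedVInt(DataInput in) throws IOException
    {
        long v = 0;
        int shift = 0, b;
        do
        {
            b = in.readByte() & 0xFF;
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        while ((b & 0x80) != 0);
        return v;
    }

    // Serialize markedForDeleteAt as an unsigned delta from the per-sstable minimum.
    static void serializeDelta(long mfda, long sstableMin, DataOutput out) throws IOException
    {
        writeUnsignedVInt(mfda - sstableMin, out);
    }

    static long deserializeDelta(long sstableMin, DataInput in) throws IOException
    {
        return sstableMin + readUnsignedVInt(in);
    }
}
```

A value 100 microseconds above the sstable minimum would serialize to a single byte under this scheme, versus the fixed 8 bytes today.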
> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>
> Distant future people will not be happy about this, I can already tell you now.
>
> Sounds like a reasonable improvement to me however.
>
>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>
>> Hi all,
>>
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>>
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>>
>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>
>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>>
>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>
>> We can see big improvements when backed with the disk and little impact from the new alg.
>>
>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>>
>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>>
>> Thx in advance!
>>
>>
>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>
>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Aleksey Yeshchenko <al...@apple.com>.
Distant future people will not be happy about this, I can already tell you now.
Sounds like a reasonable improvement to me however.
> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>
> Hi all,
>
> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>
> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>
> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>
> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>
> We can see big improvements when backed with the disk and little impact from the new alg.
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>
> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>
> Thx in advance!
>
>
> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>
> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
The idea is 11 bytes less per LIVE partition. So small partitions will
benefit the most.
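For illustration, the arithmetic behind that claim: a LIVE partition-level
DeletionTime shrinks from 12 serialized bytes (8-byte markedForDeleteAt long
+ 4-byte localDeletionTime int) to a single flag byte, so each LIVE
partition saves 11 bytes. The helper below is mine, not part of the POC.

```java
// Quick arithmetic behind the "11 bytes less per LIVE partition" claim
// (hypothetical helper, not part of the actual POC).
public class SavingsMath
{
    public static long savedBytes(long livePartitions)
    {
        int oldSize = Long.BYTES + Integer.BYTES; // 8 + 4 = 12 bytes today
        int newSize = 1;                          // single flag byte for LIVE
        return livePartitions * (oldSize - newSize);
    }
}
```

So an sstable with a million small LIVE partitions would shed roughly 11 MB of partition-level deletion metadata, which is where the observed 3%-4% comes from.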
On 29/6/23 18:44, Brandon Williams wrote:
> On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa <jj...@gmail.com> wrote:
>> 3-4% reduction on disk ... for what exactly?
>>
>> It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
> If the data is TTL'd I think it's not entirely uncommon.
>
> Kind Regards,
> Brandon
Re: Improved DeletionTime serialization to reduce disk size
Posted by Brandon Williams <dr...@gmail.com>.
On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa <jj...@gmail.com> wrote:
> 3-4% reduction on disk ... for what exactly?
>
> It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
If the data is TTL'd I think it's not entirely uncommon.
Kind Regards,
Brandon
Re: Improved DeletionTime serialization to reduce disk size
Posted by Jeff Jirsa <jj...@gmail.com>.
On Thu, Jun 22, 2023 at 11:23 PM Berenguer Blasi <be...@gmail.com>
wrote:
> Hi all,
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would
> like to try to get this in before the freeze. The point being that
> sstables with lots of small partitions would benefit from a smaller DT
> at partition level. My tests show a 3%-4% size reduction on disk.
>
3-4% reduction on disk ... for what exactly?
It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
Is this enhancement driven by a pathological data model that's like "mostly
tiny records OR tombstones" ?
Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi all,
DeletionTime.markedForDeleteAt is a long holding microseconds since the
Unix Epoch, but I noticed that 7 bytes are already enough to encode
~2284 years. We can either shed the 8th byte for reduced IO and disk
usage, or encode sentinel values (such as LIVE) as flags there. That
would mean reading and writing 1 byte instead of 12 (the 8-byte mfda
long + the 4-byte ldts int). Yes, we already avoid serializing
DeletionTime (DT) entirely at _row_ level in sstables, but not at
_partition_ level, and it is also serialized in the index, metadata, etc.
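As a rough illustration of the idea (a simplified sketch, not the actual code in the POC branch; the flag value and byte layout here are assumptions): a LIVE DeletionTime collapses to a single sentinel byte, while a non-LIVE one keeps its 12 bytes, with the top byte of the old 8-byte slot repurposed as a flag byte and only the low 7 bytes of markedForDeleteAt serialized:

```java
import java.nio.ByteBuffer;

// Simplified sketch of flag-byte DeletionTime serialization.
// Assumed layout (not the actual POC code):
//   LIVE:     1 flag byte with the high bit set            -> 1 byte total
//   non-LIVE: 1 flag byte + 7-byte mfda + 4-byte ldts int  -> 12 bytes total
public final class CompactDeletionTime
{
    static final int LIVE_FLAG = 0x80;            // high bit set => LIVE sentinel
    static final long MFDA_MASK = (1L << 56) - 1; // low 7 bytes: ~2284 years of usec

    public static void serialize(long markedForDeleteAt, int localDeletionTime,
                                 boolean live, ByteBuffer out)
    {
        if (live)
        {
            out.put((byte) LIVE_FLAG);            // 1 byte instead of 12
            return;
        }
        out.put((byte) 0);                        // flag byte, high bit clear
        long mfda = markedForDeleteAt & MFDA_MASK;
        for (int shift = 48; shift >= 0; shift -= 8)
            out.put((byte) (mfda >>> shift));     // low 7 bytes, big-endian
        out.putInt(localDeletionTime);
    }

    /** Returns {markedForDeleteAt, localDeletionTime}; sentinel values for LIVE. */
    public static long[] deserialize(ByteBuffer in)
    {
        int flags = in.get() & 0xFF;
        if ((flags & LIVE_FLAG) != 0)
            return new long[] { Long.MIN_VALUE, Integer.MAX_VALUE }; // LIVE
        long mfda = 0;
        for (int i = 0; i < 7; i++)
            mfda = (mfda << 8) | (in.get() & 0xFF);
        return new long[] { mfda, in.getInt() };
    }
}
```

Note that under this sketch non-LIVE partitions stay at 12 bytes; the 11-byte saving applies only to LIVE ones, which is why sstables full of small live partitions benefit the most.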
So here's a POC:
https://github.com/bereng/cassandra/commits/ldtdeser-trunk along with a
JMH benchmark (1) to evaluate the impact of the new algorithm (2). It is
tested against 70% and 30% LIVE DT ratios to see how we perform:
[java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
[java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
That was ByteBuffer-backed, to measure the impact of the extra bit-level
operations. But what would be the impact of an end-to-end test against disk?
[java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
We can see big improvements in the disk-backed runs and little impact
from the new algorithm.
Given we're already introducing a new sstable format (OA) in 5.0 I would
like to try to get this in before the freeze. The point being that
sstables with lots of small partitions would benefit from a smaller DT
at partition level. My tests show a 3%-4% size reduction on disk.
Before proceeding, though, I'd like to bounce the idea off the
community: are there corner cases or scenarios I might have missed
where this could be a problem?
Thx in advance!
(1)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
(2)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212