Posted to dev@cassandra.apache.org by Jeremy Hanna <je...@gmail.com> on 2023/06/12 21:59:48 UTC
Re: [DISCUSS] CEP-8 Drivers Donation - take 2
I'd like to close out this thread. As Benjamin notes, we'll have a single subproject for all of the drivers, with 3 PMC members overseeing it as outlined in the linked subproject governance procedures. However, we'll introduce the drivers to that subproject one by one, out of necessity.
I'll open up a vote thread shortly so that we can move forward on the CEP and subproject approach.
> On May 30, 2023, at 7:32 AM, Benjamin Lerer <b....@gmail.com> wrote:
>
> The idea was to have a single driver sub-project. Even if the code bases are different we believe that it is important to keep the drivers together to retain cohesive API semantics and make sure they have similar functionality and feature support.
> In this scenario we would need only 3 PMC members for the governance. I am willing to be one of them.
>
> For the committers, my understanding, based on the subproject governance procedures <https://lists.apache.org/thread/tgbqpq5x03b7ssoplccxompxj6d1gw90>, was that they should be proposed directly to the PMC members.
>
>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>
> The goal of the CEP is simply to ensure that the community is in favor of the donation. Nothing more.
> The plan is to introduce the drivers one by one. Each driver donation will need to be accepted first by the PMC members, as is the case for any donation. Therefore the PMC should have full control over the pace at which new drivers are accepted.
>
>
> On Tue, 30 May 2023 at 12:22, Josh McKenzie <jmckenzie@apache.org> wrote:
>>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>> Curious to hear on this as well. There are 2 implications from the CEP as written:
>>
>> 1. The Java and Python drivers hold special importance due to their language proximity and/or project's dependence upon them (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Scope)
>> 2. Datastax is explicitly offering all 7 drivers for donation (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation#CEP8:DatastaxDriversDonation-Goals)
>>
>> This is the most complex contribution via CEP thus far from a governance perspective; I suggest we chart a bespoke path to navigate this. Having a top level indication of "the CEP is approved" logically separate from a per-language indication of "the project is ready to absorb this language driver now" makes sense to me. This could look like:
>>
>> * Vote on the CEP itself
>> * Per language (processing one at a time):
>>   * Identify 3 PMC members willing to take on the governance role for the language driver
>>   * Identify 2 contributors who are active on a given driver and stepping forward for a committer role on the driver
>>   * Vote on inclusion of that language driver in the project + commit bits
>>   * Integrate that driver into the project ecosystem (build, CI, docs, etc.)
>>
>> Not sure how else we could handle committers / contributors / PMC members other than on a per-driver basis.
>>
>> On Tue, May 30, 2023, at 5:36 AM, Mick Semb Wever wrote:
>>>
>>> Thank you so much Jeremy and Greg (+others) for all the hard work on this.
>>>
>>>
>>> At this point, we'd like to propose CEP-8 for consideration, starting the process to accept the DataStax Java driver as an official ASF project.
>>>
>>>
>>> Is the vote for the CEP to be for all drivers, but we will introduce each driver one by one? What determines when we are comfortable with one driver subproject and can move on to accepting the next?
>>>
>>> Are there key committers and contributors on each driver that want to be involved? Should they be listed before the vote?
>>> We also need three PMC members for the new subproject. Are we to assign these before the vote?
>>>
>>>
>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Brandon Williams <dr...@gmail.com>.
On Sun, Jul 16, 2023 at 11:47 PM Berenguer Blasi
<be...@gmail.com> wrote:
> one q that came up during the review: What should we do if we find a markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:
>
> A. That is a mistake/bug. It makes no sense when localDeletionTime can't already go any further than year 2106. We should reject/fail, maybe log, and add an upgrade note.
I think creation of doomstones is always a bug, but perhaps there is a
use case I cannot think of. One option that was discussed is setting
a default for the maximum_timestamp_fail_threshold which I think could
make sense, since it would provide protection but allow a way out.
> B. That was supported, regardless of how weird it may be. Cap it to the current max year 4254, maybe log and add an upgrade note.
I am not a fan of doing something other than what we were asked to do,
I think we should either reject it, or do it.
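For reference, the guardrail Brandon mentions lives in cassandra.yaml. A minimal sketch of what setting a default could look like — key names and the duration format are as I recall them from the 4.1-era guardrails, so verify against your version's cassandra.yaml before relying on them:

```yaml
# Hypothetical cassandra.yaml fragment: reject writes whose timestamp is more
# than the given duration ahead of the coordinator's clock. A sane default here
# would also catch absurd markForDeleteAt values (e.g. year 4254+) at write time.
maximum_timestamp_warn_threshold: 1d
maximum_timestamp_fail_threshold: 10d
```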
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi All,
one q that came up during the review: What should we do if we find a
markForDeleteAt (mfda) using the MSByte? That is, a mfda beyond year 4254:
A. That is a mistake/bug. It makes no sense when localDeletionTime can't
already go any further than year 2106. We should reject/fail, maybe log
and add an upgrade note.
B. That was supported, regardless of how weird it may be. Cap it to the
current max year 4254, maybe log and add an upgrade note.
Happy to hear your thoughts.
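Option A above amounts to a mask check on the most significant byte. A rough sketch, with illustrative names only (not the actual Cassandra code), and assuming sentinels such as LIVE are handled before this check:

```java
// Hypothetical sketch of option A: fail fast on a markForDeleteAt (mfda)
// whose most significant byte is non-zero, i.e. beyond what 7 bytes encode.
public class MfdaValidation
{
    // 2^56 - 1: the largest value representable in 7 bytes (~year 4254 in µs since epoch)
    static final long MAX_7_BYTE_MFDA = (1L << 56) - 1;

    static long validateMfda(long markedForDeleteAt)
    {
        // Any bit set in the top byte means the timestamp is beyond the
        // supported range; per option A we treat that as corruption.
        if ((markedForDeleteAt >>> 56) != 0)
            throw new IllegalStateException("markedForDeleteAt beyond supported range: " + markedForDeleteAt);
        return markedForDeleteAt;
    }

    public static void main(String[] args)
    {
        // A 2023-era timestamp in microseconds passes the check
        System.out.println(validateMfda(1_688_000_000_000_000L));
    }
}
```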
On 5/7/23 7:05, Berenguer Blasi wrote:
>
> Hi All,
>
> https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review
> and the PR is quite small
>
> Regards
>
> On 3/7/23 11:03, Berenguer Blasi wrote:
>>
>> Thanks for the comments Benedict. Given
>> DeletionTime.localDeletionTime is what caps everything to year 2106
>> (uint encoded now) I am ok with a DeletionTime.markForDeleteAt that
>> can go up to year 4254, personal opinion of course.
>>
>> And yes I hope once I read, doc and understand the sstable format
>> better I can look into your suggestion and anything else I come across.
>>
>> On 3/7/23 9:46, Benedict wrote:
>>> I checked and I’m pretty sure we do, but it doesn’t apply any
>>> liveness optimisation. I had misunderstood the optimisation you
>>> proposed. Ideally we would encode any non-live timestamp with the
>>> delta offset, but since that’s a distinct optimisation perhaps that
>>> can be left to another patch.
>>>
>>> Are we happy, though, that the two different deletion time
>>> serialisers can store different ranges of timestamp? Both are large
>>> ranges, but I am not 100% comfortable with them diverging.
>>>
>>>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> I can look into it. I don't have a deep knowledge of the sstable
>>>> format hence why I wanted to document it someday. But DeletionTime
>>>> is being serialized in other places as well iirc and I doubt
>>>> (finger in the air) we'll have that Epoch handy.
>>>>
>>>> On 29/6/23 17:22, Benedict wrote:
>>>>> So I’m just taking a quick peek at SerializationHeader and we
>>>>> already have a method for reading and writing a deletion time with
>>>>> offsets from EncodingStats.
>>>>>
>>>>> So perhaps we simply have a bug where we are using DeletionTime
>>>>> Serializer instead of SerializationHeader.writeLocalDeletionTime?
>>>>> It looks to me like this is already available at most (perhaps
>>>>> all) of the relevant call sites.
>>>>>
>>>>>
>>>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>>>
>>>>>>
>>>>>>> I would prefer we not plan on two distinct changes to this
>>>>>> I agree with this sentiment, /*and*/
>>>>>>
>>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>> People are going to use 5.0 for a while. Better to have an
>>>>>> improvement in their hands for that duration than no improvement
>>>>>> at all IMO. Justifies the cost of the double implementation and
>>>>>> transitions to me.
>>>>>>
>>>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>>>
>>>>>>> Just for completeness the change is a handful of LOC. The rest
>>>>>>> is added tests and we'd lose the sstable format change
>>>>>>> opportunity window.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>>>
>>>>>>> (If you have time for the other, or someone else does, then the
>>>>>>> technically superior approach should win)
>>>>>>>
>>>>>>>
>>>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi All,
https://issues.apache.org/jira/browse/CASSANDRA-18648 up for review and
the PR is quite small
Regards
On 3/7/23 11:03, Berenguer Blasi wrote:
>
> Thanks for the comments Benedict. Given DeletionTime.localDeletionTime
> is what caps everything to year 2106 (uint encoded now) I am ok with
> a DeletionTime.markForDeleteAt that can go up to year 4254, personal
> opinion of course.
>
> And yes I hope once I read, doc and understand the sstable format
> better I can look into your suggestion and anything else I come across.
>
> On 3/7/23 9:46, Benedict wrote:
>> I checked and I’m pretty sure we do, but it doesn’t apply any
>> liveness optimisation. I had misunderstood the optimisation you
>> proposed. Ideally we would encode any non-live timestamp with the
>> delta offset, but since that’s a distinct optimisation perhaps that
>> can be left to another patch.
>>
>> Are we happy, though, that the two different deletion time
>> serialisers can store different ranges of timestamp? Both are large
>> ranges, but I am not 100% comfortable with them diverging.
>>
>>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> I can look into it. I don't have a deep knowledge of the sstable
>>> format hence why I wanted to document it someday. But DeletionTime
>>> is being serialized in other places as well iirc and I doubt (finger
>>> in the air) we'll have that Epoch handy.
>>>
>>> On 29/6/23 17:22, Benedict wrote:
>>>> So I’m just taking a quick peek at SerializationHeader and we
>>>> already have a method for reading and writing a deletion time with
>>>> offsets from EncodingStats.
>>>>
>>>> So perhaps we simply have a bug where we are using DeletionTime
>>>> Serializer instead of SerializationHeader.writeLocalDeletionTime?
>>>> It looks to me like this is already available at most (perhaps all)
>>>> of the relevant call sites.
>>>>
>>>>
>>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>>
>>>>>
>>>>>> I would prefer we not plan on two distinct changes to this
>>>>> I agree with this sentiment, /*and*/
>>>>>
>>>>>> +1, if you have time for this approach and no other in this window.
>>>>> People are going to use 5.0 for a while. Better to have an
>>>>> improvement in their hands for that duration than no improvement
>>>>> at all IMO. Justifies the cost of the double implementation and
>>>>> transitions to me.
>>>>>
>>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>>
>>>>>> Just for completeness the change is a handful of LOC. The rest
>>>>>> is added tests and we'd lose the sstable format change
>>>>>> opportunity window.
>>>>>>
>>>>>>
>>>>>>
>>>>>> +1, if you have time for this approach and no other in this window.
>>>>>>
>>>>>> (If you have time for the other, or someone else does, then the
>>>>>> technically superior approach should win)
>>>>>>
>>>>>>
>>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Thanks for the comments Benedict. Given DeletionTime.localDeletionTime
is what caps everything to year 2106 (uint encoded now) I am ok with a
DeletionTime.markForDeleteAt that can go up to year 4254, personal
opinion of course.
And yes I hope once I read, doc and understand the sstable format better
I can look into your suggestion and anything else I come across.
On 3/7/23 9:46, Benedict wrote:
> I checked and I’m pretty sure we do, but it doesn’t apply any liveness
> optimisation. I had misunderstood the optimisation you proposed.
> Ideally we would encode any non-live timestamp with the delta offset,
> but since that’s a distinct optimisation perhaps that can be left to
> another patch.
>
> Are we happy, though, that the two different deletion time serialisers
> can store different ranges of timestamp? Both are large ranges, but I
> am not 100% comfortable with them diverging.
>
>> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com>
>> wrote:
>>
>>
>>
>> I can look into it. I don't have a deep knowledge of the sstable
>> format hence why I wanted to document it someday. But DeletionTime is
>> being serialized in other places as well iirc and I doubt (finger in
>> the air) we'll have that Epoch handy.
>>
>> On 29/6/23 17:22, Benedict wrote:
>>> So I’m just taking a quick peek at SerializationHeader and we
>>> already have a method for reading and writing a deletion time with
>>> offsets from EncodingStats.
>>>
>>> So perhaps we simply have a bug where we are using DeletionTime
>>> Serializer instead of SerializationHeader.writeLocalDeletionTime? It
>>> looks to me like this is already available at most (perhaps all) of
>>> the relevant call sites.
>>>
>>>
>>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>>
>>>>
>>>>> I would prefer we not plan on two distinct changes to this
>>>> I agree with this sentiment, /*and*/
>>>>
>>>>> +1, if you have time for this approach and no other in this window.
>>>> People are going to use 5.0 for a while. Better to have an
>>>> improvement in their hands for that duration than no improvement at
>>>> all IMO. Justifies the cost of the double implementation and
>>>> transitions to me.
>>>>
>>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>>
>>>>> Just for completeness the change is a handful of LOC. The rest is
>>>>> added tests and we'd lose the sstable format change
>>>>> opportunity window.
>>>>>
>>>>>
>>>>>
>>>>> +1, if you have time for this approach and no other in this window.
>>>>>
>>>>> (If you have time for the other, or someone else does, then the
>>>>> technically superior approach should win)
>>>>>
>>>>>
>>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
I checked and I’m pretty sure we do, but it doesn’t apply any liveness
optimisation. I had misunderstood the optimisation you proposed. Ideally we
would encode any non-live timestamp with the delta offset, but since that’s a
distinct optimisation perhaps that can be left to another patch.
Are we happy, though, that the two different deletion time serialisers can
store different ranges of timestamp? Both are large ranges, but I am not 100%
comfortable with them diverging.
> On 3 Jul 2023, at 05:45, Berenguer Blasi <be...@gmail.com> wrote:
>
> I can look into it. I don't have a deep knowledge of the sstable format
> hence why I wanted to document it someday. But DeletionTime is being
> serialized in other places as well iirc and I doubt (finger in the air)
> we'll have that Epoch handy.
>
> On 29/6/23 17:22, Benedict wrote:
>> So I’m just taking a quick peek at SerializationHeader and we already have
>> a method for reading and writing a deletion time with offsets from
>> EncodingStats.
>>
>> So perhaps we simply have a bug where we are using DeletionTime Serializer
>> instead of SerializationHeader.writeLocalDeletionTime? It looks to me like
>> this is already available at most (perhaps all) of the relevant call sites.
>>
>>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>>
>>>> I would prefer we not plan on two distinct changes to this
>>> I agree with this sentiment, *and*
>>>
>>>> +1, if you have time for this approach and no other in this window.
>>> People are going to use 5.0 for a while. Better to have an improvement in
>>> their hands for that duration than no improvement at all IMO. Justifies the
>>> cost of the double implementation and transitions to me.
>>>
>>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>>> Just for completeness the change is a handful of LOC. The rest is added
>>>>> tests and we'd lose the sstable format change opportunity window.
>>>>
>>>> +1, if you have time for this approach and no other in this window.
>>>>
>>>> (If you have time for the other, or someone else does, then the
>>>> technically superior approach should win)
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
I can look into it. I don't have a deep knowledge of the sstable format
hence why I wanted to document it someday. But DeletionTime is being
serialized in other places as well iirc and I doubt (finger in the air)
we'll have that Epoch handy.
On 29/6/23 17:22, Benedict wrote:
> So I’m just taking a quick peek at SerializationHeader and we already
> have a method for reading and writing a deletion time with offsets
> from EncodingStats.
>
> So perhaps we simply have a bug where we are using DeletionTime
> Serializer instead of SerializationHeader.writeLocalDeletionTime? It
> looks to me like this is already available at most (perhaps all) of
> the relevant call sites.
>
>
>> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>>
>>
>>> I would prefer we not plan on two distinct changes to this
>> I agree with this sentiment, /*and*/
>>
>>> +1, if you have time for this approach and no other in this window.
>> People are going to use 5.0 for a while. Better to have an improvement
>> in their hands for that duration than no improvement at all IMO.
>> Justifies the cost of the double implementation and transitions to me.
>>
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>>>
>>> Just for completeness the change is a handful of LOC. The rest is
>>> added tests and we'd lose the sstable format change opportunity
>>> window.
>>>
>>>
>>>
>>> +1, if you have time for this approach and no other in this window.
>>>
>>> (If you have time for the other, or someone else does, then the
>>> technically superior approach should win)
>>>
>>>
>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
So I’m just taking a quick peek at SerializationHeader and we already have a method for reading and writing a deletion time with offsets from EncodingStats.
So perhaps we simply have a bug where we are using DeletionTime Serializer instead of SerializationHeader.writeLocalDeletionTime? It looks to me like this is already available at most (perhaps all) of the relevant call sites.
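The offset idea described here — serializing a deletion timestamp as a delta from a per-sstable minimum (the kind of minimum EncodingStats tracks) so typical values shrink to a byte or two — can be sketched roughly as follows. This is an illustration of the technique only, not Cassandra's actual VIntCoding/serializer API:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch: delta-encode a timestamp against a per-sstable minimum
// and write the delta as an unsigned base-128 varint.
public class DeltaVIntSketch
{
    // Minimal unsigned varint: 7 data bits per byte, high bit marks continuation.
    static byte[] writeUnsignedVInt(long v)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0)
        {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    static byte[] writeDeltaTimestamp(long timestamp, long sstableMinTimestamp)
    {
        // Assumes timestamp >= sstableMinTimestamp, which a per-sstable
        // minimum of the kind EncodingStats keeps is meant to guarantee.
        return writeUnsignedVInt(timestamp - sstableMinTimestamp);
    }

    public static void main(String[] args)
    {
        long min = 1_688_000_000_000_000L; // hypothetical per-sstable minimum (µs)
        long ts  = min + 5_000_000L;       // a timestamp 5 seconds later
        System.out.println(writeDeltaTimestamp(ts, min).length); // 4 bytes instead of a fixed 8
    }
}
```

Timestamps close to the minimum (the common case within one sstable) encode in 1–2 bytes, which is where the "commonly compress to a single byte or so" expectation quoted later in the thread comes from.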
> On 29 Jun 2023, at 15:53, Josh McKenzie <jm...@apache.org> wrote:
>
>
>>
>> I would prefer we not plan on two distinct changes to this
> I agree with this sentiment, and
>
>> +1, if you have time for this approach and no other in this window.
> People are going to use 5.0 for a while. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
>
>> On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of LOC. The rest is added tests and we'd lose the sstable format change opportunity window.
>>
>>
>>
>> +1, if you have time for this approach and no other in this window.
>>
>> (If you have time for the other, or someone else does, then the technically superior approach should win)
>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Josh McKenzie <jm...@apache.org>.
> I would prefer we not plan on two distinct changes to this
I agree with this sentiment, **and**
> +1, if you have time for this approach and no other in this window.
People are going to use 5.0 for a while. Better to have an improvement in their hands for that duration than no improvement at all IMO. Justifies the cost of the double implementation and transitions to me.
On Tue, Jun 27, 2023, at 5:43 AM, Mick Semb Wever wrote:
>> Just for completeness the change is a handful of LOC. The rest is added tests and we'd lose the sstable format change opportunity window.
>>
>
>
> +1, if you have time for this approach and no other in this window.
>
> (If you have time for the other, or someone else does, then the technically superior approach should win)
>
>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Mick Semb Wever <mc...@apache.org>.
>
> Just for completeness the change is a handful of LOC. The rest is added tests
> and we'd lose the sstable format change opportunity window.
>
+1, if you have time for this approach and no other in this window.
(If you have time for the other, or someone else does, then the technically
superior approach should win)
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Just for completeness the change is a handful of LOC. The rest is added
tests and we'd lose the sstable format change opportunity window.
Thx again for the replies.
On 26/6/23 9:33, Benedict wrote:
> I would prefer we not plan on two distinct changes to this,
> particularly when neither change is particularly more complex than the
> other. There is a modest cost to maintenance from changing this
> multiple times.
>
> But if others feel strongly otherwise I won’t stand in the way.
>
>> On 26 Jun 2023, at 05:49, Berenguer Blasi <be...@gmail.com>
>> wrote:
>>
>>
>>
>> Thanks for the replies.
>>
>> I intend to javadoc the sstable format in detail someday and more
>> improvements might come up then, along with the vint encoding mentioned
>> here. But unless somebody volunteers to do that in 5.0, is anybody
>> against me trying to get the original proposal (1-byte flags for sentinel
>> values) in?
>>
>> Regards
>>
>>
>>> Distant future people will not be happy about this, I can already
>>> tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>> background thread.
>>
>> LOL
>>
>>
>>
>>
>> On 23/6/23 15:44, Josh McKenzie wrote:
>>>> If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to
>>>> a single byte or so.
>>> +1 to this approach.
>>>
>>>> Distant future people will not be happy about this, I can already
>>>> tell you now.
>>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>>> background thread.
>>>
>>> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>>>> It's a possibility. Though I haven't coded and benchmarked such an
>>>> approach and I don't think I would have the time before the freeze to
>>>> take advantage of the sstable format change opportunity.
>>>>
>>>> Still, it's something that can be explored later. If we can shave a few
>>>> extra
>>>> % then that would always be great imo.
>>>>
>>>> On 23/6/23 13:57, Benedict wrote:
>>>> > If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to
>>>> a single byte or so.
>>>> >
>>>> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com>
>>>> wrote:
>>>> >>
>>>> >> Distant future people will not be happy about this, I can
>>>> already tell you now.
>>>> >>
>>>> >> Sounds like a reasonable improvement to me however.
>>>> >>
>>>> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi
>>>> <be...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi all,
>>>> >>>
>>>> >>> DeletionTime.markedForDeleteAt is a long of microseconds since Unix
>>>> Epoch. But I noticed that with 7 bytes we can already encode ~2284
>>>> years. We can either shed the 8th byte, for reduced IO and disk, or
>>>> encode some sentinel values (such as LIVE) as flags there. That
>>>> would mean reading and writing 1 byte instead of 12 (8 mfda long +
>>>> 4 ldts int). Yes, we already avoid serializing DeletionTime (DT) in
>>>> sstables at _row_ level entirely, but not at _partition_ level, and
>>>> it is also serialized at index, metadata, etc.
>>>> >>>
>>>> >>> So here's a POC:
>>>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some
>>>> jmh (1) to evaluate the impact of the new alg (2). It's tested here
>>>> against 70% and 30% LIVE DT ratios to see how we perform:
>>>> >>>
>>>> >>> [java] Benchmark                               (liveDTPcParam) (sstableParam) Mode Cnt Score   Error Units
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>> >>>
>>>> >>> That was ByteBuffer backed to test the extra bit level
>>>> operations impact. But what would be the impact of an end to end
>>>> test against disk?
>>>> >>>
>>>> >>> [java] Benchmark                                  (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score       Error      Units
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive NC avgt 15  605236.515 ±  19929.058 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive OA avgt 15  586477.039 ±   7384.632 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive NC avgt 15  937580.311 ±  30669.647 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive OA avgt 15  914097.770 ±   9865.070 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ±  37879.012 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15  805256.345 ±  15471.587 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ±  50104.245 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ±  64342.510 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  70PcLive NC avgt 15  295711.217 ±   5432.507 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  70PcLive OA avgt 15  305282.827 ±   1906.841 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  30PcLive NC avgt 15  446029.899 ±   4038.938 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   RAM  30PcLive OA avgt 15  479085.875 ±  10032.804 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 70PcLive OA avgt 15  589752.861 ±  31676.265 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT   Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>>> >>>
>>>> >>> We can see big improvements when backed by disk, and
>>>> little impact from the new alg.
>>>> >>>
>>>> >>> Given we're already introducing a new sstable format (OA) in
>>>> 5.0 I would like to try to get this in before the freeze. The point
>>>> being that sstables with lots of small partitions would benefit
>>>> from a smaller DT at partition level. My tests show a 3%-4% size
>>>> reduction on disk.
>>>> >>>
>>>> >>> Before proceeding though, I'd like to bounce the idea off the
>>>> community: are there corner cases or scenarios I might have
>>>> missed where this could be a problem?
>>>> >>>
>>>> >>> Thx in advance!
>>>> >>>
>>>> >>>
>>>> >>> (1)
>>>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>>> >>>
>>>> >>> (2)
>>>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>>> >>>
>>>>
>>>
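As a sanity check of the 7-byte claim in the quoted proposal, and a sketch of the 1-byte LIVE flag idea: the constants below are illustrative (FLAG_LIVE is a hypothetical flag value, not Cassandra's actual encoding).

```java
// Back-of-the-envelope check: how far do 7 bytes of microseconds since the
// Unix Epoch reach? Values needing the 8th byte are beyond year ~4254.
public class SevenByteMfda
{
    static final long MAX_7_BYTE_MICROS = (1L << 56) - 1;           // largest 7-byte value
    static final long MICROS_PER_YEAR = 31_556_952L * 1_000_000L;   // avg Gregorian year in µs

    // Sketch of the sentinel idea: a LIVE DeletionTime could be written as a
    // single flag byte instead of 8 (mfda long) + 4 (ldts int) bytes.
    static final int FLAG_LIVE = 0x01; // hypothetical flag value

    public static void main(String[] args)
    {
        long maxYear = 1970 + MAX_7_BYTE_MICROS / MICROS_PER_YEAR;
        System.out.println(maxYear); // prints 4253, matching the ~2284-year claim
    }
}
```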
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
I would prefer we not plan on two distinct changes to this, particularly when
neither change is particularly more complex than the other. There is a modest
cost to maintenance from changing this multiple times.
But if others feel strongly otherwise I won’t stand in the way.
> On 26 Jun 2023, at 05:49, Berenguer Blasi <be...@gmail.com> wrote:
>
> Thanks for the replies.
>
> I intend to javadoc the sstable format in detail someday and more
> improvements might come up then, along with the vint encoding mentioned here.
> But unless somebody volunteers to do that in 5.0, is anybody against me
> trying to get the original proposal (1-byte flags for sentinel values) in?
>
> Regards
>
>> Distant future people will not be happy about this, I can already tell you now.
> Eh, they'll all be AI's anyway and will just rewrite the code in a
> background thread.
>
> LOL
>
> On 23/6/23 15:44, Josh McKenzie wrote:
>>> If we’re doing this, why don’t we delta encode a vint from some
>>> per-sstable minimum value? I’d expect that to commonly compress to a
>>> single byte or so.
>> +1 to this approach.
>>
>>> Distant future people will not be happy about this, I can already tell you now.
>> Eh, they'll all be AI's anyway and will just rewrite the code in a
>> background thread.
>>
>> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>>> It's a possibility. Though I haven't coded and benchmarked such an
>>> approach and I don't think I would have the time before the freeze to
>>> take advantage of the sstable format change opportunity.
>>>
>>> Still, it's something that can be explored later. If we can shave a few
>>> extra % then that would always be great imo.
>>>
>>> On 23/6/23 13:57, Benedict wrote:
>>>> If we’re doing this, why don’t we delta encode a vint from some
>>>> per-sstable minimum value? I’d expect that to commonly compress to a
>>>> single byte or so.
>>>>
>>>>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>>>>>
>>>>> Distant future people will not be happy about this, I can already tell you now.
>>>>>
>>>>> Sounds like a reasonable improvement to me however.
>>>>>
>>>>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> DeletionTime.markedForDeleteAt is a long of microseconds since Unix Epoch.
>>>>>> But I noticed that with 7 bytes we can already encode ~2284 years. We can
>>>>>> either shed the 8th byte, for reduced IO and disk, or encode some sentinel
>>>>>> values (such as LIVE) as flags there. That would mean reading and writing 1
>>>>>> byte instead of 12 (8 mfda long + 4 ldts int). Yes, we already avoid
>>>>>> serializing DeletionTime (DT) in sstables at _row_ level entirely, but not
>>>>>> at _partition_ level, and it is also serialized at index, metadata, etc.
>>>>>>
>>>>>> So here's a POC:
>>>>>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1)
>>>>>> to evaluate the impact of the new alg (2). It's tested here against 70% and
>>>>>> 30% LIVE DT ratios to see how we perform:
>>>>>>
>>>>>> [java] Benchmark                               (liveDTPcParam) (sstableParam) Mode Cnt Score   Error Units
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testRawAlgReads  30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>>>>
>>>>>> That was ByteBuffer backed to test the extra bit level operations
>>>>>> impact. But what would be the impact of an end to end test against disk?
>>>>>>
>>>>>> [java] Benchmark                                  (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score      Error     Units
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive NC avgt 15  605236.515 ± 19929.058 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  70PcLive OA avgt 15  586477.039 ±  7384.632 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive NC avgt 15  937580.311 ± 30669.647 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM  30PcLive OA avgt 15  914097.770 ±  9865.070 ns/op
>>>>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>
>>>
>>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/o
>
>>>
>>> >>>
>
>>>
>>> >>> We can see big improvements when backed with the disk and little
impact from the new alg.
>
>>>
>>> >>>
>
>>>
>>> >>> Given we're already introducing a new sstable format (OA) in 5.0 I
would like to try to get this in before the freeze. The point being that
sstables with lots of small partitions would benefit from a smaller DT at
partition level. My tests show a 3%-4% size reduction on disk.
>
>>>
>>> >>>
>
>>>
>>> >>> Before proceeding though I'd like to bounce the idea against the
community for all the corner cases and scenarios I might have missed where
this could be a problem?
>
>>>
>>> >>>
>
>>>
>>> >>> Thx in advance!
>
>>>
>>> >>>
>
>>>
>>> >>>
>
>>>
>>> >>> (1) <https://github.com/bereng/cassandra/blob/ldtdeser-
trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java>
>
>>>
>>> >>>
>
>>>
>>> >>> (2) <https://github.com/bereng/cassandra/blob/ldtdeser-
trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212>
>
>>>
>>> >>>
>
>>>
>>>
>
>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Thanks for the replies.
I intend to javadoc the sstable format in detail someday and more
improvements might come up then, along with the vint encoding mentioned here.
But unless somebody volunteers to do that in 5.0, does anybody object to my
trying to get the original proposal (1 byte of flags for sentinel values) in?
Regards
> Distant future people will not be happy about this, I can already tell
> you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a
background thread.
LOL
On 23/6/23 15:44, Josh McKenzie wrote:
>> If we’re doing this, why don’t we delta encode a vint from some
>> per-sstable minimum value? I’d expect that to commonly compress to a
>> single byte or so.
> +1 to this approach.
>
>> Distant future people will not be happy about this, I can already
>> tell you now.
> Eh, they'll all be AI's anyway and will just rewrite the code in a
> background thread.
>
> On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
>> It's a possibility. Though I haven't coded and benchmarked such an
>> approach and I don't think I would have the time before the freeze to
>> take advantage of the sstable format change opportunity.
>>
>> Still it's something that can be explored later. If we can shave a few extra
>> % then that would always be great imo.
>>
>> On 23/6/23 13:57, Benedict wrote:
>> > If we’re doing this, why don’t we delta encode a vint from some
>> per-sstable minimum value? I’d expect that to commonly compress to a
>> single byte or so.
>> >
>> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com>
>> wrote:
>> >>
>> >> Distant future people will not be happy about this, I can already
>> tell you now.
>> >>
>> >> Sounds like a reasonable improvement to me however.
>> >>
>> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi
>> <be...@gmail.com> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix
>> Epoch. But I noticed that with 7 bytes we can already encode ~2284
>> years. We can either shed the 8th byte, for reduced IO and disk, or
>> can encode some sentinel values (such as LIVE) as flags there. That
>> would mean reading and writing 1 byte instead of 12 (8 mfda long + 4
>> ldts int). Yes we already avoid serializing DeletionTime (DT) in
>> sstables at _row_ level entirely but not at _partition_ level and it
>> is also serialized at index, metadata, etc.
>> >>>
>> >>> So here's a POC:
>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some
>> jmh (1) to evaluate the impact of the new alg (2). It's tested here
>> against a 70% and a 30% LIVE DTs to see how we perform:
>> >>>
>> >>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt
>> Score Error Units
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC
>> avgt 15 0.331 ± 0.001 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA
>> avgt 15 0.335 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC
>> avgt 15 0.334 ± 0.002 ns/op
>> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA
>> avgt 15 0.340 ± 0.008 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC
>> avgt 15 0.337 ± 0.006 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA
>> avgt 15 0.340 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC
>> avgt 15 0.339 ± 0.004 ns/op
>> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA
>> avgt 15 0.343 ± 0.016 ns/op
>> >>>
>> >>> That was ByteBuffer backed to test the extra bit level operations
>> impact. But what would be the impact of an end to end test against disk?
>> >>>
>> >>> [java] Benchmark (diskRAMParam) (liveDTPcParam)
>> (sstableParam) Mode Cnt Score Error Units
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM
>> 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk
>> 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 70PcLive OA avgt 15 805256.345 ±
>> 15471.587 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 30PcLive NC avgt 15 1583239.011 ±
>> 50104.245 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT
>> Disk 30PcLive OA avgt 15 1439605.006 ±
>> 64342.510 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT
>> RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM
>> 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT
>> Disk 70PcLive OA avgt 15 589752.861 ±
>> 31676.265 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk
>> 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>> >>>
>> >>> We can see big improvements when backed with the disk and little
>> impact from the new alg.
>> >>>
>> >>> Given we're already introducing a new sstable format (OA) in 5.0
>> I would like to try to get this in before the freeze. The point being
>> that sstables with lots of small partitions would benefit from a
>> smaller DT at partition level. My tests show a 3%-4% size reduction
>> on disk.
>> >>>
>> >>> Before proceeding though I'd like to bounce the idea against the
>> community for all the corner cases and scenarios I might have missed
>> where this could be a problem?
>> >>>
>> >>> Thx in advance!
>> >>>
>> >>>
>> >>> (1)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>> >>>
>> >>> (2)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>> >>>
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Josh McKenzie <jm...@apache.org>.
> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.
> Distant future people will not be happy about this, I can already tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.
On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an
> approach and I don't think I would have the time before the freeze to
> take advantage of the sstable format change opportunity.
>
> Still it's something that can be explored later. If we can shave a few extra
> % then that would always be great imo.
>
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
> >>>
> >>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
> >>>
> >>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
> >>>
> >>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
> >>>
> >>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
> >>>
> >>> We can see big improvements when backed with the disk and little impact from the new alg.
> >>>
> >>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
> >>>
> >>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
> >>>
> >>> Thx in advance!
> >>>
> >>>
> >>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
> >>>
> >>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
> >>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
It's a possibility. Though I haven't coded and benchmarked such an
approach and I don't think I would have the time before the freeze to
take advantage of the sstable format change opportunity.
Still it's something that can be explored later. If we can shave a few extra
% then that would always be great imo.
On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>>
>> Distant future people will not be happy about this, I can already tell you now.
>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>>>
>>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>>>
>>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>>
>>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>>>
>>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>>
>>> We can see big improvements when backed with the disk and little impact from the new alg.
>>>
>>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>>>
>>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>>>
>>> Thx in advance!
>>>
>>>
>>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>>
>>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Benedict <be...@apache.org>.
If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
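The delta idea above can be sketched roughly like this. This is an illustrative sketch only: the class and method names are hypothetical, and it uses a hand-rolled LEB128-style varint rather than Cassandra's actual VIntCoding API. The sstable metadata would record the minimum markedForDeleteAt seen, and each partition would store only an unsigned vint delta from that minimum, which for clustered timestamps commonly fits in one byte.

```java
// Rough sketch of delta + varint encoding (hypothetical; not Cassandra's VIntCoding).
import java.io.*;

public class DeltaVintSketch
{
    // Minimal LEB128-style unsigned varint writer: 7 data bits per byte,
    // high bit set on all bytes except the last.
    static void writeUnsignedVInt(long v, DataOutput out) throws IOException
    {
        while ((v & ~0x7FL) != 0)
        {
            out.writeByte((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.writeByte((int) v);
    }

    static long readUnsignedVInt(DataInput in) throws IOException
    {
        long v = 0;
        int shift = 0, b;
        do
        {
            b = in.readByte() & 0xFF;
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        while ((b & 0x80) != 0);
        return v;
    }

    // Serialize markedForDeleteAt as an unsigned delta from the per-sstable minimum.
    static void serializeDelta(long mfda, long sstableMin, DataOutput out) throws IOException
    {
        writeUnsignedVInt(mfda - sstableMin, out);
    }

    static long deserializeDelta(long sstableMin, DataInput in) throws IOException
    {
        return sstableMin + readUnsignedVInt(in);
    }
}
```

A value 100 microseconds above the sstable minimum would serialize to a single byte under this scheme, versus the fixed 8 bytes today.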
> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <al...@apple.com> wrote:
>
> Distant future people will not be happy about this, I can already tell you now.
>
> Sounds like a reasonable improvement to me however.
>
>> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>>
>> Hi all,
>>
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>>
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>>
>> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>>
>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>>
>> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>>
>> We can see big improvements when backed with the disk and little impact from the new alg.
>>
>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>>
>> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>>
>> Thx in advance!
>>
>>
>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>
>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>>
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Aleksey Yeshchenko <al...@apple.com>.
Distant future people will not be happy about this, I can already tell you now.
Sounds like a reasonable improvement to me however.
> On 23 Jun 2023, at 07:22, Berenguer Blasi <be...@gmail.com> wrote:
>
> Hi all,
>
> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or can encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely but not at _partition_ level and it is also serialized at index, metadata, etc.
>
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against a 70% and a 30% LIVE DTs to see how we perform:
>
> [java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
>
> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>
> [java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
>
> We can see big improvements when backed with the disk and little impact from the new alg.
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>
> Before proceeding though I'd like to bounce the idea against the community for all the corner cases and scenarios I might have missed where this could be a problem?
>
> Thx in advance!
>
>
> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>
> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
>
Re: Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
The idea is 11 bytes less per LIVE partition. So small partitions will
benefit the most.
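For illustration, the arithmetic behind that claim: a LIVE partition-level
DeletionTime shrinks from 12 serialized bytes (8-byte markedForDeleteAt long
+ 4-byte localDeletionTime int) to a single flag byte, so each LIVE
partition saves 11 bytes. The helper below is mine, not part of the POC.

```java
// Quick arithmetic behind the "11 bytes less per LIVE partition" claim
// (hypothetical helper, not part of the actual POC).
public class SavingsMath
{
    public static long savedBytes(long livePartitions)
    {
        int oldSize = Long.BYTES + Integer.BYTES; // 8 + 4 = 12 bytes today
        int newSize = 1;                          // single flag byte for LIVE
        return livePartitions * (oldSize - newSize);
    }
}
```

So an sstable with a million small LIVE partitions would shed roughly 11 MB of partition-level deletion metadata, which is where the observed 3%-4% comes from.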
On 29/6/23 18:44, Brandon Williams wrote:
> On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa <jj...@gmail.com> wrote:
>> 3-4% reduction on disk ... for what exactly?
>>
>> It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
> If the data is TTL'd I think it's not entirely uncommon.
>
> Kind Regards,
> Brandon
Re: Improved DeletionTime serialization to reduce disk size
Posted by Brandon Williams <dr...@gmail.com>.
On Thu, Jun 29, 2023 at 11:42 AM Jeff Jirsa <jj...@gmail.com> wrote:
> 3-4% reduction on disk ... for what exactly?
>
> It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
If the data is TTL'd I think it's not entirely uncommon.
Kind Regards,
Brandon
Re: Improved DeletionTime serialization to reduce disk size
Posted by Jeff Jirsa <jj...@gmail.com>.
On Thu, Jun 22, 2023 at 11:23 PM Berenguer Blasi <be...@gmail.com>
wrote:
> Hi all,
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would
> like to try to get this in before the freeze. The point being that
> sstables with lots of small partitions would benefit from a smaller DT
> at partition level. My tests show a 3%-4% size reduction on disk.
>
3-4% reduction on disk ... for what exactly?
It seems exceptionally uncommon to have 3% of your data SIZE be tombstones.
Is this enhancement driven by a pathological data model that's like "mostly
tiny records OR tombstones" ?
Improved DeletionTime serialization to reduce disk size
Posted by Berenguer Blasi <be...@gmail.com>.
Hi all,
DeletionTime.markedForDeleteAt is a long holding microseconds since the
Unix Epoch, but I noticed that 7 bytes are already enough to encode
~2284 years. We can either shed the 8th byte for reduced IO and disk
usage, or encode sentinel values (such as LIVE) as flags there. That
would mean reading and writing 1 byte instead of 12 (the 8-byte mfda
long + the 4-byte ldts int). Yes, we already avoid serializing
DeletionTime (DT) entirely at _row_ level in sstables, but not at
_partition_ level, and it is also serialized in the index, metadata, etc.
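As a rough illustration of the idea (a simplified sketch, not the actual code in the POC branch; the flag value and byte layout here are assumptions): a LIVE DeletionTime collapses to a single sentinel byte, while a non-LIVE one keeps its 12 bytes, with the top byte of the old 8-byte slot repurposed as a flag byte and only the low 7 bytes of markedForDeleteAt serialized:

```java
import java.nio.ByteBuffer;

// Simplified sketch of flag-byte DeletionTime serialization.
// Assumed layout (not the actual POC code):
//   LIVE:     1 flag byte with the high bit set            -> 1 byte total
//   non-LIVE: 1 flag byte + 7-byte mfda + 4-byte ldts int  -> 12 bytes total
public final class CompactDeletionTime
{
    static final int LIVE_FLAG = 0x80;            // high bit set => LIVE sentinel
    static final long MFDA_MASK = (1L << 56) - 1; // low 7 bytes: ~2284 years of usec

    public static void serialize(long markedForDeleteAt, int localDeletionTime,
                                 boolean live, ByteBuffer out)
    {
        if (live)
        {
            out.put((byte) LIVE_FLAG);            // 1 byte instead of 12
            return;
        }
        out.put((byte) 0);                        // flag byte, high bit clear
        long mfda = markedForDeleteAt & MFDA_MASK;
        for (int shift = 48; shift >= 0; shift -= 8)
            out.put((byte) (mfda >>> shift));     // low 7 bytes, big-endian
        out.putInt(localDeletionTime);
    }

    /** Returns {markedForDeleteAt, localDeletionTime}; sentinel values for LIVE. */
    public static long[] deserialize(ByteBuffer in)
    {
        int flags = in.get() & 0xFF;
        if ((flags & LIVE_FLAG) != 0)
            return new long[] { Long.MIN_VALUE, Integer.MAX_VALUE }; // LIVE
        long mfda = 0;
        for (int i = 0; i < 7; i++)
            mfda = (mfda << 8) | (in.get() & 0xFF);
        return new long[] { mfda, in.getInt() };
    }
}
```

Note that under this sketch non-LIVE partitions stay at 12 bytes; the 11-byte saving applies only to LIVE ones, which is why sstables full of small live partitions benefit the most.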
So here's a POC:
https://github.com/bereng/cassandra/commits/ldtdeser-trunk along with a
JMH benchmark (1) to evaluate the impact of the new algorithm (2). It is
tested against 70% and 30% LIVE DT ratios to see how we perform:
[java] Benchmark (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
[java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive NC avgt 15 0.331 ± 0.001 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 70PcLive OA avgt 15 0.335 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive NC avgt 15 0.334 ± 0.002 ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads 30PcLive OA avgt 15 0.340 ± 0.008 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive NC avgt 15 0.337 ± 0.006 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 70PcLive OA avgt 15 0.340 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive NC avgt 15 0.339 ± 0.004 ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites 30PcLive OA avgt 15 0.343 ± 0.016 ns/op
That was ByteBuffer-backed, to measure the impact of the extra bit-level
operations. But what would be the impact of an end-to-end test against disk?
[java] Benchmark (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt Score Error Units
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive NC avgt 15 605236.515 ± 19929.058 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 70PcLive OA avgt 15 586477.039 ± 7384.632 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive NC avgt 15 937580.311 ± 30669.647 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT RAM 30PcLive OA avgt 15 914097.770 ± 9865.070 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive NC avgt 15 1314417.207 ± 37879.012 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 70PcLive OA avgt 15 805256.345 ± 15471.587 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive NC avgt 15 1583239.011 ± 50104.245 ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT Disk 30PcLive OA avgt 15 1439605.006 ± 64342.510 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive NC avgt 15 295711.217 ± 5432.507 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 70PcLive OA avgt 15 305282.827 ± 1906.841 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive NC avgt 15 446029.899 ± 4038.938 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT RAM 30PcLive OA avgt 15 479085.875 ± 10032.804 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive NC avgt 15 1789434.838 ± 206455.771 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 70PcLive OA avgt 15 589752.861 ± 31676.265 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive NC avgt 15 1754862.122 ± 164903.051 ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT Disk 30PcLive OA avgt 15 1252162.253 ± 121626.818 ns/op
We can see big improvements in the disk-backed runs and little impact
from the new algorithm.
Given we're already introducing a new sstable format (OA) in 5.0 I would
like to try to get this in before the freeze. The point being that
sstables with lots of small partitions would benefit from a smaller DT
at partition level. My tests show a 3%-4% size reduction on disk.
Before proceeding, though, I'd like to bounce the idea off the
community: are there corner cases or scenarios I might have missed
where this could be a problem?
Thx in advance!
(1)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
(2)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212