Posted to user@cassandra.apache.org by Michal Budzyn <mi...@gmail.com> on 2014/09/10 17:48:39 UTC

Storage: upsert vs. delete + insert

Is there any significant difference in disk and memory usage between an
upsert and a delete followed by an insert?

e.g. 2 vs 2A + 2B.

PK ((key), version, c1)

1. INSERT INTO A (key, version, c1, val) VALUES (1, 1, 4711, 'X1');
...
2. INSERT INTO A (key, version, c1, val) VALUES (1, 1, 4711, 'X2');
vs.
2A. DELETE FROM A WHERE key = 1 AND version = 1 AND c1 = 4711;
2B. INSERT INTO A (key, version, c1, val) VALUES (1, 1, 4711, 'X2');

Re: Storage: upsert vs. delete + insert

Posted by graham sanderson <gr...@vast.com>.
agreed

On Sep 10, 2014, at 3:27 PM, olek.stasiak@gmail.com wrote:

> You're right, there is no data in tombstone, only a column name. So
> there is only small overhead of disk size after delete. But i must
> agree with post above, it's pointless in deleting prior to inserting.
> Moreover, it needs one op more to compute resulting row.
> cheers,
> Olek
> 
> 2014-09-10 22:18 GMT+02:00 graham sanderson <gr...@vast.com>:
>> delete inserts a tombstone which is likely smaller than the original record (though still (currently) has overhead of cost for full key/column name
>> the data for the insert after a delete would be identical to the data if you just inserted/updated
>> 
>> no real benefit I can think of for doing the delete first.
>> 
>> On Sep 10, 2014, at 2:25 PM, olek.stasiak@gmail.com wrote:
>> 
>>> I think so.
>>> this is how i see it:
>>> on the very beginning you have such line in datafile:
>>> {key: [col_name, col_value, date_of_last_change]} //something similar,
>>> i don't remember now
>>> 
>>> after delete you're adding line:
>>> {key:[col_name, last_col_value, date_of_delete, 'd']} //this d
>>> indicates that field is deleted
>>> after insert the following line is added:
>>> {key: [col_name, col_value, date_of_insert]}
>>> so delete and then insert generates 2 lines in datafile.
>>> 
>>> after pure insert (upsert in fact) you will have only one line
>>> {key: [col_name, col_value, date_of_insert]}
>>> So, summarizing, in second scenario you have only one line, in first: two.
>>> I hope my post is correct ;)
>>> regards,
>>> Olek
>>> 
>>> 2014-09-10 18:56 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>>>> Would the factor before compaction be always 2 ?
>>>> 
>>>> On Wed, Sep 10, 2014 at 6:38 PM, olek.stasiak@gmail.com
>>>> <ol...@gmail.com> wrote:
>>>>> 
>>>>> IMHO, delete then insert will take two times more disk space then
>>>>> single insert. But after compaction the difference will disappear.
>>>>> This was true in version prior to 2.0, but it should still work this
>>>>> way. But maybe someone will correct me, if i'm wrong.
>>>>> Cheers,
>>>>> Olek
>>>>> 
>>>>> 2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>>>>>> One insert would be much better e.g. for performance and network
>>>>>> latency.
>>>>>> I wanted to know if there is a significant difference (apart from
>>>>>> additional
>>>>>> commit log entry) in the used storage between these 2 use cases.
>>>>>> 
>>>> 
>>>> 
>> 


Re: Storage: upsert vs. delete + insert

Posted by "olek.stasiak@gmail.com" <ol...@gmail.com>.
You're right, a tombstone holds no data, only the column name, so a
delete adds only a small disk overhead. But I agree with the post
above: deleting before inserting is pointless. Moreover, it takes one
more operation to compute the resulting row.
cheers,
Olek

2014-09-10 22:18 GMT+02:00 graham sanderson <gr...@vast.com>:
> delete inserts a tombstone which is likely smaller than the original record (though still (currently) has overhead of cost for full key/column name
> the data for the insert after a delete would be identical to the data if you just inserted/updated
>
> no real benefit I can think of for doing the delete first.
>
> On Sep 10, 2014, at 2:25 PM, olek.stasiak@gmail.com wrote:
>
>> I think so.
>> this is how i see it:
>> on the very beginning you have such line in datafile:
>> {key: [col_name, col_value, date_of_last_change]} //something similar,
>> i don't remember now
>>
>> after delete you're adding line:
>> {key:[col_name, last_col_value, date_of_delete, 'd']} //this d
>> indicates that field is deleted
>> after insert the following line is added:
>> {key: [col_name, col_value, date_of_insert]}
>> so delete and then insert generates 2 lines in datafile.
>>
>> after pure insert (upsert in fact) you will have only one line
>> {key: [col_name, col_value, date_of_insert]}
>> So, summarizing, in second scenario you have only one line, in first: two.
>> I hope my post is correct ;)
>> regards,
>> Olek
>>
>> 2014-09-10 18:56 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>>> Would the factor before compaction be always 2 ?
>>>
>>> On Wed, Sep 10, 2014 at 6:38 PM, olek.stasiak@gmail.com
>>> <ol...@gmail.com> wrote:
>>>>
>>>> IMHO, delete then insert will take two times more disk space then
>>>> single insert. But after compaction the difference will disappear.
>>>> This was true in version prior to 2.0, but it should still work this
>>>> way. But maybe someone will correct me, if i'm wrong.
>>>> Cheers,
>>>> Olek
>>>>
>>>> 2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>>>>> One insert would be much better e.g. for performance and network
>>>>> latency.
>>>>> I wanted to know if there is a significant difference (apart from
>>>>> additional
>>>>> commit log entry) in the used storage between these 2 use cases.
>>>>>
>>>
>>>
>

Re: Storage: upsert vs. delete + insert

Posted by graham sanderson <gr...@vast.com>.
A delete inserts a tombstone, which is likely smaller than the original record (though it currently still carries the overhead of the full key/column name).
The data for the insert after a delete would be identical to the data if you had just inserted/updated.

no real benefit I can think of for doing the delete first.
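
To put rough numbers on "smaller than the original record", here is a small sketch. The byte counts are hypothetical and do not reflect Cassandra's real on-disk encoding; the only assumption taken from this thread is that a tombstone carries the key/column name overhead but no value.

```python
def size_factor(overhead_bytes: int, value_bytes: int) -> float:
    """Ratio of bytes written by delete + insert to bytes written by one upsert.

    overhead_bytes: hypothetical per-entry cost (key, column name, timestamp).
    value_bytes: size of the column value; a tombstone carries no value.
    """
    upsert = overhead_bytes + value_bytes
    delete_then_insert = overhead_bytes + upsert  # tombstone + new cell
    return delete_then_insert / upsert


# Illustrative sizes only: 32 bytes of overhead, growing value sizes.
for value_bytes in (0, 32, 1024):
    print(value_bytes, round(size_factor(32, value_bytes), 2))
```

Under these assumptions the factor is 2 only when the value is empty, and it approaches 1 as the value grows relative to the key/column overhead.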

On Sep 10, 2014, at 2:25 PM, olek.stasiak@gmail.com wrote:

> I think so.
> this is how i see it:
> on the very beginning you have such line in datafile:
> {key: [col_name, col_value, date_of_last_change]} //something similar,
> i don't remember now
> 
> after delete you're adding line:
> {key:[col_name, last_col_value, date_of_delete, 'd']} //this d
> indicates that field is deleted
> after insert the following line is added:
> {key: [col_name, col_value, date_of_insert]}
> so delete and then insert generates 2 lines in datafile.
> 
> after pure insert (upsert in fact) you will have only one line
> {key: [col_name, col_value, date_of_insert]}
> So, summarizing, in second scenario you have only one line, in first: two.
> I hope my post is correct ;)
> regards,
> Olek
> 
> 2014-09-10 18:56 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>> Would the factor before compaction be always 2 ?
>> 
>> On Wed, Sep 10, 2014 at 6:38 PM, olek.stasiak@gmail.com
>> <ol...@gmail.com> wrote:
>>> 
>>> IMHO, delete then insert will take two times more disk space then
>>> single insert. But after compaction the difference will disappear.
>>> This was true in version prior to 2.0, but it should still work this
>>> way. But maybe someone will correct me, if i'm wrong.
>>> Cheers,
>>> Olek
>>> 
>>> 2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>>>> One insert would be much better e.g. for performance and network
>>>> latency.
>>>> I wanted to know if there is a significant difference (apart from
>>>> additional
>>>> commit log entry) in the used storage between these 2 use cases.
>>>> 
>> 
>> 


Re: Storage: upsert vs. delete + insert

Posted by "olek.stasiak@gmail.com" <ol...@gmail.com>.
I think so.
This is how I see it:
At the very beginning you have a line like this in the data file:
{key: [col_name, col_value, date_of_last_change]} // something similar,
I don't remember exactly

After the delete, this line is added:
{key: [col_name, last_col_value, date_of_delete, 'd']} // the 'd'
indicates that the field is deleted
After the insert, the following line is added:
{key: [col_name, col_value, date_of_insert]}
So a delete followed by an insert generates 2 lines in the data file.

After a pure insert (an upsert, in fact) you will have only one line:
{key: [col_name, col_value, date_of_insert]}
So, to summarize: the second scenario adds only one line, the first adds two.
I hope my post is correct ;)
regards,
Olek
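
The same idea as a toy model, for anyone who prefers code. This is only a sketch of an append-only data file with last-write-wins compaction. The `Entry` and `compact` names are made up for illustration, and it is not how Cassandra actually lays out SSTables (real tombstones are also kept until gc_grace_seconds has passed).

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    key: tuple            # (key, version, c1), the primary key from the question
    value: Optional[str]  # None marks a tombstone
    timestamp: int        # write time; the newest entry wins


def compact(entries):
    """Keep only the newest entry per key, then drop leftover tombstones."""
    newest = {}
    for entry in entries:
        current = newest.get(entry.key)
        if current is None or entry.timestamp >= current.timestamp:
            newest[entry.key] = entry
    return [e for e in newest.values() if e.value is not None]


PK = (1, 1, 4711)

# Scenario "2": plain upsert over the existing row.
upsert = [Entry(PK, "X1", 1), Entry(PK, "X2", 2)]

# Scenario "2A + 2B": delete (tombstone), then insert.
delete_insert = [Entry(PK, "X1", 1), Entry(PK, None, 2), Entry(PK, "X2", 3)]

print(len(upsert), len(delete_insert))  # entries before compaction
print([e.value for e in compact(upsert)])
print([e.value for e in compact(delete_insert)])
```

Before compaction the delete + insert path holds one extra entry (the tombstone); after compaction both paths leave a single entry with the value X2.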

2014-09-10 18:56 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
> Would the factor before compaction be always 2 ?
>
> On Wed, Sep 10, 2014 at 6:38 PM, olek.stasiak@gmail.com
> <ol...@gmail.com> wrote:
>>
>> IMHO, delete then insert will take two times more disk space then
>> single insert. But after compaction the difference will disappear.
>> This was true in version prior to 2.0, but it should still work this
>> way. But maybe someone will correct me, if i'm wrong.
>> Cheers,
>> Olek
>>
>> 2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
>> > One insert would be much better e.g. for performance and network
>> > latency.
>> > I wanted to know if there is a significant difference (apart from
>> > additional
>> > commit log entry) in the used storage between these 2 use cases.
>> >
>
>

Re: Storage: upsert vs. delete + insert

Posted by Michal Budzyn <mi...@gmail.com>.
Would the factor before compaction always be 2?

On Wed, Sep 10, 2014 at 6:38 PM, olek.stasiak@gmail.com <
olek.stasiak@gmail.com> wrote:

> IMHO, delete then insert will take two times more disk space then
> single insert. But after compaction the difference will disappear.
> This was true in version prior to 2.0, but it should still work this
> way. But maybe someone will correct me, if i'm wrong.
> Cheers,
> Olek
>
> 2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
> > One insert would be much better e.g. for performance and network latency.
> > I wanted to know if there is a significant difference (apart from
> additional
> > commit log entry) in the used storage between these 2 use cases.
> >
>

Re: Storage: upsert vs. delete + insert

Posted by "olek.stasiak@gmail.com" <ol...@gmail.com>.
IMHO, delete then insert will take twice as much disk space as a
single insert, but after compaction the difference will disappear.
This was true in versions prior to 2.0, and it should still work this
way. But maybe someone will correct me if I'm wrong.
Cheers,
Olek

2014-09-10 18:30 GMT+02:00 Michal Budzyn <mi...@gmail.com>:
> One insert would be much better e.g. for performance and network latency.
> I wanted to know if there is a significant difference (apart from additional
> commit log entry) in the used storage between these 2 use cases.
>

Re: Storage: upsert vs. delete + insert

Posted by Michal Budzyn <mi...@gmail.com>.
One insert would be much better, e.g. for performance and network latency.
I wanted to know if there is a significant difference (apart from the
additional commit log entry) in the used storage between these 2 use cases.

Re: Storage: upsert vs. delete + insert

Posted by Shane Hansen <sh...@gmail.com>.
My understanding is that an update is the same as an insert, so I would
think delete+insert is a bad idea. Also, insert+delete would put 2 entries
in the commit log.
On Sep 10, 2014 9:49 AM, "Michal Budzyn" <mi...@gmail.com> wrote:

> Is there any serious difference in the used disk and memory storage
> between upsert and delete + insert ?
>
> e.g. 2 vs 2A + 2B.
>
> PK ((key), version, c1)
>
> 1. INSERT INTO A (key , version , c1, val) values (1, 1, 4711, “X1”)
> ...
> 2. INSERT INTO A (key , version , c1, val) values (1, 1, 4711, “X2”)
> Vs.
> 2A. DELETE FROM A WHERE key=1 AND version = 1 AND c1=4711
> 2B. INSERT INTO A (key , version , c1, values) values (1, 1,  4711, “X2”)
>
>