Posted to user@cassandra.apache.org by "Durity, Sean R" <SE...@homedepot.com> on 2018/12/27 20:21:24 UTC

RE: [EXTERNAL] Howto avoid tombstones when inserting NULL values

You say the events are incremental updates. I am interpreting this to mean only some columns are updated. Others should keep their original values.

You are correct that inserting null creates a tombstone.

Can you only insert the columns that actually have new values? Just skip the columns with no information. (Make the insert generator a bit smarter.)

CREATE TABLE happening (id text PRIMARY KEY, event text, a text, b text, c text);
INSERT INTO happening (id, event, a, b, c) VALUES ('MainEvent', 'The most complete info we have right now', 'Priceless', '10 pm', 'Grand Ballroom');
-- b changes
INSERT INTO happening (id, b) VALUES ('MainEvent', '9:30 pm');
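
A "smarter insert generator" along these lines could be sketched as follows. This is only an illustration: the helper name build_sparse_insert is hypothetical, the "?" placeholders assume the statement will be prepared by a driver, and no driver call is made here. Columns that are left out keep whatever value they already had, so this approach cannot clear a column.

```python
from typing import Any, Dict, List, Tuple


def build_sparse_insert(table: str, row: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Return (cql, params) naming only the columns whose value is not None.

    Hypothetical helper: it only builds the statement text and the bind
    values; executing it is left to whatever driver is in use.
    """
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())


# Only id and b carry values, so only those two columns are written.
cql, params = build_sparse_insert(
    "happening",
    {"id": "MainEvent", "event": None, "a": None, "b": "9:30 pm", "c": None},
)
print(cql)
print(params)
```

Each distinct combination of columns yields a different statement text, so a caller that prepares statements would probably want to cache them by that text.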


Sean Durity


-----Original Message-----
From: Tomas Bartalos <to...@gmail.com>
Sent: Thursday, December 27, 2018 9:27 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Hello,

I’ll start by describing my use case and how I’d like to use Cassandra to solve my storage needs.
We're processing a stream of events for various happenings. Every event has a unique happening_id.
One happening may have many events, usually ~ 20-100 events. I’d like to store only the latest event for the same happening (an event is an incremental update and contains all up-to-date data about the happening).
Technically, the events are streamed from Kafka, processed with Spark, and saved to Cassandra.
In Cassandra we use upserts (insert with the same primary key). So far so good; however, here comes the tombstone...

When I insert a field with a NULL value, Cassandra creates a tombstone for this field. As I understand it, this is for space efficiency: Cassandra doesn’t have to remember that there is a NULL value; she just deletes the respective column, and a delete creates a ... tombstone.
I was hoping there could be an option to tell Cassandra not to be so space-efficient and to store “unset” info without generating tombstones.
Something similar to inserting empty strings instead of null values:

CREATE TABLE happening (id text PRIMARY KEY, event text);
INSERT INTO happening (id, event) VALUES ('1', 'event1');
-- tombstone is generated
INSERT INTO happening (id, event) VALUES ('1', null);
-- tombstone is not generated
INSERT INTO happening (id, event) VALUES ('1', '');

Possible solutions:
1. Disable tombstones with gc_grace_seconds = 0, or set it to a reasonably low value (1 hour?). Not good, since phantom data may reappear.
2. Ignore NULLs on the Spark side with “spark.cassandra.output.ignoreNulls=true”. Not good, since this will never overwrite a previously inserted event field with an “empty” one.
3. On inserts with Spark, find all NULL values and replace them with an “empty” equivalent (empty string for text, 0 for integer). Very inefficient, and it is problematic to find an “empty” equivalent for some data types.

Until tombstones appeared, Cassandra was the right fit for our use case; however, now I’m not sure we’re heading in the right direction.
Could you please give me some advice on how to solve this problem?

Thank you,
Tomas
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org


________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.


Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
> I’m still not sure if having tombstones vs. empty values / frozen UDTs
> will have the same results.

When in doubt, benchmark.

Good luck,
Jon

On Wed, Jan 9, 2019 at 3:02 PM Tomas Bartalos <to...@gmail.com>
wrote:

> Losing atomic updates is a good point, but in my use case it’s not a
> problem, since I always overwrite the whole record (no partial updates).
>
> I’m still not sure if having tombstones vs. empty values / frozen UDTs
> will have the same results.
> When I update one row with 10 null columns it will create 10 tombstones.
> We do OLAP processing of data stored in Cassandra with Spark.
>
> When Spark requests a range of data, let’s say 1000 rows, I can easily hit
> the 10 000 tombstones threshold.
>
> Even if I did not hit the error threshold, Spark requests would increase
> heap pressure, because tombstones have to be collected and returned to
> the coordinator.
>
> Are my assumptions correct?
>
> On 4 Jan 2019, at 21:15, DuyHai Doan <do...@gmail.com> wrote:
>
> The idea of storing your data as a single blob can be dangerous.
>
> Indeed, you lose the ability to perform an atomic update on each column.
>
> In Cassandra, LWW is the rule. Suppose there are 2 concurrent updates on
> the same row: the 1st update changes column Firstname (let's say it's a
> Person record) and the 2nd update changes column Lastname.
>
> Now, depending on the timestamps of the 2 updates, you'll have:
>
> - old Firstname, new Lastname
> - new Firstname, old Lastname
>
> Having atomic updates on columns guarantees that you end up with new
> Firstname, new Lastname.
>
> On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad <jo...@jonhaddad.com> wrote:
>
>> Those are two different cases though.  It *sounds like* (again, I may be
>> missing the point) you're trying to overwrite a value with another value.
>> You're either going to serialize a blob and overwrite a single cell, or
>> you're going to overwrite all the cells and include a tombstone.
>>
>> When you do a read, reading a single tombstone vs a single value is
>> essentially the same thing, performance-wise.
>>
>> In your description you said "~ 20-100 events", and you're overwriting
>> the event each time, so I don't know how you get to 10K tombstones either.
>> Compaction will bring multiple tombstones together for a cell in the same
>> way it compacts multiple values for a single cell.
>>
>> It sounds to me like you're taking some advice about tombstones out of
>> context and trying to apply the advice to a different problem.  Again, I
>> might be misunderstanding what you're doing.
>>
>>
>> On Fri, Jan 4, 2019 at 10:49 AM Tomas Bartalos <to...@gmail.com>
>> wrote:
>>
>>> Hello Jon,
>>>
>>> I thought having tombstones was a much higher overhead than just
>>> overwriting values. The compaction overhead can be similar, but I think
>>> the read performance is much worse.
>>>
>>> Tombstones accumulate and hang around for 10 days (by default) before
>>> they are eligible for compaction.
>>>
>>> Also we have tombstone warning and error thresholds. If Cassandra scans
>>> more than 10 000 tombstones, she will abort the query.
>>>
>>> According to this article:
>>> https://opencredo.com/blogs/cassandra-tombstones-common-issues/
>>>
>>> "The cassandra.yaml comments explain it perfectly: *“When executing a
>>> scan, within or across a partition, we need to keep the tombstones seen in
>>> memory so we can return them to the coordinator, which will use them to
>>> make sure other replicas also know about the deleted rows. With workloads
>>> that generate a lot of tombstones, this can cause performance problems and
>>> even exhaust the server heap.”*"
>>>
>>> Regards,
>>> Tomas
>>>
>>> On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad <jon@jonhaddad.com wrote:
>>>
>>>> If you're overwriting values, it really doesn't matter much if it's a
>>>> tombstone or any other value, they still need to be compacted and have the
>>>> same overhead at read time.
>>>>
>>>> Tombstones are problematic when you try to use Cassandra as a queue (or
>>>> something like a queue) and you need to scan over thousands of tombstones
>>>> in order to get to the real data.  You're simply overwriting a row and
>>>> trying to avoid a single tombstone.
>>>>
>>>> Maybe I'm missing something here.  Why do you think overwriting a
>>>> single cell with a tombstone is any worse than overwriting a single cell
>>>> with a value?
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos <to...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I believe your approach is the same as using Spark with
>>>>> "spark.cassandra.output.ignoreNulls=true".
>>>>> This will not cover the situation when a value has to be overwritten
>>>>> with null.
>>>>>
>>>>> I found one possible solution: change the schema to keep only the
>>>>> primary key fields and move all other fields into a frozen UDT.
>>>>> create table (year, month, day, id, frozen<Event>, primary key((year,
>>>>> month, day), id) )
>>>>> This way, anything that is null inside the event doesn't create a
>>>>> tombstone, since the event is serialized to a BLOB.
>>>>> The penalty is the need to deserialize the whole Event when selecting
>>>>> only a few columns.
>>>>> Can anyone confirm whether this is a good solution performance-wise?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> On Fri, 4 Jan 2019, 2:20 pm DuyHai Doan <doanduyhai@gmail.com wrote:
>>>>>
>>>>>> "The problem is I can't know the combination of set/unset values" -->
>>>>>> Just for this requirement, Achilles has had a working solution for many
>>>>>> years using the INSERT_NOT_NULL_FIELDS strategy:
>>>>>>
>>>>>> https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy
>>>>>>
>>>>>> Or you can use the Update API, which by design only performs updates on
>>>>>> non-null fields:
>>>>>> https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity
>>>>>>
>>>>>>
>>>>>> Behind the scenes, for each new combination of INSERT INTO
>>>>>> table(x,y,z) statement, Achilles will check its prepared statement cache
>>>>>> and, if the statement does not exist yet, create a new prepared statement
>>>>>> and put it into the cache for later reuse.
>>>>>>
>>>>>> Disclaimer: I'm the creator of Achilles
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <
>>>>>> tomas.bartalos@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> The problem is I can't know the combination of set/unset values.
>>>>>>> From my perspective every value should be set. The event from Kafka
>>>>>>> represents the complete state of the happening at a certain point in
>>>>>>> time. In my table I want to store the latest event, i.e. the most
>>>>>>> recent state of the happening (in this table I don't care about the
>>>>>>> history). Actually, I used the wrong expression, since it's just the
>>>>>>> opposite of an "incremental update": every event carries all data
>>>>>>> (state) for a specific point in time.
>>>>>>>
>>>>>>> The event is represented as a nested JSON structure. Top-level
>>>>>>> elements of the JSON are table fields with types like text, boolean,
>>>>>>> timestamp, and list, and the nested elements are UDT fields.
>>>>>>>
>>>>>>> Simplified example:
>>>>>>> There is a new purchase for the happening, event:
>>>>>>> {total_amount: 50, items : [A, B, C, new_item], purchase_time :
>>>>>>> '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...}
>>>>>>> I don't know what actually happened for this event: maybe a new item
>>>>>>> was purchased, maybe some customer info has been changed, maybe the
>>>>>>> specials have been revoked and I have to reset them. I just need to
>>>>>>> store the state as it arrived from Kafka; there might already be an
>>>>>>> event saved for this happening, or maybe this is the first one.
>>>>>>>
>>>>>>> BR,
>>>>>>> Tomas
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <mightye@gmail.com wrote:
>>>>>>>
>>>>>>>> Depending on the use case, creating separate prepared statements
>>>>>>>> for each combination of set / unset values in large INSERT/UPDATE
>>>>>>>> statements may be prohibitive.
>>>>>>>>
>>>>>>>> Instead, you can look into driver level support for UNSET values.
>>>>>>>> Requires Cassandra 2.2 or later IIRC.
>>>>>>>>
>>>>>>>> See:
>>>>>>>> Java Driver:
>>>>>>>> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
>>>>>>>> Python Driver:
>>>>>>>> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
>>>>>>>> Node Driver:
>>>>>>>> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
>>>>>>>>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Tomas Bartalos <to...@gmail.com>.
Losing atomic updates is a good point, but in my use case it’s not a problem, since I always overwrite the whole record (no partial updates).

I’m still not sure if having tombstones vs. empty values / frozen UDTs will have the same results.
When I update one row with 10 null columns it will create 10 tombstones.
We do OLAP processing of data stored in Cassandra with Spark.

When Spark requests a range of data, let’s say 1000 rows, I can easily hit the 10 000 tombstones threshold.

Even if I did not hit the error threshold, Spark requests would increase heap pressure, because tombstones have to be collected and returned to the coordinator.

Are my assumptions correct?
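
The arithmetic behind that estimate can be written out as a small sketch. The function names are hypothetical, and the 10 000 figure is simply the threshold quoted in this thread; the value that actually applies is whatever the cluster has configured for the tombstone thresholds in cassandra.yaml, so it should be checked rather than assumed.

```python
def cell_tombstones(rows: int, null_columns_per_row: int) -> int:
    """Cell tombstones a range read touches if every row holds its own null cells."""
    return rows * null_columns_per_row


def rows_needed_to_exceed(threshold: int, null_columns_per_row: int) -> int:
    """Smallest row count whose cell tombstones are strictly above the threshold."""
    return threshold // null_columns_per_row + 1


# Threshold as quoted in the thread; an assumption, not a verified default.
THRESHOLD = 10_000

scanned = cell_tombstones(rows=1000, null_columns_per_row=10)
print(scanned)                               # 10000
print(scanned > THRESHOLD)                   # False: exactly at the limit
print(rows_needed_to_exceed(THRESHOLD, 10))  # 1001
```

This counts distinct cells only. Repeated null writes to the same cell are merged by compaction, so the count grows with the number of rows and null columns read, not with the number of overwrites.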

> On 4 Jan 2019, at 21:15, DuyHai Doan <do...@gmail.com> wrote:
> 
> The idea of storing your data as a single blob can be dangerous.
> 
> Indeed, you loose the ability to perform atomic update on each column.
> 
> In Cassandra, LWW is the rule. Suppose 2 concurrent updates on the same row, 1st update changes column Firstname (let's say it's a Person record) and 2nd update changes column Lastname
> 
> Now depending on the timestamp between the 2 updates, you'll have:
> 
> - old Firstname, new Lastname
> - new Firstname, old Lastname
> 
> having updates on columns atomically guarantees you to have new Firstname, new Lastname
> 
> On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad <jon@jonhaddad.com <ma...@jonhaddad.com>> wrote:
> Those are two different cases though.  It *sounds like* (again, I may be missing the point) you're trying to overwrite a value with another value.  You're either going to serialize a blob and overwrite a single cell, or you're going to overwrite all the cells and include a tombstone.
> 
> When you do a read, reading a single tombstone vs a single vs is essentially the same thing, performance wise.  
> 
> In your description you said "~ 20-100 events", and you're overwriting the event each time, so I don't know how you go to 10K tombstones either.  Compaction will bring multiple tombstones together for a cell in the same way it compacts multiple values for a single cell.  
> 
> I sounds to make like you're taking some advice about tombstones out of context and trying to apply the advice to a different problem.  Again, I might be misunderstanding what you're doing.
> 
> 
> On Fri, Jan 4, 2019 at 10:49 AM Tomas Bartalos <tomas.bartalos@gmail.com <ma...@gmail.com>> wrote:
> Hello Jon, 
> 
> I thought having tombstones is much higher overhead than just overwriting values. The compaction overhead can be l similar, but I think the read performance is much worse.
> 
> Tombstones accumulate and hang for 10 days (by default) before they are eligible for compaction. 
> 
> Also we have tombstone warning and error thresholds. If cassandra scans more than 10 000 tombstones, she will abort the query.
> 
> According to this article: https://opencredo.com/blogs/cassandra-tombstones-common-issues/ <https://opencredo.com/blogs/cassandra-tombstones-common-issues/>
> 
> "The cassandra.yaml comments explain in perfectly: “When executing a scan, within or across a partition, we need to keep the tombstones seen in memory so we can return them to the coordinator, which will use them to make sure other replicas also know about the deleted rows. With workloads that generate a lot of tombstones, this can cause performance problems and even exhaust the server heap. "
> 
> Regards, 
> Tomas
> 
> On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad <jon@jonhaddad.com <ma...@jonhaddad.com> wrote:
> If you're overwriting values, it really doesn't matter much if it's a tombstone or any other value, they still need to be compacted and have the same overhead at read time.  
> 
> Tombstones are problematic when you try to use Cassandra as a queue (or something like a queue) and you need to scan over thousands of tombstones in order to get to the real data.  You're simply overwriting a row and trying to avoid a single tombstone.  
> 
> Maybe I'm missing something here.  Why do you think overwriting a single cell with a tombstone is any worse than overwriting a single cell with a value?
> 
> Jon
> 
> 
> On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos <tomas.bartalos@gmail.com <ma...@gmail.com>> wrote:
> Hello,
> 
> I beleive your approach is the same as using spark with "spark.cassandra.output.ignoreNulls=true"
> This will not cover the situation when a value have to be overwriten with null. 
> 
> I found one possible solution - change the schema to keep only primary key fields and move all other fields to frozen UDT.
> create table (year, month, day, id, frozen<Event>, primary key((year, month, day), id) )
> In this way anything that is null inside event doesn't create tombstone, since event is serialized to BLOB.
> The penalty is in need of deserializing the whole Event when selecting only few columns. 
> Can anyone confirm if this is good solution performance wise?
> 
> Thank you, 
> 
> On Fri, 4 Jan 2019, 2:20 pm DuyHai Doan <doanduyhai@gmail.com <ma...@gmail.com> wrote:
> "The problem is I can't know the combination of set/unset values" --> Just for this requirement, Achilles has a working solution for many years using INSERT_NOT_NULL_FIELDS strategy:
> 
> https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy <https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy>
> 
> Or you can use the Update API that by design only perform update on not null fields: https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity <https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity>
> 
> 
> Behind the scene, for each new combination of INSERT INTO table(x,y,z) statement, Achilles will check its prepared statement cache and if the statement does not exist yet, create a new prepared statement and put it into the cache for later re-use for you
> 
> Disclaiment: I'm the creator of Achilles
> 
> 
> 
> On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <tomas.bartalos@gmail.com <ma...@gmail.com>> wrote:
> Hello,
> 
> The problem is I can't know the combination of set/unset values. From my perspective every value should be set. The event from Kafka represents the complete state of the happening at certain point in time. In my table I want to store the latest event so the most recent state of the happening (in this table I don't care about the history). Actually I used wrong expression since its just the opposite of "incremental update", every event carries all data (state) for specific point of time. 
> 
> The event is represented with nested json structure. Top level elements of the json are table fields with type like text, boolean, timestamp, list and the nested elements are UDT fields. 
> 
> Simplified example:
> There is a new purchase for the happening, event:
> {total_amount: 50, items : [A, B, C, new_item], purchase_time : '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...} 
> I don't know what actually happened for this event, maybe there is a new item purchased, maybe some customer info have been changed, maybe the specials have been revoked and I have to reset them. I just need to store the state as it artived from Kafka, there might already be an event for this happening saved before, or maybe this is the first one.
> 
> BR,
> Tomas
> 
> 
> On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <mightye@gmail.com <ma...@gmail.com> wrote:
> Depending on the use case, creating separate prepared statements for each combination of set / unset values in large INSERT/UPDATE statements may be prohibitive.  
> 
> Instead, you can look into driver level support for UNSET values.  Requires Cassandra 2.2 or later IIRC.
> 
> See:
> Java Driver: https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding <https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding>
> Python Driver: https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values <https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values>
> Node Driver: https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset <https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset>
> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R <SEAN_R_DURITY@homedepot.com <ma...@homedepot.com>> wrote:
> You say the events are incremental updates. I am interpreting this to mean only some columns are updated. Others should keep their original values.
> 
> You are correct that inserting null creates a tombstone.
> 
> Can you only insert the columns that actually have new values? Just skip the columns with no information. (Make the insert generator a bit smarter.)
> 
> Create table happening (id text primary key, event text, a text, b text, c text);
> Insert into table happening (id, event, a, b, c) values ("MainEvent","The most complete info we have right now","Priceless","10 pm","Grand Ballroom");
> -- b changes
> Insert into happening (id, b) values ("MainEvent","9:30 pm");
> 
> 
> Sean Durity
> 
> 
> -----Original Message-----
> From: Tomas Bartalos <tomas.bartalos@gmail.com <ma...@gmail.com>>
> Sent: Thursday, December 27, 2018 9:27 AM
> To: user@cassandra.apache.org <ma...@cassandra.apache.org>
> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
> 
> Hello,
> 
> I’d start with describing my use case and how I’d like to use Cassandra to solve my storage needs.
> We're processing a stream of events for various happenings. Every event have a unique happening_id.
> One happening may have many events, usually ~ 20-100 events. I’d like to store only the latest event for the same happening (Event is an incremental update and it contains all up-to date data about happening).
> Technically the events are streamed from Kafka, processed with Spark an saved to Cassandra.
> In Cassandra we use upserts (insert with same primary key).  So far so good, however there comes the tombstone...
> 
> When I’m inserting field with NULL value, Cassandra creates tombstone for this field. As I understood this is due to space efficiency, Cassandra doesn’t have to remember there is a NULL value, she just deletes the respective column and a delete creates a ... tombstone.
> I was hoping there could be an option to tell Cassandra not to be so space effective and store “unset" info without generating tombstones.
> Something similar to inserting empty strings instead of null values:
> 
> CREATE TABLE happening (id text PRIMARY KEY, event text); insert into happening (‘1’, ‘event1’); — tombstone is generated insert into happening (‘1’, null); — tombstone is not generated insert into happening (‘1’, '’);
> 
> Possible solutions:
> 1. Disable tombstones with gc_grace_seconds = 0 or set to reasonable low value (1 hour ?) . Not good, since phantom data may re-appear 2. ignore NULLs on spark side with “spark.cassandra.output.ignoreNulls=true”. Not good since this will never overwrite previously inserted event field with “empty” one.
> 3. On inserts with spark, find all NULL values and replace them with “empty” equivalent (empty string for text, 0 for integer). Very inefficient and problematic to find “empty” equivalent for some data types.
> 
> Until tombstones appeared Cassandra was the right fit for our use case, however now I’m not sure if we’re heading the right direction.
> Could you please give me some advice how to solve this problem ?
> 
> Thank you,
> Tomas
> 
> -- 
> Jon Haddad
> http://www.rustyrazorblade.com <http://www.rustyrazorblade.com/>
> twitter: rustyrazorblade
> 


Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by DuyHai Doan <do...@gmail.com>.
The idea of storing your data as a single blob can be dangerous.

Indeed, you lose the ability to perform atomic updates on each column.

In Cassandra, LWW (last write wins) is the rule. Suppose 2 concurrent
updates on the same row: the 1st update changes column Firstname (let's say
it's a Person record) and the 2nd update changes column Lastname.

With a single blob, depending on the timestamps of the 2 updates, you'll
have either:

- old Firstname, new Lastname
- new Firstname, old Lastname

whereas updating the columns individually guarantees that you end up with
new Firstname, new Lastname.
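
To make the difference concrete, here is a small simulation of last-write-wins on the two layouts. It is only a sketch in plain Python, not Cassandra or driver code: the `lww_write` helper, the column names and the integer timestamps are all made up for illustration, and it assumes both writers read the old record before writing.

```python
# Minimal last-write-wins (LWW) simulation; this is not Cassandra code.
# Each cell is stored as (write_timestamp, value); the higher timestamp wins.

def lww_write(row, column, timestamp, value):
    """Apply a write to `row`, keeping the cell with the newest timestamp."""
    current = row.get(column)
    if current is None or timestamp > current[0]:
        row[column] = (timestamp, value)

# Layout 1: one cell per column. Two writers touch different columns.
per_column = {"firstname": (1, "old_first"), "lastname": (1, "old_last")}
lww_write(per_column, "firstname", 2, "new_first")  # writer A
lww_write(per_column, "lastname", 3, "new_last")    # writer B

# Layout 2: the whole record in one cell. Both writers read the old record,
# change one field, and write the full record back.
blob = {"person": (1, {"firstname": "old_first", "lastname": "old_last"})}
snapshot = blob["person"][1]
lww_write(blob, "person", 2, {**snapshot, "firstname": "new_first"})  # writer A
lww_write(blob, "person", 3, {**snapshot, "lastname": "new_last"})    # writer B

columns_result = {name: cell[1] for name, cell in per_column.items()}
blob_result = blob["person"][1]
print(columns_result)
print(blob_result)
```

With separate cells both changes survive; with the blob, writer B's record replaces writer A's wholesale, so the new Firstname is lost.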

On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad <jo...@jonhaddad.com> wrote:

> Those are two different cases though.  It *sounds like* (again, I may be
> missing the point) you're trying to overwrite a value with another value.
> You're either going to serialize a blob and overwrite a single cell, or
> you're going to overwrite all the cells and include a tombstone.
>
> When you do a read, reading a single tombstone vs a single value is
> essentially the same thing, performance wise.
>
> In your description you said "~ 20-100 events", and you're overwriting the
> event each time, so I don't know how you get to 10K tombstones either.
> Compaction will bring multiple tombstones together for a cell in the same
> way it compacts multiple values for a single cell.
>
> It sounds to me like you're taking some advice about tombstones out of
> context and trying to apply the advice to a different problem.  Again, I
> might be misunderstanding what you're doing.
>

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Those are two different cases though.  It *sounds like* (again, I may be
missing the point) you're trying to overwrite a value with another value.
You're either going to serialize a blob and overwrite a single cell, or
you're going to overwrite all the cells and include a tombstone.

When you do a read, reading a single tombstone vs a single value is
essentially the same thing, performance wise.

In your description you said "~ 20-100 events", and you're overwriting the
event each time, so I don't know how you get to 10K tombstones either.
Compaction will bring multiple tombstones together for a cell in the same
way it compacts multiple values for a single cell.

It sounds to me like you're taking some advice about tombstones out of
context and trying to apply the advice to a different problem.  Again, I
might be misunderstanding what you're doing.
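
As a rough illustration of the compaction point, here is a toy model in plain Python. It is not how Cassandra implements compaction; `compact`, the `TOMBSTONE` marker and the cell keys are invented for the sketch, and it ignores gc_grace_seconds (after which a real tombstone can be dropped entirely).

```python
# Toy model of compaction; not Cassandra code.
TOMBSTONE = object()  # marker standing in for a deleted (null) cell

def compact(sstables):
    """Merge several {cell_key: (timestamp, value)} maps, newest write wins."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged

# The same cell written three times: a value, then two null overwrites.
sstables = [
    {("happening-1", "specials"): (1, "2-for-1")},
    {("happening-1", "specials"): (2, TOMBSTONE)},
    {("happening-1", "specials"): (3, TOMBSTONE)},
]
merged = compact(sstables)
tombstone_count = sum(1 for _, value in merged.values() if value is TOMBSTONE)
print(len(merged), tombstone_count)
```

Three writes to the same cell collapse into one entry, whether the newest one is a value or a tombstone, so repeated overwrites of one row do not pile up after compaction.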


On Fri, Jan 4, 2019 at 10:49 AM Tomas Bartalos <to...@gmail.com>
wrote:

> Hello Jon,
>
> I thought tombstones have much higher overhead than just overwriting
> values. The compaction overhead can be similar, but I think the read
> performance is much worse.
>
> Tombstones accumulate and hang for 10 days (by default) before they are
> eligible for compaction.
>
> Also we have tombstone warning and error thresholds. If cassandra scans
> more than 10 000 tombstones, she will abort the query.
>
> According to this article:
> https://opencredo.com/blogs/cassandra-tombstones-common-issues/
>
> "The cassandra.yaml comments explain in perfectly: *“When executing a
> scan, within or across a partition, we need to keep the tombstones seen in
> memory so we can return them to the coordinator, which will use them to
> make sure other replicas also know about the deleted rows. With workloads
> that generate a lot of tombstones, this can cause performance problems and
> even exhaust the server heap. "*
>
> Regards,
> Tomas

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Tomas Bartalos <to...@gmail.com>.
Hello Jon,

I thought tombstones have much higher overhead than just overwriting
values. The compaction overhead can be similar, but I think the read
performance is much worse.

Tombstones accumulate and hang for 10 days (by default) before they are
eligible for compaction.

Also we have tombstone warning and error thresholds. If cassandra scans
more than 10 000 tombstones, she will abort the query.

According to this article:
https://opencredo.com/blogs/cassandra-tombstones-common-issues/

"The cassandra.yaml comments explain in perfectly: *“When executing a scan,
within or across a partition, we need to keep the tombstones seen in memory
so we can return them to the coordinator, which will use them to make sure
other replicas also know about the deleted rows. With workloads that
generate a lot of tombstones, this can cause performance problems and even
exhaust the server heap. "*
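
To show what the threshold means in practice, here is a toy read path in plain Python. It is only a sketch: `scan` and `TombstoneOverwhelming` are invented names, `None` stands in for a tombstone, and the limit of 10 000 is simply the figure used above. The real limits come from tombstone_warn_threshold and tombstone_failure_threshold in cassandra.yaml and should be checked there.

```python
# Toy model of a read that counts tombstones; not Cassandra code.
class TombstoneOverwhelming(Exception):
    """Hypothetical error raised when a scan sees too many tombstones."""

def scan(cells, failure_threshold):
    """Return live values; abort once tombstones seen exceed the threshold."""
    live, tombstones = [], 0
    for value in cells:
        if value is None:          # None stands in for a tombstone
            tombstones += 1
            if tombstones > failure_threshold:
                raise TombstoneOverwhelming(tombstones)
        else:
            live.append(value)
    return live, tombstones

# One row of 20 columns where 5 are null: 5 tombstones in the scan.
row = ["v"] * 15 + [None] * 5
live, seen = scan(row, failure_threshold=10_000)

# A queue-like partition where 10,001 deleted entries precede the live one.
queue = [None] * 10_001 + ["next-job"]
try:
    scan(queue, failure_threshold=10_000)
    aborted = False
except TombstoneOverwhelming:
    aborted = True
print(seen, aborted)
```

The count that matters is the number of tombstones a single read has to walk over, so a row with a handful of null columns stays far below the limit while a queue-like partition can exceed it.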

Regards,
Tomas

On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad <jon@jonhaddad.com wrote:

> If you're overwriting values, it really doesn't matter much if it's a
> tombstone or any other value, they still need to be compacted and have the
> same overhead at read time.
>
> Tombstones are problematic when you try to use Cassandra as a queue (or
> something like a queue) and you need to scan over thousands of tombstones
> in order to get to the real data.  You're simply overwriting a row and
> trying to avoid a single tombstone.
>
> Maybe I'm missing something here.  Why do you think overwriting a single
> cell with a tombstone is any worse than overwriting a single cell with a
> value?
>
> Jon
>
>
> On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos <to...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I believe your approach is the same as using Spark with
>> "spark.cassandra.output.ignoreNulls=true".
>> This will not cover the situation when a value has to be overwritten with
>> null.
>>
>> I found one possible solution - change the schema to keep only primary
>> key fields and move all other fields to a frozen UDT.
>> create table (year, month, day, id, frozen<Event>, primary key((year,
>> month, day), id) )
>> In this way anything that is null inside event doesn't create a tombstone,
>> since event is serialized to a BLOB.
>> The penalty is the need to deserialize the whole Event when selecting
>> only a few columns.
>> Can anyone confirm if this is a good solution performance-wise?
>>
>> Thank you,
>>
>> On Fri, 4 Jan 2019, 2:20 pm DuyHai Doan <doanduyhai@gmail.com wrote:
>>
>>> "The problem is I can't know the combination of set/unset values" -->
>>> Just for this requirement, Achilles has had a working solution for many
>>> years using the INSERT_NOT_NULL_FIELDS strategy:
>>>
>>> https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy
>>>
>>> Or you can use the Update API, which by design only performs updates on
>>> non-null fields:
>>> https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity
>>>
>>>
>>> Behind the scenes, for each new combination of INSERT INTO table(x,y,z)
>>> statement, Achilles will check its prepared statement cache and, if the
>>> statement does not exist yet, create a new prepared statement and put it
>>> into the cache for later re-use.
>>>
>>> Disclaimer: I'm the creator of Achilles
>>>
>>>
>>>
>>> On Thu, Dec 27, 2018 at 10:21 PM Tomas Bartalos <
>>> tomas.bartalos@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> The problem is I can't know the combination of set/unset values. From
>>>> my perspective every value should be set. The event from Kafka represents
>>>> the complete state of the happening at a certain point in time. In my
>>>> table I want to store the latest event, so the most recent state of the
>>>> happening (in this table I don't care about the history). Actually I used
>>>> the wrong expression, since it's just the opposite of "incremental
>>>> update": every event carries all data (state) for a specific point in time.
>>>>
>>>> The event is represented with a nested json structure. Top level elements
>>>> of the json are table fields with types like text, boolean, timestamp, list
>>>> and the nested elements are UDT fields.
>>>>
>>>> Simplified example:
>>>> There is a new purchase for the happening, event:
>>>> {total_amount: 50, items : [A, B, C, new_item], purchase_time :
>>>> '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...}
>>>> I don't know what actually happened for this event: maybe there is a
>>>> new item purchased, maybe some customer info has been changed, maybe the
>>>> specials have been revoked and I have to reset them. I just need to store
>>>> the state as it arrived from Kafka; there might already be an event for
>>>> this happening saved before, or maybe this is the first one.
>>>>
>>>> BR,
>>>> Tomas
>>>>
>>>>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
If you're overwriting values, it really doesn't matter much whether the new
cell is a tombstone or any other value: either way it still needs to be
compacted and has the same overhead at read time.

Tombstones are problematic when you try to use Cassandra as a queue (or
something like a queue) and you need to scan over thousands of tombstones
in order to get to the real data.  You're simply overwriting a row and
trying to avoid a single tombstone.

Maybe I'm missing something here.  Why do you think overwriting a single
cell with a tombstone is any worse than overwriting a single cell with a
value?

Jon



-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Tomas Bartalos <to...@gmail.com>.
Hello,

I believe your approach is the same as using Spark with
"spark.cassandra.output.ignoreNulls=true".
It does not cover the situation where a value has to be overwritten with
null.

I found one possible solution: change the schema to keep only the primary key
fields as columns and move all other fields into a frozen UDT.
create table (year, month, day, id, frozen<Event>, primary key((year,
month, day), id) )
This way, anything that is null inside the event doesn't create a tombstone,
since the whole event is serialized as a single blob.
The penalty is that the whole Event has to be deserialized even when
selecting only a few fields.
Can anyone confirm whether this is a good solution performance-wise?

Thank you,
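
To make the null-versus-skipped distinction above concrete, here is a minimal toy model in plain Python. It is not driver or Spark connector code: the names UNSET, TOMBSTONE, upsert and read are made up for illustration, and the model only assumes what the thread states, namely that writing null stores a tombstone for that cell while a skipped column is not written at all.

```python
# Toy model of per-cell upsert semantics (an illustration, not driver code).
# Assumption: a null write is stored as a tombstone marker, while a skipped
# ("unset") column is not written, so the previous value stays visible.

UNSET = object()      # hypothetical sentinel: column left out of the write
TOMBSTONE = object()  # hypothetical marker for a deleted cell


def upsert(row, event, ignore_nulls):
    """Apply one event to the stored row; return how many tombstones were written."""
    tombstones = 0
    for column, value in event.items():
        if value is None and ignore_nulls:
            value = UNSET            # mimics ignoreNulls=true: skip the column
        if value is UNSET:
            continue                 # nothing written, old value remains
        if value is None:
            row[column] = TOMBSTONE  # null overwrites the cell with a tombstone
            tombstones += 1
        else:
            row[column] = value
    return tombstones


def read(row):
    """Return the row as a reader would see it: tombstoned cells read as None."""
    return {c: (None if v is TOMBSTONE else v) for c, v in row.items()}


first = {"total_amount": 40, "specials": "2-for-1"}
second = {"total_amount": 50, "specials": None}  # specials revoked

skipped, written = {}, {}
for event in (first, second):
    upsert(skipped, event, ignore_nulls=True)
    upsert(written, event, ignore_nulls=False)

print(read(skipped))  # {'total_amount': 50, 'specials': '2-for-1'}
print(read(written))  # {'total_amount': 50, 'specials': None}
```

With nulls skipped, the revoked "specials" value survives the second event; with nulls written, it is reset at the cost of one tombstone for that cell.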


Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by DuyHai Doan <do...@gmail.com>.
"The problem is I can't know the combination of set/unset values" --> Just
for this requirement, Achilles has had a working solution for many years
using the INSERT_NOT_NULL_FIELDS strategy:

https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy

Or you can use the Update API, which by design only performs updates on
non-null fields:
https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity


Behind the scenes, for each new combination of INSERT INTO table(x,y,z)
statement, Achilles checks its prepared statement cache and, if the
statement does not exist yet, creates a new prepared statement and puts it
into the cache for later re-use.

Disclaimer: I'm the creator of Achilles
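
The caching idea described above can be sketched in a few lines of plain Python. This is not Achilles's actual implementation: InsertCache, build_insert and the `prepare` callback (a stand-in for something like a driver's prepare call) are hypothetical names, and the table and columns are borrowed from the example earlier in the thread.

```python
# Sketch of a statement cache keyed by the set of non-null columns.
# Assumptions: `prepare` stands in for a driver's statement preparation;
# every record carries its primary key, so the column list is never empty.

def build_insert(table, columns):
    """Build an INSERT statement with one bind marker per column."""
    names = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({names}) VALUES ({markers})"


class InsertCache:
    def __init__(self, table, prepare=lambda cql: cql):
        self.table = table
        self.prepare = prepare
        self.statements = {}  # column tuple -> prepared statement

    def statement_for(self, record):
        """Return (statement, bind values) covering only the non-null fields."""
        columns = tuple(sorted(c for c, v in record.items() if v is not None))
        if columns not in self.statements:
            cql = build_insert(self.table, columns)
            self.statements[columns] = self.prepare(cql)
        return self.statements[columns], [record[c] for c in columns]


cache = InsertCache("happening")
stmt, values = cache.statement_for({"id": "MainEvent", "b": "9:30 pm", "c": None})
print(stmt)    # INSERT INTO happening (b, id) VALUES (?, ?)
print(values)  # ['9:30 pm', 'MainEvent']
```

One statement is cached per distinct combination of non-null columns, so the cache can grow large for wide tables. Because null fields are simply left out, this approach does not reset a previously stored value.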




Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Tomas Bartalos <to...@gmail.com>.
Hello,

The problem is I can't know the combination of set/unset values. From my
perspective every value should be set. The event from Kafka represents the
complete state of the happening at certain point in time. In my table I
want to store the latest event so the most recent state of the happening
(in this table I don't care about the history). Actually I used wrong
expression since its just the opposite of "incremental update", every event
carries all data (state) for specific point of time.

The event is represented with nested json structure. Top level elements of
the json are table fields with type like text, boolean, timestamp, list and
the nested elements are UDT fields.

Simplified example:
There is a new purchase for the happening, event:
{total_amount: 50, items : [A, B, C, new_item], purchase_time : '2018-12-27
13:30', specials: null, customer : {... }, fare_amount,...}
I don't know what actually happened for this event: maybe a new item was
purchased, maybe some customer info has been changed, maybe the specials
have been revoked and I have to reset them. I just need to store the state
as it arrived from Kafka. There might already be an event for this happening
saved before, or maybe this is the first one.
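
To illustrate why simply skipping nulls does not fit full-state events, here
is a minimal sketch in Python. It models a stored row as a plain dict rather
than talking to Cassandra; the function names and the sample values are
hypothetical and exist only for this illustration.

```python
def upsert_ignoring_nulls(stored, event):
    """Toy model of an upsert that skips null fields (no tombstone written)."""
    merged = dict(stored)
    merged.update({k: v for k, v in event.items() if v is not None})
    return merged


def upsert_full_state(stored, event):
    """Toy model of an upsert that writes every field, nulls included."""
    merged = dict(stored)
    merged.update(event)
    return merged


# Previously saved state of the happening (made-up values).
stored = {"total_amount": 40, "items": ["A", "B", "C"], "specials": "promo"}
# New event: the specials were revoked, so the field arrives as null.
event = {"total_amount": 50, "items": ["A", "B", "C", "new_item"], "specials": None}

stale = upsert_ignoring_nulls(stored, event)
fresh = upsert_full_state(stored, event)

print(stale["specials"])  # the old value survives, which is wrong for this use case
print(fresh["specials"])  # the field is cleared, matching the event
```

In this toy model, only the second variant ends up matching the state carried
by the event, which is why the null has to be written.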

BR,
Tomas


On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <mightye@gmail.com> wrote:

> Depending on the use case, creating separate prepared statements for each
> combination of set / unset values in large INSERT/UPDATE statements may be
> prohibitive.
>
> Instead, you can look into driver level support for UNSET values.
> Requires Cassandra 2.2 or later IIRC.
>
> See:
> Java Driver:
> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
> Python Driver:
> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
> Node Driver:
> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Posted by Eric Stevens <mi...@gmail.com>.
Depending on the use case, creating separate prepared statements for each
combination of set / unset values in large INSERT/UPDATE statements may be
prohibitive.

Instead, you can look into driver-level support for UNSET values. This
requires Cassandra 2.2 or later, IIRC.

See:
Java Driver:
https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
Python Driver:
https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
Node Driver:
https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
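
As a rough sketch of the idea in Python: a bound parameter can be in one of
three states, namely a value, an explicit null (which writes a tombstone), or
unset (which leaves the column untouched). The helper `bind_values` below is
hypothetical, and it uses a local `UNSET` sentinel so that it runs without a
cluster; with the DataStax Python driver you would pass
`cassandra.query.UNSET_VALUE` instead. The column names are borrowed from the
example event earlier in the thread.

```python
# Local stand-in for the driver's unset marker, so the sketch runs on its own.
UNSET = object()


def bind_values(columns, event, unset=UNSET):
    """Build positional bind values for a single prepared INSERT.

    Keys missing from the event become `unset` (column left untouched);
    keys explicitly set to None stay None (column deleted, tombstone written).
    """
    return [event.get(col, unset) for col in columns]


columns = ["id", "total_amount", "specials", "purchase_time"]
# "specials" is explicitly null; "purchase_time" is absent from the event.
event = {"id": "h1", "total_amount": 50, "specials": None}

values = bind_values(columns, event)
```

With a real session this would be used roughly as
`session.execute(prepared, bind_values(columns, event, UNSET_VALUE))`, so one
prepared statement covers every combination of present and absent fields.
Note that this only helps when a field is genuinely absent; a field that
arrives as null and is meant to reset the column is still written as null.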
