Posted to user@cassandra.apache.org by Matthew Johnson <ma...@algomi.com> on 2015/04/29 10:36:34 UTC

Inserting null values

Hi all,



I have some fields that I am storing into Cassandra, but some of them could
be null at any given point. As there are quite a lot of them, it makes the
code much more readable if I don’t check each one for null before adding it
to the INSERT.



I can see a few Jiras around CQL 3 supporting inserting nulls:



https://issues.apache.org/jira/browse/CASSANDRA-3783

https://issues.apache.org/jira/browse/CASSANDRA-5648



But I have tested inserting null and it seems to work fine (when querying
the table with cqlsh, it shows up as a red lowercase *null*).
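
For concreteness, the kind of statement in question, against a hypothetical
table (all names are purely illustrative):

    -- hypothetical schema, for illustration only
    CREATE TABLE trades (id int PRIMARY KEY, price decimal, notes text);

    -- accepted by CQL; as discussed below, the null writes a tombstone
    -- for "notes" rather than simply leaving the column absent
    INSERT INTO trades (id, price, notes) VALUES (1, 9.99, null);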



Are there any obvious pitfalls to look out for that I have missed? Could it
be a performance concern to insert a row with some nulls, as opposed to
checking the values first and inserting the row and just omitting those
columns?



Thanks!

Matt

Re: Inserting null values

Posted by DuyHai Doan <do...@gmail.com>.
<auto promotion mode on>

The problem of NULL inserts was solved a long time ago with the Insert
Strategy in Achilles:
https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy

<auto promotion mode off>

However, it's nice to see that there will be a flag on the protocol side to
handle this problem.


Re: Inserting null values

Posted by Ali Akhtar <al...@gmail.com>.
Have you considered adding a 'safe' method which checks if the item is
null and, if so, returns a default value? E.g. String foo = safe(bar, "");
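
A minimal sketch of such a helper in plain Java (the method name and the
defaulting behaviour follow Ali's description; this is not an existing
driver API):

    public final class NullSafe {
        // Return the value itself, or the supplied default when it is null,
        // so the INSERT always receives a concrete value.
        public static <T> T safe(T value, T defaultValue) {
            return value != null ? value : defaultValue;
        }
    }

    // Usage: String foo = NullSafe.safe(bar, "");

Note the trade-off raised later in the thread: the default (e.g. an empty
string) is actually stored, whereas omitting the column stores nothing at
all.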

RE: Inserting null values

Posted by "Peer, Oded" <Od...@rsa.com>.
I’ve added an option to trunk that prevents tombstone creation when using PreparedStatements; see CASSANDRA-7304.

The problem is having tombstones in regular columns.
When you perform a read request (range query or by PK):
- Cassandra iterates over all the cells (all, not only the cells specified in the query) in the relevant rows while counting tombstone cells (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java#L199)
- creates a ColumnFamily object instance with the rows
- filters the selected columns from the internal CF (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java#L653)
- returns the result

If you have many unnecessary tombstones, you read many unnecessary cells.
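
A toy model of that read flow, for illustration only (the real logic lives
in the SliceQueryFilter and SelectStatement code linked above; the types
and names below are invented):

    import java.util.List;
    import java.util.stream.Collectors;

    final class ReadFlowSketch {
        record Cell(String name, String value, boolean tombstone) {}

        static List<Cell> read(List<Cell> rowCells, List<String> selected) {
            // Step 1: visit ALL cells in the row and count tombstones,
            // including cells for columns the query never asked for.
            long tombstones = rowCells.stream().filter(Cell::tombstone).count();
            System.out.println("tombstone cells scanned: " + tombstones);
            // Step 2: only now filter down to the selected, live columns.
            return rowCells.stream()
                    .filter(c -> !c.tombstone() && selected.contains(c.name()))
                    .collect(Collectors.toList());
        }
    }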



Re: Inserting null values

Posted by Eric Stevens <mi...@gmail.com>.
I agree that inserting null is not as good as not inserting that column at
all when you have confidence that you are not shadowing any underlying
data. But pragmatically speaking, a small number of incidental
nulls/tombstones (< 20% of columns, otherwise CASSANDRA-3442 takes over)
really doesn't sound like it is going to have any practical performance
impact, either on your query patterns or on compaction.

If INSERT of null values is problematic for small portions of your data,
then it stands to reason that an INSERT option containing an instruction to
prevent tombstone creation would be an important performance optimization
(and would also address the fact that non-null collections generate
tombstones on INSERT too).  INSERT INTO ... USING no_tombstones;


> There are thresholds (log messages, etc.) which operate on tombstone
counts over a certain number, but not on column counts over the same number.

tombstone_warn_threshold and tombstone_failure_threshold only apply to
clustering scans, right?  I.e., tombstones don't count against those
thresholds if they are not part of the clustering key column being
considered for the non-EQ relation?  The documentation certainly implies so:

tombstone_warn_threshold
<http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__tombstone_warn_threshold>
(Default: 1000) The maximum number of tombstones a query can scan before
warning.

tombstone_failure_threshold
<http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__tombstone_failure_threshold>
(Default: 100000) The maximum number of tombstones a query can scan before
aborting.
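
Both are plain cassandra.yaml settings; a snippet showing the defaults
quoted above:

    # cassandra.yaml
    tombstone_warn_threshold: 1000
    tombstone_failure_threshold: 100000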


Re: Inserting null values

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Apr 29, 2015 at 9:16 AM, Eric Stevens <mi...@gmail.com> wrote:

> In the end, inserting a tombstone into a non-clustered column shouldn't be
> appreciably worse (if it is at all) than inserting a value instead.  Or am
> I missing something here?
>

There are thresholds (log messages, etc.) which operate on tombstone counts
over a certain number, but not on column counts over the same number.

Given that tombstones are often smaller than data columns, it's sorta hard
to understand conceptually?

=Rob

Re: Inserting null values

Posted by Philip Thompson <ph...@datastax.com>.
In a way, yes. A tombstone will only be removed after gc_grace iff the
compaction is sure that it contains all rows which that tombstone might
shadow. When two conflicting non-tombstone rows are compacted, it's always
just LWW (last write wins).


Re: Inserting null values

Posted by Eric Stevens <mi...@gmail.com>.
But we're talking about a single tombstone on each of a finite (small) set
of values, right?  We're not talking about INSERTs which are 99% nulls (at
least I don't think that's what Matthew was suggesting).  Unless you're
engaging in the antipattern of repeated overwrite, I'm still struggling to
see why this is worse than an equivalent number of non-tombstoned writes.
In fact from the description I don't think we're talking about these
tombstones even occluding any value at all.

> imagine a multi-TB sstable w/ 99% tombstones

Let's play with this hypothetical, which doesn't seem like a probable
consequence of the original question.  You'd have to have taken enough
writes *inside* gc grace period to have even produced a multi-TB sstable to
come anywhere near this, and even then this either exceeds or comes really
close to the recommended maximum total data size per node (let alone in a
single sstable).  If you did have such an sstable, it doesn't seem very
likely to compact again inside gc grace period short of manually triggered
major compaction.

But let's assume you do that: you run cassandra-stress inserting nothing
but tombstones, and kick off major compaction periodically.  If it
compacted inside gc grace period, is this worse for compaction than the
same number of non-tombstoned values (i.e. a multi-TB sstable is costly to
compact no matter what the contents)?  If it compacted outside gc grace
period, then 99% of the work is just dropping tombstones, it seems like it
would run really fast (for being an absurdly large sstable), as there would
be just 1% of the contents to actually copy over to the new sstable.

I'm still not clear on what I'm missing.  Is a tombstone more expensive to
compact than a non-tombstone?


Re: Inserting null values

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Enough tombstones can inflate the size of an SSTable, causing issues
during compaction (imagine a multi-TB sstable w/ 99% tombstones), even if
there's no clustering key defined.

Perhaps an edge case, but worth considering.


Re: Inserting null values

Posted by Eric Stevens <mi...@gmail.com>.
Correct me if I'm wrong, but tombstones are only really problematic if you
have them going into clustering keys and then perform a range select on
that column, right (assuming it's not a symptom of the antipattern of
indefinitely overwriting the same value)?  I.e., you're deleting clusters
off of a partition.  A tombstone isn't any more costly than a normal
column, and is in some ways less costly (it's a smaller size at rest than,
say, inserting an empty string or other default value as someone suggested).

Tombstones stay around a little longer post-compaction than other values,
so that's a downside, but they also would drop off the record as if it had
never been set on the next compaction after gc grace period.

Tombstones aren't intrinsically bad, but they can have some bad properties
in certain situations.  This doesn't strike me as one of them.  If you have
a way to avoid inserting null when you know you aren't occluding an
underlying value, that would be ideal.  But because the tombstone would sit
adjacent on disk to other values from the same insert, even if you were on
platters, the drive head is *already positioned* over the tombstone
location when it's read, because it read the prior value and subsequent
value which were written during the same insert.

In the end, inserting a tombstone into a non-clustered column shouldn't be
appreciably worse (if it is at all) than inserting a value instead.  Or am
I missing something here?


RE: Inserting null values

Posted by Matthew Johnson <ma...@algomi.com>.
Thank you all for the advice!



I have decided to use the Insert query builder (
*com.datastax.driver.core.querybuilder.Insert*) which allows me to
dynamically insert as many or as few columns as I need, and doesn’t require
multiple prepared statements. Then, I will look at Ali’s suggestion – I
will create a small helper method like ‘addToInsertIfNotNull’ and pump all
my values into that, which will then filter out the ones that are null.
Should keep the code nice and neat – I will feed back if I find any
problems with this approach (but please jump in if you have already spotted
any :)).



Thanks!

Matt
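
A minimal sketch of that approach, assuming the 2.x DataStax Java driver
(the helper name is Matthew's; the implementation and the keyspace, table,
and column names are assumptions):

    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.querybuilder.Insert;
    import com.datastax.driver.core.querybuilder.QueryBuilder;

    public final class InsertHelper {

        // Add the column only when the value is non-null, so absent values
        // are simply omitted from the INSERT and write no tombstone.
        static Insert addToInsertIfNotNull(Insert insert, String column, Object value) {
            return value != null ? insert.value(column, value) : insert;
        }

        // Usage sketch -- keyspace, table, and columns are hypothetical.
        static void save(Session session, String id, String firstName, String lastName) {
            Insert insert = QueryBuilder.insertInto("my_keyspace", "my_table")
                                        .value("id", id);
            insert = addToInsertIfNotNull(insert, "first_name", firstName);
            insert = addToInsertIfNotNull(insert, "last_name", lastName);
            session.execute(insert);
        }
    }

Note this builds a fresh statement each time rather than reusing a prepared
statement, which is exactly the trade-off Matthew accepts above.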




Re: Inserting null values

Posted by Robert Wille <rw...@fold3.com>.
I’ve come across the same thing. I have a table with at least half a dozen columns that could be null, in any combination. Having a prepared statement for each permutation of null columns just isn’t going to happen. I don’t want to build custom queries each time because I have a really cool system of managing my queries that relies on them being prepared.

Fortunately for me, I should have at most a handful of tombstones in each partition, and most of my records are written exactly once. So, I just let the tombstones get written and they’ll eventually get compacted out and life will go on.

It’s annoying and not ideal, but what can you do?


RE: Inserting null values

Posted by "Peer, Oded" <Od...@rsa.com>.
Inserting a null value creates a tombstone. Tombstones can have major performance implications.
You can see the tombstones using sstable2json.
If you have a small number of records with null values this seems OK; otherwise I recommend using the QueryBuilder (for Java clients) and waiting for https://issues.apache.org/jira/browse/CASSANDRA-7304.
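
For example, pointed at an SSTable's Data.db file (the path and file name
here are illustrative and vary by Cassandra version and data layout):

    sstable2json /var/lib/cassandra/data/my_keyspace/my_table/my_keyspace-my_table-jb-1-Data.db

Cells that are tombstones are marked as deleted in the JSON output, which
makes it easy to see how many your inserts are creating.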

