Posted to user@cassandra.apache.org by ch...@cmartinit.co.uk on 2016/09/07 07:47:00 UTC

Finding records that exist on Cassandra but not externally

First off, I hope this is appropriate here - I couldn't decide whether
this was a question for Cassandra users or Spark users, so if you think
it's in the wrong place feel free to redirect me.

I have a system that does a load of data manipulation using Spark. The
output of this program is effectively the new state that I want my
Cassandra table to be in, and the final step is to update Cassandra so
that it matches this state.

At present I'm inserting all rows in my generated state into Cassandra.
This works for new rows and for updating existing rows, but of course
it doesn't delete any rows that were already in Cassandra but not in my
new state.
 
The problem I have now is how best to delete these missing rows. Options I have considered are:

1. Setting a TTL on inserts that is roughly the same as my data refresh
period. This would probably be pretty performant, but I really don't
want to do it, because it would mean that all data in my database would
disappear if I had issues running my refresh task!

2. Fetching all primary keys from Cassandra on every refresh, comparing
them to the primary keys locally to build a list of PKs to delete, and
issuing those deletes before the insert (a sketch of this approach
follows the list). This seems the most logically correct option, but it
means reading vast amounts of data from Cassandra.

3. Truncating the entire table before refreshing Cassandra. This has
the benefit of being pretty simple in code, but I'm not sure of the
performance implications, or of what happens if I truncate while a node
is offline.
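
To make option 2 concrete, here is a minimal sketch using the DataStax
spark-cassandra-connector. The keyspace "ks", table "my_table", key
column "id" and the newStateKeys RDD are illustrative names rather than
a real schema, and deleteFromCassandra needs a 2.x connector (on older
versions you would issue batched DELETE statements yourself):

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Illustrative names: keyspace "ks", table "my_table", single
    // primary key column "id"; newStateKeys is the key set produced
    // by the existing Spark job.
    def deleteMissingRows(sc: SparkContext, newStateKeys: RDD[String]): Unit = {
      // Read only the key column from Cassandra (server-side
      // projection), so only keys are shipped, not whole rows.
      val cassandraKeys: RDD[String] =
        sc.cassandraTable[String]("ks", "my_table").select("id")

      // Keys present in Cassandra but absent from the new state.
      val staleKeys = cassandraKeys.subtract(newStateKeys)

      // Issue a DELETE per stale key.
      staleKeys.map(Tuple1(_))
        .deleteFromCassandra("ks", "my_table", keyColumns = SomeColumns("id"))
    }

The subtract still shuffles every key once per refresh, but keys alone
are far smaller than full rows, which may make this tolerable at tens
of millions of rows.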

For reference, the table is on the order of tens of millions of rows,
and for any given refresh only a very small fraction (<0.1%) will
actually need deleting; 99% of the time I'll just be overwriting
existing keys.

I'd be grateful if anyone could offer advice on the best solution here,
or on a better way I haven't thought of.

Thanks,

Chris 

Re: Finding records that exist on Cassandra but not externally

Posted by Eric Stevens <mi...@gmail.com>.
I might be inclined to include a generation ID in the partition key.
Keep a separate table where you update the current generation ID when
your processing is complete. You can even use CAS operations to guard
against accidentally running two generations at the same time (or your
processing time exceeding your processing period), so that you'd know
which generation failed. Make the generation a TimeUUID rather than an
integer: with an integer, even though the generation update would fail,
you'd still have had two processes writing data to the same generation
ID. This way, if processing fails partway through, you also don't end
up with corrupted or incomplete state.
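
As a rough sketch of that bookkeeping (the schema and names here are
invented for illustration, using the DataStax Java driver from Scala):

    import java.util.UUID
    import com.datastax.driver.core.Cluster
    import com.datastax.driver.core.utils.UUIDs

    // Invented schema for illustration:
    //   CREATE TABLE ks.generations (name text PRIMARY KEY, current uuid);
    //   ks.my_table carries the generation UUID in its partition key.
    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()

    // A TimeUUID identifying this refresh run.
    val newGen: UUID = UUIDs.timeBased()

    // ... write the full refreshed state under (newGen, id) ...

    // Promote the new generation with a CAS (lightweight transaction):
    // it only applies if "current" still points at the generation we
    // started from, so an overlapping refresh is detected.
    def promote(previousGen: UUID): Boolean = {
      val rs = session.execute(
        "UPDATE ks.generations SET current = ? WHERE name = 'my_table' IF current = ?",
        newGen, previousGen)
      rs.one().getBool("[applied]")
    }

Readers would look up the current generation first and include it in
every query against the main table.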

Then, in your primary table, write data with a TTL of some whole
multiple of your expected processing period. I wouldn't recommend
making it close to the processing period itself unless you are really
pressed on storage costs: things go wrong in processing, and you don't
want to end up with the most recent active generation having fully
expired.
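
A sketch of such a write with the spark-cassandra-connector, where
newStateRows stands in for the RDD produced by the refresh job and the
three-period multiple is an arbitrary illustration:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
    import scala.concurrent.duration._

    // A TTL of three processing periods leaves room for a couple of
    // failed runs before live data starts expiring.
    val processingPeriod = 1.day
    val ttlSeconds = (processingPeriod * 3).toSeconds.toInt

    // newStateRows: RDD of rows from the refresh job (illustrative).
    newStateRows.saveToCassandra("ks", "my_table",
      writeConf = WriteConf(ttl = TTLOption.constant(ttlSeconds)))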

It is an anti-pattern to repeatedly overwrite the same data in
Cassandra, even if the prior data is TTL'd out. You'll find that you
still end up having to seek far more SSTables than you anticipate, due
to the counter-intuitive way that tombstones and expired TTLs are
expunged. For this exact pattern, though, there are a few optimizations
(I'm thinking of tombstone-only compaction) that make it less painful
than it would be for even very minor deviations from it.
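
Those compaction behaviours are tunable per table; a hedged example of
loosening the single-SSTable tombstone compaction knobs so expired data
is purged more eagerly (defaults and trade-offs vary by Cassandra
version, so treat this as a starting point, not a recommendation):

    // Executed once against the cluster, e.g. via a driver session as
    // in the earlier sketch. Note that gc_grace_seconds also gates how
    // soon expired cells can actually be purged.
    session.execute(
      """ALTER TABLE ks.my_table WITH compaction = {
        |  'class': 'SizeTieredCompactionStrategy',
        |  'tombstone_threshold': '0.1',
        |  'unchecked_tombstone_compaction': 'true'
        |}""".stripMargin)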

Re: Finding records that exist on Cassandra but not externally

Posted by Jens Rantil <je...@tink.se>.
Hi again Chris,

Another option would be to have a look at using a Merkle Tree to quickly
drill down to the differences. This is actually what Cassandra uses
internally when running a repair between different nodes.
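
Cassandra's repair builds real Merkle trees over token ranges; a single
level of hash buckets shows the same mechanics in miniature (plain
Scala, names invented):

    import java.security.MessageDigest

    // Digest each bucket of keys. Only buckets whose digests differ
    // between the two sides need their individual keys compared, so
    // traffic scales with the differences rather than the table size.
    def bucketDigests(keys: Seq[String], buckets: Int = 1024): Map[Int, Seq[Byte]] =
      keys.groupBy(k => (k.hashCode & Int.MaxValue) % buckets).map { case (b, ks) =>
        val md = MessageDigest.getInstance("SHA-256")
        ks.sorted.foreach(k => md.update(k.getBytes("UTF-8")))
        b -> md.digest().toSeq // Seq compares by value, unlike Array
      }

    def mismatchedBuckets(left: Map[Int, Seq[Byte]],
                          right: Map[Int, Seq[Byte]]): Set[Int] =
      (left.keySet ++ right.keySet).filter(b => left.get(b) != right.get(b))

    // A real Merkle tree recurses, re-bucketing only the mismatched
    // buckets until the differing keys are pinned down.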

Cheers,
Jens


Re: Finding records that exist on Cassandra but not externally

Posted by Jens Rantil <je...@tink.se>.
Hi Chris,

Without fully knowing your use case: can't you keep track of which keys
have changed in the external system somehow? Otherwise 2) sounds like
the way to go to me.

Cheers,
Jens

-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.