Posted to user@cassandra.apache.org by "Charulata Sharma (charshar)" <ch...@cisco.com> on 2018/03/22 19:18:56 UTC

Using Spark to delete from Transactional Cluster

Hi,
   Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.

However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.

Are there any risks involved in this?
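
For concreteness, a minimal sketch of the connector-based delete being considered, assuming spark-cassandra-connector 2.0+ and a hypothetical table ks.purge_candidates with primary key (account_id, close_date):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setAppName("purge-job")
  .set("spark.cassandra.connection.host", "analytics-node-1") // hypothetical contact point
val sc = new SparkContext(sparkConf)

// Scan out only the primary-key columns of the rows matching the purge rule,
// then issue tombstone-producing deletes for exactly those keys.
sc.cassandraTable("ks", "purge_candidates")
  .select("account_id", "close_date")
  .where("close_date < ?", "2017-01-01") // assumes close_date is a clustering column
  .deleteFromCassandra("ks", "purge_candidates")

Because the job runs against the analytical DC, the resulting tombstones replicate to the transactional DC like any other write.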

Thanks,
Charu


Re: Using Spark to delete from Transactional Cluster

Posted by "Charulata Sharma (charshar)" <ch...@cisco.com>.
Yes, essentially it’s the same, but from a code-complexity perspective, writing it in Spark is more compact and execution is superfast. Spark uses the Cassandra connector, so the question was mostly about whether there is any issue with that, and also about the fact that with Spark we will be deleting on the analytical nodes, which would then be replicated over to the transactional nodes instead of the other way round. And yes, the tombstone problem is the same with either approach.

I just want to know the pros and cons, if any.
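
For comparison, the prepared-statement route would look roughly like this with the DataStax Java driver 3.x (same hypothetical table as above; deleting by partition key alone drops the whole partition):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("txn-node-1").build() // hypothetical contact point
val session = cluster.connect()

// One bound statement per key, executed serially from a single client.
val delete = session.prepare("DELETE FROM ks.purge_candidates WHERE account_id = ?")
val keysToPurge = Seq("a-001", "a-002") // stand-in for the real key discovery
keysToPurge.foreach(id => session.execute(delete.bind(id)))

session.close()
cluster.close()

The Spark version distributes both the key discovery and the delete execution across executors, which is where the compactness and speed described above come from.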

Charu


From: Jonathan Haddad <jo...@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, March 23, 2018 at 12:10 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster

I'm confused as to what the difference is between deleting with prepared statements and deleting through Spark. To the best of my knowledge it's the same thing either way - normal deletion, with tombstones replicated. Is it that you're doing the deletes in the analytics DC instead of your real-time one?

On Fri, Mar 23, 2018 at 11:38 AM Charulata Sharma (charshar) <ch...@cisco.com> wrote:
Hi Rahul,
         Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?
I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,
because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.

Thanks,
Charu

From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, March 22, 2018 at 5:08 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster

Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.

It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
Hi,
   Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.

However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.

Are there any risks involved in this?

Thanks,
Charu


Re: Using Spark to delete from Transactional Cluster

Posted by Nitan Kainth <ni...@gmail.com>.
We use Spark to do the same, because our partitions contain a whole year of data and we delete one day at a time. C* does not allow us to delete without using the partition key. I know it’s the wrong data model, but we can’t change it, for the obvious reason that it would mean a whole application redesign.
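
Under an assumed schema where the partition key is (sensor_id, year) and day is a clustering column, that pattern might look like the following sketch: scan the full primary keys with Spark, keep only the target day, and delete just those rows.

import com.datastax.spark.connector._
// sc is a SparkContext configured for the cluster, as in the earlier sketch.

val targetDay = "2018-03-01" // hypothetical
sc.cassandraTable("ks", "events")
  .select("sensor_id", "year", "day")      // the full primary key
  .filter(_.getString("day") == targetDay) // filter in Spark, since CQL cannot restrict 'day' alone
  .deleteFromCassandra("ks", "events")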

Sent from my iPhone

> On Mar 23, 2018, at 2:10 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:
> 
> I'm confused as to what the difference is between deleting with prepared statements and deleting through Spark. To the best of my knowledge it's the same thing either way - normal deletion, with tombstones replicated. Is it that you're doing the deletes in the analytics DC instead of your real-time one?
> 
>> On Fri, Mar 23, 2018 at 11:38 AM Charulata Sharma (charshar) <ch...@cisco.com> wrote:
>> Hi Rahul,
>> 
>>          Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?
>> 
>> I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,
>> 
>> because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.
>> 
>>  
>> 
>> Thanks,
>> 
>> Charu
>> 
>>  
>> 
>> From: Rahul Singh <ra...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Thursday, March 22, 2018 at 5:08 PM
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Using Spark to delete from Transactional Cluster
>> 
>>  
>> 
>> Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.
>> 
>> It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.
>> 
>> 
>> --
>> Rahul Singh
>> rahul.singh@anant.us
>> 
>> Anant Corporation
>> 
>> 
>> On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
>> 
>> 
>> Hi,
>> 
>>    Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
>> 
>> We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.
>> 
>> However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
>> 
>> So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.
>> 
>> Are there any risks involved in this?
>> 
>>  
>> 
>> Thanks,
>> 
>> Charu
>> 
>>           

Re: Using Spark to delete from Transactional Cluster

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
I'm confused as to what the difference is between deleting with prepared
statements and deleting through Spark. To the best of my knowledge it's the
same thing either way - normal deletion, with tombstones replicated. Is it
that you're doing the deletes in the analytics DC instead of your real-time
one?

On Fri, Mar 23, 2018 at 11:38 AM Charulata Sharma (charshar) <
charshar@cisco.com> wrote:

> Hi Rahul,
>
>          Thanks for your answer. Why do you say that deleting from Spark
> is not elegant? This is the exact feedback I want: basically, why is it
> not elegant?
>
> I can either delete using delete prepared statements or through Spark.
> The TTL approach doesn’t work for us,
>
> because, first of all, TTL is set at the column level, and there are
> business rules for purge which make the TTL solution not very clean in
> our case.
>
>
>
> Thanks,
>
> Charu
>
>
>
> From: Rahul Singh <ra...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, March 22, 2018 at 5:08 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Using Spark to delete from Transactional Cluster
>
>
>
> Short answer: it works. You can even run “delete” statements from within
> Spark once you know which keys to delete. Not elegant, but it works.
>
> It will create a bunch of tombstones, and you may need to spread your
> deletes over several days. Another thing to consider: instead of deleting,
> set a TTL, and the data will eventually get cleaned out.
>
>
> --
> Rahul Singh
> rahul.singh@anant.us
>
> Anant Corporation
>
>
> On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <
> charshar@cisco.com>, wrote:
>
> Hi,
>
>    Wanted to know the community’s experiences and feedback on using Apache
> Spark to delete data from a C* transactional cluster.
>
> We have Spark installed in our analytical C* cluster, and so far we have
> been using Spark only for analytics purposes.
>
> However, now with the advanced features of Spark 2.0, I am considering
> using the spark-cassandra-connector for deletes instead of a series of
> delete prepared statements.
>
> So essentially the deletes will happen on the analytical cluster, and they
> will be replicated over to the transactional cluster by means of our
> keyspace replication strategies.
>
> Are there any risks involved in this?
>
>
>
> Thanks,
>
> Charu
>
>
>
>

Re: Using Spark to delete from Transactional Cluster

Posted by Jacques-Henri Berthemet <ja...@genesys.com>.
A row is TTLed once all its columns are TTLed. If you want a whole row to expire at once, just set the same TTL on all its columns.
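
With the spark-cassandra-connector, one way to do that in bulk is to rewrite each row with a single shared TTL (a sketch against the same hypothetical table as above; note this rewrites data rather than deleting it):

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

// Re-insert every row with the same TTL on all of its columns, so each row
// expires as a unit. Roughly the bulk form of per-row CQL like:
//   UPDATE ks.purge_candidates USING TTL 7776000 SET ... WHERE account_id = ?
val ninetyDays = 90 * 24 * 3600
sc.cassandraTable("ks", "purge_candidates")
  .saveToCassandra("ks", "purge_candidates",
    writeConf = WriteConf(ttl = TTLOption.constant(ninetyDays)))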

________________________________
From: Charulata Sharma (charshar) <ch...@cisco.com>
Sent: Friday, March 23, 2018 9:52:28 PM
To: user@cassandra.apache.org
Subject: Re: Using Spark to delete from Transactional Cluster


Yes, agreed on “let really old data expire”. However, I could not find a way to TTL an entire row; only columns can be TTLed.



Charu



From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, March 23, 2018 at 1:45 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster



I think there are better ways to leverage parallel processing than using it to delete data. As I said, it works for one of my projects, for the exact same reason you stated: business rules.

Deleting data is an old way of thinking. Why not store the data and just use the relevant data, and let really old data expire?

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 23, 2018, 11:38 AM -0700, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:


Hi Rahul,

         Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?

I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,

because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.



Thanks,

Charu



From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, March 22, 2018 at 5:08 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster



Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.

It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:

Hi,

   Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.

We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.

However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.

So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.

Are there any risks involved in this?



Thanks,

Charu



Re: Using Spark to delete from Transactional Cluster

Posted by "Charulata Sharma (charshar)" <ch...@cisco.com>.
Yes, agreed on “let really old data expire”. However, I could not find a way to TTL an entire row; only columns can be TTLed.

Charu

From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, March 23, 2018 at 1:45 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster

I think there are better ways to leverage parallel processing than using it to delete data. As I said, it works for one of my projects, for the exact same reason you stated: business rules.

Deleting data is an old way of thinking. Why not store the data and just use the relevant data, and let really old data expire?

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 23, 2018, 11:38 AM -0700, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:

Hi Rahul,
         Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?
I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,
because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.

Thanks,
Charu

From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, March 22, 2018 at 5:08 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster

Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.

It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
Hi,
   Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.

However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.

Are there any risks involved in this?

Thanks,
Charu


Re: Using Spark to delete from Transactional Cluster

Posted by Rahul Singh <ra...@gmail.com>.
I think there are better ways to leverage parallel processing than using it to delete data. As I said, it works for one of my projects, for the exact same reason you stated: business rules.

Deleting data is an old way of thinking. Why not store the data and just use the relevant data, and let really old data expire?

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 23, 2018, 11:38 AM -0700, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
> Hi Rahul,
>          Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?
> I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,
> because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.
>
> Thanks,
> Charu
>
> From: Rahul Singh <ra...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, March 22, 2018 at 5:08 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Using Spark to delete from Transactional Cluster
>
> Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.
>
> It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.
>
> --
> Rahul Singh
> rahul.singh@anant.us
>
> Anant Corporation
>
> On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
>
> > Hi,
> >    Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
> > We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.
> >
> > However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
> > So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.
> >
> > Are there any risks involved in this?
> >
> > Thanks,
> > Charu
> >

Re: Using Spark to delete from Transactional Cluster

Posted by "Charulata Sharma (charshar)" <ch...@cisco.com>.
Hi Rahul,
         Thanks for your answer. Why do you say that deleting from Spark is not elegant? This is the exact feedback I want: basically, why is it not elegant?
I can either delete using delete prepared statements or through Spark. The TTL approach doesn’t work for us,
because, first of all, TTL is set at the column level, and there are business rules for purge which make the TTL solution not very clean in our case.

Thanks,
Charu

From: Rahul Singh <ra...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, March 22, 2018 at 5:08 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>, "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Using Spark to delete from Transactional Cluster

Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.

It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:

Hi,
   Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.

However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.

Are there any risks involved in this?

Thanks,
Charu


Re: Using Spark to delete from Transactional Cluster

Posted by Rahul Singh <ra...@gmail.com>.
Short answer: it works. You can even run “delete” statements from within Spark once you know which keys to delete. Not elegant, but it works.

It will create a bunch of tombstones, and you may need to spread your deletes over several days. Another thing to consider: instead of deleting, set a TTL, and the data will eventually get cleaned out.
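
One way to spread the deletes out, as suggested above, is to cap how many keys each run touches - a toy sketch, with the cap and table being assumptions:

import com.datastax.spark.connector._

// Delete at most nightlyCap rows per run, so tombstones accumulate gradually
// and compaction can keep pace between runs.
val nightlyCap = 500000L
sc.cassandraTable("ks", "purge_candidates")
  .select("account_id", "close_date") // the full primary key
  .zipWithIndex()
  .collect { case (row, i) if i < nightlyCap => row }
  .deleteFromCassandra("ks", "purge_candidates")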

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Mar 22, 2018, 2:19 PM -0500, Charulata Sharma (charshar) <ch...@cisco.com>, wrote:
> Hi,
>    Wanted to know the community’s experiences and feedback on using Apache Spark to delete data from a C* transactional cluster.
> We have Spark installed in our analytical C* cluster, and so far we have been using Spark only for analytics purposes.
>
> However, now with the advanced features of Spark 2.0, I am considering using the spark-cassandra-connector for deletes instead of a series of delete prepared statements.
> So essentially the deletes will happen on the analytical cluster, and they will be replicated over to the transactional cluster by means of our keyspace replication strategies.
>
> Are there any risks involved in this?
>
> Thanks,
> Charu
>