Posted to user@cassandra.apache.org by Marcel Steinbach <ma...@chors.de> on 2012/01/21 10:45:49 UTC

Get all keys from the cluster

We're running an 8-node cluster with different CFs for different applications. One of the applications uses 1.5TB out of 1.8TB in total, but only because we started out without a deletion mechanism and implemented one later on. So there is probably a large amount of old data in there that we don't even use anymore.

Now we want to delete that data. To know which rows we may delete, we have to look up a SQL database. If the key is not in there anymore, we may delete that row in cassandra, too.

This basically means we have to iterate over all the rows in that CF. This kind of begs for hadoop, but that doesn't seem to be an option currently. I tried.

So we figured we could run over the sstable files (maybe only the index), check the keys against MySQL, and later run the deletes on the cluster. This way, we could iterate on each node in parallel.
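For illustration, here is a minimal sketch of the "check the keys against MySQL" step, assuming the dumped keys end up as plain strings in a text file and that the SQL side has a hypothetical my_table with a row_key column holding the keys to keep (all names are placeholders):

# Sketch: read candidate row keys from a per-node dump (one key per line),
# keep the ones that are NOT referenced in MySQL any more -- those are the
# rows we are allowed to delete. Table, column and file names are placeholders.
import MySQLdb

db = MySQLdb.connect(host="sqlhost", user="app", passwd="secret", db="appdb")
cur = db.cursor()

def still_referenced(key):
    # True if the key is still present in SQL, i.e. the row must be kept.
    cur.execute("SELECT 1 FROM my_table WHERE row_key = %s LIMIT 1", (key,))
    return cur.fetchone() is not None

src = open("keys-node1.txt")
dst = open("keys-to-delete-node1.txt", "w")
for line in src:
    key = line.strip()
    if key and not still_referenced(key):
        dst.write(key + "\n")
dst.close()
src.close()

The per-key SELECT is just the simplest thing that works; batching the lookups with an IN (...) clause would be kinder to the SQL server.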

Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use hadoop for that?

Cheers
Marcel

Re: Get all keys from the cluster

Posted by aaron morton <aa...@thelastpickle.com>.
If you want to keep the load out of the cassandra process and do the join to SQL offline, take a look at the bin/sstablekeys utility. This will let you output the keys in an sstable. You will need to do it for every sstable on every node, create the unique list, and then check it against your SQL db thingy. 

The output will be hex encoded, and the process scans through the index file. So it will add some IO load, but it does ask the OS to skip the IO cache. 
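If it helps, here is roughly what that per-node step could look like, assuming a stock data directory layout; the paths and the keyspace/CF names below are placeholders, not anything from your setup:

# Sketch: run bin/sstablekeys over every sstable of one CF on this node,
# decode the hex output and write a de-duplicated key list for the SQL check.
# Paths, keyspace and CF names below are assumptions.
import binascii
import glob
import subprocess

DATA_DIR = "/var/lib/cassandra/data/MyKeyspace"   # per-node data directory
SSTABLEKEYS = "/opt/cassandra/bin/sstablekeys"

keys = set()
for data_file in glob.glob(DATA_DIR + "/MyCF-*-Data.db"):
    out = subprocess.Popen([SSTABLEKEYS, data_file],
                           stdout=subprocess.PIPE).communicate()[0]
    for line in out.splitlines():
        line = line.strip()
        if line:
            keys.add(binascii.unhexlify(line))  # keys come out hex encoded

# Assumes the keys are plain strings; keep them hex encoded if they are binary.
f = open("keys-node1.txt", "w")
for key in sorted(keys):
    f.write(key + "\n")
f.close()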

Alternatively just write a script that iterates over all the rows in the CF and does the lookup.
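A minimal sketch of that variant with pycassa (any Thrift client would do); keyspace, CF, server and SQL names are again placeholders:

# Sketch: walk all rows of the CF with get_range(), look each key up in MySQL
# and delete the row if it is no longer referenced. pycassa and MySQLdb
# assumed; every name below is a placeholder.
import MySQLdb
import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", server_list=["cass-node1:9160"])
cf = pycassa.ColumnFamily(pool, "MyCF")

db = MySQLdb.connect(host="sqlhost", user="app", passwd="secret", db="appdb")
cur = db.cursor()

deleted = 0
for key, _columns in cf.get_range(column_count=1):  # only the keys are needed
    cur.execute("SELECT 1 FROM my_table WHERE row_key = %s LIMIT 1", (key,))
    if cur.fetchone() is None:        # gone from SQL -> safe to delete
        cf.remove(key)
        deleted += 1
print "deleted %d rows" % deleted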

hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/01/2012, at 1:11 AM, Eric Czech wrote:

> Great!  I'm glad at least one of those ideas was helpful for you.
> 
> That's a road we've travelled before and, as one last suggestion that might help, you could alter all client writers to cassandra beforehand so that they write to BOTH keyspaces BEFORE beginning the SQL-based transfer. This might help keep you from having to make multiple passes, unless I'm missing something.
> 
> On Sat, Jan 21, 2012 at 4:53 AM, Marcel Steinbach <ma...@chors.de> wrote:
> Thanks for your suggestions, Eric!
> 
>> One of the applications uses 1.5TB out of 1.8TB
> 
> I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one application uses 1.5TB, while the others use 300GB, totalling 1.8TB of data. Our total disk capacity, however, is about 7TB, so we're still far from running out of disk space.
> 
>> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in cassandra? 
> 
> Unfortunately, that won't work since the SQL db only contains the keys that we want to _keep_ in cassandra.
> 
>> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace using the SQL database records as a basis for what is actually "relevant" within the new keyspace.  
> 
> I like that idea. So instead of iterating over all cassandra rows, I would iterate over the SQL DB, which would indeed save me a lot of IO. However, rows inserted into my CF while iterating over the SQL DB might not be copied into the new keyspace. But maybe we could arrange to do that 
> during low-demand hours to minimize the number of new inserts, and additionally run the "copy" a second time with a select on newly inserted SQL rows. So we'll probably go with that.
> 
> Thanks again for your help!
> 
> Cheers
> Marcel
> 
> On 21.01.2012, at 11:52, Eric Czech wrote:
> 
>> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in cassandra?  
>> 
>> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace using the SQL database records as a basis for what is actually "relevant" within the new keyspace.  If you could perform that transfer, then you could just delete the old 1.5TB keyspace altogether, leaving only the data you need.  If that sort of duplication would put you over the 1.8TB limit during the transfer, then maybe you could consider CF compression upfront.
>> 
>> Short of that, I can tell from experience that doing this sort of "left join" delete from cassandra to SQL really sucks.  We have had to resort to using hadoop to do this, but since our hadoop/cassandra clusters are much larger than our single SQL instances, keeping all the hadoop processes from basically "DDoS"ing our SQL servers while still making the process faster than thrift iterations over all the rows (via custom programs) in cassandra hasn't been a convincing solution.
>> 
>> I'd say that the first solution I proposed is definitely the best, but also the most unrealistic.  If that's really not a possibility for you, then I'd seriously look at trying to make my second suggestion work even if it means bringing up new hardware or increasing the capacity of existing resources.  That second suggestion also has the added benefit of likely minimizing I/O since it's the only solution that doesn't require reading or deleting any of the unnecessary data (beyond wholesale keyspace or CF deletions), assuming that the actually relevant portion of your data is significantly less than 1.5TB.  
>> 
>> I hope that helps!
>> 
>> And in the future, you should really try to avoid letting your data size get beyond 40-50% of your actual on-disk capacity.  Let me know if anyone in the community disagrees, but I'd say you're about 600 GB past the point at which you have a lot of easy outs -- but I hope you find one anyways!
>> 
>> 
>> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <ma...@chors.de> wrote:
>> We're running an 8-node cluster with different CFs for different applications. One of the applications uses 1.5TB out of 1.8TB in total, but only because we started out without a deletion mechanism and implemented one later on. So there is probably a large amount of old data in there that we don't even use anymore.
>> 
>> Now we want to delete that data. To know which rows we may delete, we have to look up a SQL database. If the key is not in there anymore, we may delete that row in cassandra, too.
>> 
>> This basically means we have to iterate over all the rows in that CF. This kind of begs for hadoop, but that doesn't seem to be an option currently. I tried.
>> 
>> So we figured we could run over the sstable files (maybe only the index), check the keys against MySQL, and later run the deletes on the cluster. This way, we could iterate on each node in parallel.
>> 
>> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use hadoop for that?
>> 
>> Cheers
>> Marcel
>> 
> 
> 


Re: Get all keys from the cluster

Posted by Eric Czech <er...@nextbigsound.com>.
Great!  I'm glad at least one of those ideas was helpful for you.

That's a road we've travelled before and, as one last suggestion that might
help, you could alter all client writers to cassandra beforehand so that
they write to BOTH keyspaces BEFORE beginning the SQL-based transfer.  This
might help keep you from having to make multiple passes, unless I'm missing
something.
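
For what it's worth, a minimal sketch of that dual-write idea with pycassa; the keyspace, CF and server names are placeholders, and the wrapper would sit wherever your writers currently do their inserts:

# Sketch of the dual-write idea: every insert goes to the CF in both the old
# and the new keyspace, so nothing written after this point is missed by the
# SQL-driven copy. pycassa assumed; all names below are placeholders.
import pycassa

old_pool = pycassa.ConnectionPool("OldKeyspace", server_list=["cass-node1:9160"])
new_pool = pycassa.ConnectionPool("NewKeyspace", server_list=["cass-node1:9160"])
old_cf = pycassa.ColumnFamily(old_pool, "MyCF")
new_cf = pycassa.ColumnFamily(new_pool, "MyCF")

def dual_insert(key, columns):
    # Same columns, same key, both keyspaces.
    old_cf.insert(key, columns)
    new_cf.insert(key, columns)

# e.g. dual_insert("some-key", {"col": "value"})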

On Sat, Jan 21, 2012 at 4:53 AM, Marcel Steinbach <marcel.steinbach@chors.de> wrote:

>    Thanks for your suggestions, Eric!
>
> One of the applications uses 1.5TB out of 1.8TB
>
> I'm sorry, maybe that statement was slightly ambiguous. I meant to say
> that one application uses 1.5TB, while the others use 300GB, totalling
> 1.8TB of data. Our total disk capacity, however, is about 7TB, so we're
> still far from running out of disk space.
>
> Is there any way that you could do that lookup in reverse where you pull
> the records from your SQL database, figure out which keys aren't necessary,
> and then delete any unnecessary keys that may or may not exist in
> cassandra?
>
> Unfortunately, that won't work since the SQL db only contains the keys
> that we want to _keep_ in cassandra.
>
> If that's not a possibility, then what about creating the same Cassandra
> schema in a different keyspace and copying all the relevant records from
> the current keyspace to the new keyspace using the SQL database records as
> a basis for what is actually "relevant" within the new keyspace.
>
> I like that idea. So instead of iterating over all cassandra rows, I would
> iterate over the SQL DB, which would indeed save me a lot of IO. However,
> rows inserted into my CF while iterating over the SQL DB might not be
> copied into the new keyspace. But maybe we could arrange to do that
> during low-demand hours to minimize the number of new inserts, and
> additionally run the "copy" a second time with a select on newly inserted
> SQL rows. So we'll probably go with that.
>
> Thanks again for your help!
>
> Cheers
> Marcel
>
> On 21.01.2012, at 11:52, Eric Czech wrote:
>
> Is there any way that you could do that lookup in reverse where you pull
> the records from your SQL database, figure out which keys aren't necessary,
> and then delete any unnecessary keys that may or may not exist in
> cassandra?
>
> If that's not a possibility, then what about creating the same Cassandra
> schema in a different keyspace and copying all the relevant records from
> the current keyspace to the new keyspace using the SQL database records as
> a basis for what is actually "relevant" within the new keyspace.  If you
> could perform that transfer, then you could just delete the old 1.5TB
> keyspace altogether, leaving only the data you need.  If that sort of
> duplication would put you over the 1.8TB limit during the transfer, then
> maybe you could consider CF compression upfront.
>
> Short of that, I can tell from experience that doing this sort of "left
> join" delete from cassandra to SQL really sucks.  We have had to resort to
> using hadoop to do this, but since our hadoop/cassandra clusters are much
> larger than our single SQL instances, keeping all the hadoop processes from
> basically "DDoS"ing our SQL servers while still making the process faster
> than thrift iterations over all the rows (via custom programs) in cassandra
> hasn't been a convincing solution.
>
> I'd say that the first solution I proposed is definitely the best, but
> also the most unrealistic.  If that's really not a possibility for you,
> then I'd seriously look at trying to make my second suggestion work even if
> it means bringing up new hardware or increasing the capacity of existing
> resources.  That second suggestion also has the added benefit of likely
> minimizing I/O since it's the only solution that doesn't require reading or
> deleting any of the unnecessary data (beyond wholesale keyspace or CF
> deletions), assuming that the actually relevant portion of your data is
> significantly less than 1.5TB.
>
> I hope that helps!
>
> And in the future, you should really try to avoid letting your data size
> get beyond 40-50% of your actual on-disk capacity.  Let me know if
> anyone in the community disagrees, but I'd say you're about 600 GB past the
> point at which you have a lot of easy outs -- but I hope you find one
> anyways!
>
>
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <marcel.steinbach@chors.de> wrote:
>
>> We're running an 8-node cluster with different CFs for different
>> applications. One of the applications uses 1.5TB out of 1.8TB in total, but
>> only because we started out without a deletion mechanism and implemented one
>> later on. So there is probably a large amount of old data in there that we
>> don't even use anymore.
>>
>> Now we want to delete that data. To know which rows we may delete, we
>> have to look up a SQL database. If the key is not in there anymore, we may
>> delete that row in cassandra, too.
>>
>> This basically means we have to iterate over all the rows in that CF.
>> This kind of begs for hadoop, but that doesn't seem to be an option
>> currently. I tried.
>>
>> So we figured we could run over the sstable files (maybe only the
>> index), check the keys against MySQL, and later run the deletes on the
>> cluster. This way, we could iterate on each node in parallel.
>>
>> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to
>> use hadoop for that?
>>
>> Cheers
>> Marcel
>>
>
>
>

Re: Get all keys from the cluster

Posted by Marcel Steinbach <ma...@chors.de>.
Thanks for your suggestions, Eric!

> One of the applications uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one application uses 1.5TB, while the others use 300GB, totalling 1.8TB of data. Our total disk capacity, however, is about 7TB, so we're still far from running out of disk space.

> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in cassandra? 
Unfortunately, that won't work since the SQL db only contains the keys that we want to _keep_ in cassandra.

> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace using the SQL database records as a basis for what is actually "relevant" within the new keyspace.  
I like that idea. So instead of iterating over all cassandra rows, I would iterate over the SQL DB, which would indeed save me a lot of IO. However, rows inserted into my CF while iterating over the SQL DB might not be copied into the new keyspace. But maybe we could arrange to do that 
during low-demand hours to minimize the number of new inserts, and additionally run the "copy" a second time with a select on newly inserted SQL rows. So we'll probably go with that.
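
To make that concrete, here is a rough sketch of the SQL-driven copy with pycassa; keyspace, CF, table and column names are placeholders, and the commented-out WHERE clause is where the second, incremental pass over newly inserted SQL rows would go:

# Sketch: iterate over the keys stored in MySQL (the rows to keep) and copy
# each row from the old keyspace into the same CF in the new keyspace.
# pycassa/MySQLdb assumed; all names are placeholders, not the real schema.
import MySQLdb
import pycassa
from pycassa import NotFoundException

old_cf = pycassa.ColumnFamily(
    pycassa.ConnectionPool("OldKeyspace", ["cass-node1:9160"]), "MyCF")
new_cf = pycassa.ColumnFamily(
    pycassa.ConnectionPool("NewKeyspace", ["cass-node1:9160"]), "MyCF")

db = MySQLdb.connect(host="sqlhost", user="app", passwd="secret", db="appdb")
cur = db.cursor()
cur.execute("SELECT row_key FROM my_table")
# second pass: ... WHERE created_at > <timestamp of the first pass>

batch = new_cf.batch(queue_size=100)   # buffer writes into batch mutations
for (key,) in cur:
    try:
        # Default column_count is 100 in pycassa; raise it for wide rows.
        columns = old_cf.get(key)
    except NotFoundException:
        continue                        # key is in SQL but not in cassandra
    batch.insert(key, columns)
batch.send()                            # flush whatever is still buffered

With a big key table you would probably also want a server-side cursor (MySQLdb.cursors.SSCursor) so the whole result set doesn't land in client memory.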

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in cassandra?  
> 
> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace using the SQL database records as a basis for what is actually "relevant" within the new keyspace.  If you could perform that transfer, then you could just delete the old 1.5TB keyspace altogether, leaving only the data you need.  If that sort of duplication would put you over the 1.8TB limit during the transfer, then maybe you could consider CF compression upfront.
> 
> Short of that, I can tell from experience that doing this sort of "left join" delete from cassandra to SQL really sucks.  We have had to resort to using hadoop to do this, but since our hadoop/cassandra clusters are much larger than our single SQL instances, keeping all the hadoop processes from basically "DDoS"ing our SQL servers while still making the process faster than thrift iterations over all the rows (via custom programs) in cassandra hasn't been a convincing solution.
> 
> I'd say that the first solution I proposed is definitely the best, but also the most unrealistic.  If that's really not a possibility for you, then I'd seriously look at trying to make my second suggestion work even if it means bringing up new hardware or increasing the capacity of existing resources.  That second suggestion also has the added benefit of likely minimizing I/O since it's the only solution that doesn't require reading or deleting any of the unnecessary data (beyond wholesale keyspace or CF deletions), assuming that the actually relevant portion of your data is significantly less than 1.5TB.  
> 
> I hope that helps!
> 
> And in the future, you should really try to avoid letting your data size get beyond 40-50% of your actual on-disk capacity.  Let me know if anyone in the community disagrees, but I'd say you're about 600 GB past the point at which you have a lot of easy outs -- but I hope you find one anyways!
> 
> 
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <ma...@chors.de> wrote:
> We're running an 8-node cluster with different CFs for different applications. One of the applications uses 1.5TB out of 1.8TB in total, but only because we started out without a deletion mechanism and implemented one later on. So there is probably a large amount of old data in there that we don't even use anymore.
> 
> Now we want to delete that data. To know which rows we may delete, we have to look up a SQL database. If the key is not in there anymore, we may delete that row in cassandra, too.
> 
> This basically means we have to iterate over all the rows in that CF. This kind of begs for hadoop, but that doesn't seem to be an option currently. I tried.
> 
> So we figured we could run over the sstable files (maybe only the index), check the keys against MySQL, and later run the deletes on the cluster. This way, we could iterate on each node in parallel.
> 
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use hadoop for that?
> 
> Cheers
> Marcel
> 


Re: Get all keys from the cluster

Posted by Eric Czech <er...@nextbigsound.com>.
Is there any way that you could do that lookup in reverse where you pull
the records from your SQL database, figure out which keys aren't necessary,
and then delete any unnecessary keys that may or may not exist in
cassandra?

If that's not a possibility, then what about creating the same Cassandra
schema in a different keyspace and copying all the relevant records from
the current keyspace to the new keyspace using the SQL database records as
a basis for what is actually "relevant" within the new keyspace.  If you
could perform that transfer, then you could just delete the old 1.5TB
keyspace altogether, leaving only the data you need.  If that sort of
duplication would put you over the 1.8TB limit during the transfer, then
maybe you could consider CF compression upfront.

Short of that, I can tell from experience that doing this sort of "left
join" delete from cassandra to SQL really sucks.  We have had to resort to
using hadoop to do this, but since our hadoop/cassandra clusters are much
larger than our single SQL instances, keeping all the hadoop processes from
basically "DDoS"ing our SQL servers while still making the process faster
than thrift iterations over all the rows (via custom programs) in cassandra
hasn't been a convincing solution.

I'd say that the first solution I proposed is definitely the best, but also
the most unrealistic.  If that's really not a possibility for you, then I'd
seriously look at trying to make my second suggestion work even if it means
bringing up new hardware or increasing the capacity of existing resources.
That second suggestion also has the added benefit of likely minimizing I/O
since it's the only solution that doesn't require reading or deleting any
of the unnecessary data (beyond wholesale keyspace or CF deletions),
assuming that the actually relevant portion of your data is significantly
less than 1.5TB.

I hope that helps!

And in the future, you should really try to avoid letting your data size
get beyond 40-50% of your actual on-disk capacity.  Let me know if anyone
in the community disagrees, but I'd say you're about 600 GB past the point
at which you have a lot of easy outs -- but I hope you find one anyways!


On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <marcel.steinbach@chors.de> wrote:

> We're running an 8-node cluster with different CFs for different
> applications. One of the applications uses 1.5TB out of 1.8TB in total, but
> only because we started out without a deletion mechanism and implemented one
> later on. So there is probably a large amount of old data in there that we
> don't even use anymore.
>
> Now we want to delete that data. To know which rows we may delete, we
> have to look up a SQL database. If the key is not in there anymore, we may
> delete that row in cassandra, too.
>
> This basically means we have to iterate over all the rows in that CF.
> This kind of begs for hadoop, but that doesn't seem to be an option
> currently. I tried.
>
> So we figured we could run over the sstable files (maybe only the
> index), check the keys against MySQL, and later run the deletes on the
> cluster. This way, we could iterate on each node in parallel.
>
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to
> use hadoop for that?
>
> Cheers
> Marcel
>