Posted to user@cassandra.apache.org by Leena Ghatpande <lg...@hotmail.com> on 2019/03/11 15:55:15 UTC

Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

We have a table with over 70M rows and a partition key that is unique. Each record has a created datetime stamp, and we need to select all rows created within a given date range. A secondary index is not an option, as the column has high cardinality and a full scan over 70M rows could hurt performance.


We are thinking of creating a new table with a date field as a clustering column so that we can query by date range, but the partition key to clustering key mapping will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster while updates are still happening to table1. We need to do this in real time without impacting our customers. COPY is not an option, as we have TTLs on each row of table1 that need to be applied to table2 as well.


So what would be the best approach:

  1.  To be able to select data using a date range without impacting performance. This operation will be needed only on an ad hoc basis and will not be frequent.
  2.  Best way to migrate a large volume of data with TTLs from one table to another within the same cluster.


Any other suggestions will also be greatly appreciated.


Re: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Rahul Singh <ra...@gmail.com>.
Adding to Stefan's comment: there is a "scylladb" migrator which uses the
Spark connector from DataStax and can theoretically work on any Cassandra-
compliant DB; it should not be limited to Cassandra-to-Scylla migrations.

https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/

https://github.com/scylladb/scylla-migrator

On Thu, Mar 14, 2019 at 3:04 PM Durity, Sean R <SE...@homedepot.com>
wrote:

> The possibility of a highly available way to do this gives more
> challenges. I would be weighing the cost of a complex solution vs the
> possibility of a maintenance window when you stop your app to move the
> data, then restart.
>
>
>
> For the straight copy of the data, I am currently enamored with DataStax’s
> dsbulk utility for unloading and loading larger amounts of data. I don’t
> have extensive experience, yet, but it has been fast enough in my
> experiments – and that is without doing too much tuning for speed. From a
> host not in the cluster, I was able to extract 3.5 million rows in about 11
> seconds. I inserted them into a differently partitioned table in about 26
> seconds. Very small data rows, but it was impressive for not doing much to
> try and speed it up further. (In some other tests, it was about ¼ the time
> of simple copy statement from cqlsh)
>
>
>
> If I was designing something for a “can’t take an outage” scenario, I
> would start with:
>
> -          Writing the data to the old and new tables on all inserts
>
> -          On reads, read from the new table first. If not there, read
> from the old table <-- could introduce some latency, but would be
> available; could also do asynchronous reads on both tables and choose the
> latest
>
> -          Do this until the data has been copied from old to new (with
> dsbulk or custom code or Spark)
>
> -          Drop the double writes and conditional reads
>
>
>
>
>
> Sean
>
>
>
> *From:* Stefan Miklosovic <st...@instaclustr.com>
> *Sent:* Wednesday, March 13, 2019 6:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
>
>
> Hi Leena,
>
>
>
> as already suggested in my previous email, you could use Apache Spark and
> Cassandra Spark connector (1). I have checked TTLs and I believe you should
> especially read this section (2) about TTLs. Seems like thats what you need
> to do, ttls per row. The workflow would be that you read from your source
> table, making transformations per row (via some mapping) and then you would
> save it to new table.
>
>
>
> This would import it "all" but until you switch to the new table and
> records are still being saved into the original one, I am not sure how to
> cover "the gap" in such sense that once you make the switch, you would miss
> records which were created in the first table after you did the loading.
> You could maybe leverage Spark streaming (Cassandra connector knows that
> too) so you would make this transformation on the fly with new ones.
>
>
>
> (1) https://github.com/datastax/spark-cassandra-connector
>
> (2)
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row
>
>
>
>
>
> On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande <lg...@hotmail.com>
> wrote:
>
> Understand, 2nd table would be a better approach. So what would be the
> best way to copy 70M rows from current table to the 2nd table with ttl set
> on each record as the first table?
>
>
> ------------------------------
>
> *From:* Durity, Sean R <SE...@homedepot.com>
> *Sent:* Wednesday, March 13, 2019 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
>
>
> Correct, there is no current flag. I think there SHOULD be one.
>
>
>
>
>
> *From:* Dieudonné Madishon NGAYA <dm...@gmail.com>
> *Sent:* Tuesday, March 12, 2019 7:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Hi Sean, you can’t flag in Cassandra.yaml not allowing allow filtering ,
> the only thing you can do will be from your data model .
>
> Don’t ask Cassandra to query all data from table but the ideal query will
> be using single partition.
>
>
>
> On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Sean,
>
>
>
> for sure, the best approach would be to create another table which would
> treat just that specific query.
>
>
>
> How do I set the flag for not allowing allow filtering in cassandra.yaml?
> I read a doco and there seems to be nothing about that.
>
>
>
> Regards
>
>
>
> On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>
> wrote:
>
> If there are 2 access patterns, I would consider having 2 tables. The
> first one with the ID, which you say is the majority use case.  Then have a
> second table that uses a time-bucket approach as others have suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
>
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look-up the data in the primary table as
> needed.
>
>
>
> If you use allow filtering, you are setting yourself up for failure to
> scale. I tell my developers, “if you use allow filtering, you are doing it
> wrong.” In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
>
>
>
>
>
>
>
> *From:* Leena Ghatpande <lg...@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <st...@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Our data model cannot be like below as you have recommended as majority of
> the reads need to select the data by the partition key (id) only, not by
> date.
>
> You could remodel your data in such way that you would make primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
>
>
>
>
> By adding the date as clustering column, yes the idea was to use the Allow
> Filtering on the date and pull the records. Understand that it is not
> recommended to do this, but we have been doing this on another existing
> large table and have not run into any issue so far. But want to understand
> if there is a better approach to this?
>
>
>
> Thanks
>
>
> ------------------------------
>
> *From:* Stefan Miklosovic <st...@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
>
>
> The query which does not work should be like this, I made a mistake there
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
>
>
>
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Leena,
>
>
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
>
>
> If you want to select by some time range here, I am wondering how would
> making datetime a clustering column help you here? You still have to
> provide primary key, right?
>
>
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you got the idea)
>
>
>
> If you make my_date clustering column, you cant not do this below, because
> you still have to specify partition key fully and then clustering key
> (optionally) where you can further order and do ranges. But you cant do a
> query without specifying partition key. Well, you can use ALLOW FILTERING
> but you do not want to do this at all in your situation as it would scan
> everything.
>
>
>
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
>
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
>
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>
>
> You could remodel your data in such way that you would make primary key
> like this
>
>
>
> ((date), hour-minute, id)
>
>
>
> or
>
>
>
> ((date, hour-minute), id)
>
>
>
> I would prefer the second one because if you expect a lot of data per day,
> they would all end up on same set of replicas as hash of partition key
> would be same whole day if you have same date all day so I think you would
> end up with hotspots. You want to have your data spread more evenly so the
> second one seems to be better to me.
>
>
>
> You can also investigate how to do this with materialized view but I am
> not sure about the performance here.
>
>
>
> If you want to copy data you can do this e.g. by Cassandra Spark
> connector, you would just read table and as you read it you would write to
> another one. That is imho the fastest approach and the least error prone.
> You can do that on live production data and you can just make a "switch"
> afterwards. Not sure about ttls but that should be transparent while
> copying that.
>
>
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>
> wrote:
>
> We have a table with over 70M rows with a partition key that is unique.
> We have a  created datetime stamp on each record, and we have a need to
> select all rows created for a date range. Secondary index is not an option
> as its high cardinality and could slow performance doing a full scan on 70M
> rows.
>
>
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this real time without impacting our customers. COPY
> is not an option, as we have ttl's on each row on table1 that need to be
> applied to table2 as well.
>
>
>
> So what would be the best approach
>
>    1. To be able select data using date range without impacting
>    performance. This operation will be needed only on adhoc basis and it wont
>    be as frequent .
>    2. Best way to migrate large volume of data with ttl from one table to
>    another within the same cluster.
>
>
>
> Any other suggestions also will be greatly appreciated.
>
>
>
>
>
>
> Stefan Miklosovic
>
>
>
> Stefan Miklosovic
>
>
>
>
> Stefan Miklosovic
>
> --
>
>
>
> Best regards
>
> _____________________________________________________________
>
>
>
>
> *Dieudonne Madishon NGAYA*
> Datastax, Cassandra Architect
> *P: *7048580065
> *w: *www.dmnbigdata.com
> *E: *dmngaya@dmnbigdata.com
> *Private E: *dmngaya@gmail.com
> *A: *Charlotte,NC,28273, USA
>
>
>
>
>
>
>
>
>
>
> Stefan Miklosovic
>
>

RE: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by "Durity, Sean R" <SE...@homedepot.com>.
The possibility of a highly available way to do this gives more challenges. I would be weighing the cost of a complex solution vs the possibility of a maintenance window when you stop your app to move the data, then restart.

For the straight copy of the data, I am currently enamored with DataStax’s dsbulk utility for unloading and loading larger amounts of data. I don’t have extensive experience, yet, but it has been fast enough in my experiments – and that is without doing too much tuning for speed. From a host not in the cluster, I was able to extract 3.5 million rows in about 11 seconds. I inserted them into a differently partitioned table in about 26 seconds. Very small data rows, but it was impressive for not doing much to try and speed it up further. (In some other tests, it was about ¼ the time of simple copy statement from cqlsh)

If I was designing something for a “can’t take an outage” scenario, I would start with:

-          Writing the data to the old and new tables on all inserts

-          On reads, read from the new table first. If not there, read from the old table <-- could introduce some latency, but would be available; could also do asynchronous reads on both tables and choose the latest

-          Do this until the data has been copied from old to new (with dsbulk or custom code or Spark)

-          Drop the double writes and conditional reads
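
A rough sketch of the first two bullets above, using the DataStax Java driver (3.x) from Scala. The keyspace, table names, columns, and the new table's bucket key are illustrative assumptions, not the actual schema:

import com.datastax.driver.core.{Cluster, Row}
import java.util.UUID

object DualWriteSketch {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("ks")

  // Double write: every insert goes to both tables, carrying the same TTL.
  def insert(id: UUID, bucket: String, data: String, ttlSeconds: Int): Unit = {
    session.execute("INSERT INTO old_table (id, data) VALUES (?, ?) USING TTL ?",
      id, data, Int.box(ttlSeconds))
    session.execute("INSERT INTO new_table (bucket, id, data) VALUES (?, ?, ?) USING TTL ?",
      bucket, id, data, Int.box(ttlSeconds))
  }

  // Conditional read: try the new table first, fall back to the old one if the
  // row has not been copied over yet.
  def read(id: UUID, bucket: String): Option[Row] = {
    val fromNew = session.execute(
      "SELECT * FROM new_table WHERE bucket = ? AND id = ?", bucket, id).one()
    Option(fromNew).orElse(Option(
      session.execute("SELECT * FROM old_table WHERE id = ?", id).one()))
  }
}

Once the bulk copy (dsbulk, Spark, or custom code) has caught up, the second write and the fallback read can be dropped, as in the last two bullets.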


Sean

From: Stefan Miklosovic <st...@instaclustr.com>
Sent: Wednesday, March 13, 2019 6:39 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Hi Leena,

as already suggested in my previous email, you could use Apache Spark and Cassandra Spark connector (1). I have checked TTLs and I believe you should especially read this section (2) about TTLs. Seems like thats what you need to do, ttls per row. The workflow would be that you read from your source table, making transformations per row (via some mapping) and then you would save it to new table.

This would import it "all" but until you switch to the new table and records are still being saved into the original one, I am not sure how to cover "the gap" in such sense that once you make the switch, you would miss records which were created in the first table after you did the loading. You could maybe leverage Spark streaming (Cassandra connector knows that too) so you would make this transformation on the fly with new ones.

(1) https://github.com/datastax/spark-cassandra-connector
(2) https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row


On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande <lg...@hotmail.com>> wrote:
Understand, 2nd table would be a better approach. So what would be the best way to copy 70M rows from current table to the 2nd table with ttl set on each record as the first table?

________________________________
From: Durity, Sean R <SE...@homedepot.com>>
Sent: Wednesday, March 13, 2019 8:17 AM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: RE: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.


Correct, there is no current flag. I think there SHOULD be one.





From: Dieudonné Madishon NGAYA <dm...@gmail.com>>
Sent: Tuesday, March 12, 2019 7:17 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



Hi Sean, you can’t flag in Cassandra.yaml not allowing allow filtering , the only thing you can do will be from your data model .

Don’t ask Cassandra to query all data from table but the ideal query will be using single partition.



On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <st...@instaclustr.com>> wrote:

Hi Sean,



for sure, the best approach would be to create another table which would treat just that specific query.



How do I set the flag for not allowing allow filtering in cassandra.yaml? I read a doco and there seems to be nothing about that.



Regards



On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>> wrote:

If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case.  Then have a second table that uses a time-bucket approach as others have suggested:

(time bucket, id) as primary key

Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.



You could include all relevant data in the second table to meet your query. OR, if that data seems too large or too volatile to duplicate, just include your primary key and look-up the data in the primary table as needed.



If you use allow filtering, you are setting yourself up for failure to scale. I tell my developers, “if you use allow filtering, you are doing it wrong.” In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to not allow filtering at all. The cluster should be able to protect itself from bad queries.







From: Leena Ghatpande <lg...@hotmail.com>>
Sent: Tuesday, March 12, 2019 9:02 AM
To: Stefan Miklosovic <st...@instaclustr.com>>; user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



Our data model cannot be like below as you have recommended as majority of the reads need to select the data by the partition key (id) only, not by date.

You could remodel your data in such way that you would make primary key like this

((date), hour-minute, id)

or

((date, hour-minute), id)





By adding the date as clustering column, yes the idea was to use the Allow Filtering on the date and pull the records. Understand that it is not recommended to do this, but we have been doing this on another existing large table and have not run into any issue so far. But want to understand if there is a better approach to this?



Thanks



________________________________

From: Stefan Miklosovic <st...@instaclustr.com>>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



The query which does not work should be like this, I made a mistake there



cqlsh> SELECT * from my_keyspace.my_table where  number > 2;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"





On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <st...@instaclustr.com>> wrote:

Hi Leena,



"We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"



If you want to select by some time range here, I am wondering how would making datetime a clustering column help you here? You still have to provide primary key, right?



E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date < tomorrow (you got the idea)



If you make my_date clustering column, you cant not do this below, because you still have to specify partition key fully and then clustering key (optionally) where you can further order and do ranges. But you cant do a query without specifying partition key. Well, you can use ALLOW FILTERING but you do not want to do this at all in your situation as it would scan everything.



select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow



cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};

cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));



cqlsh> SELECT * from my_keyspace.my_table ;



 id                                   | number

--------------------------------------+--------

 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1

 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4

 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2



cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"



cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;



 id                                   | number

--------------------------------------+--------

 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3



You could remodel your data in such way that you would make primary key like this



((date), hour-minute, id)



or



((date, hour-minute), id)



I would prefer the second one because if you expect a lot of data per day, they would all end up on same set of replicas as hash of partition key would be same whole day if you have same date all day so I think you would end up with hotspots. You want to have your data spread more evenly so the second one seems to be better to me.



You can also investigate how to do this with materialized view but I am not sure about the performance here.



If you want to copy data you can do this e.g. by Cassandra Spark connector, you would just read table and as you read it you would write to another one. That is imho the fastest approach and the least error prone. You can do that on live production data and you can just make a "switch" afterwards. Not sure about ttls but that should be transparent while copying that.



On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We have a  created datetime stamp on each record, and we have a need to select all rows created for a date range. Secondary index is not an option as its high cardinality and could slow performance doing a full scan on 70M rows.



We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this real time without impacting our customers. COPY is not an option, as we have ttl's on each row on table1 that need to be applied to table2 as well.



So what would be the best approach

  1.  To be able select data using date range without impacting performance. This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to another within the same cluster.



Any other suggestions also will be greatly appreciated.






Stefan Miklosovic



Stefan Miklosovic





Stefan Miklosovic

--



Best regards

_____________________________________________________________






Dieudonne Madishon NGAYA
Datastax, Cassandra Architect
P: 7048580065<tel:7048580065>
w: www.dmnbigdata.com
E: dmngaya@dmnbigdata.com<ma...@dmnbigdata.com>
Private E: dmngaya@gmail.com<ma...@gmail.com>
A: Charlotte,NC,28273, USA











Stefan Miklosovic


Re: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Stefan Miklosovic <st...@instaclustr.com>.
Hi Leena,

as already suggested in my previous email, you could use Apache Spark and the
Cassandra Spark connector (1). I have checked TTLs and I believe you should
especially read this section (2) about TTLs. It seems like that is what you need
to do: TTLs per row. The workflow would be that you read from your source
table, apply a transformation per row (via some mapping), and then save the
result to the new table.

This would import it "all", but while records are still being saved into the
original table before you switch over, I am not sure how to cover "the gap":
once you make the switch, you would miss records that were created in the first
table after you did the loading. You could maybe leverage Spark Streaming (the
Cassandra connector supports that too) so the transformation is also applied on
the fly to new records.

(1) https://github.com/datastax/spark-cassandra-connector
(2)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row
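
A minimal sketch of that workflow with the connector's RDD API, reading each row's remaining TTL and writing it back with TTLOption.perRow as described in (2). The keyspace, table and column names (ks, table1, table2, id, data, created_at) and the date bucket are assumptions for illustration, not the actual schema:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
import org.apache.spark.{SparkConf, SparkContext}

// Assumed target schema, partitioned for date-range queries:
//   CREATE TABLE ks.table2 (bucket text, id uuid, data text, created_at timestamp,
//                           PRIMARY KEY ((bucket), id));
case class NewRow(bucket: String, id: java.util.UUID, data: String,
                  created_at: java.util.Date, ttl: Int)

object MigrateWithTtl {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("table1-to-table2")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    sc.cassandraTable("ks", "table1")
      // read the remaining TTL of a non-key column alongside the data
      .select("id", "data", "created_at", "data".ttl as "ttl")
      .map { row =>
        val created = row.getDate("created_at")
        val bucket = new java.text.SimpleDateFormat("yyyy-MM-dd").format(created)
        // every row in table1 is stated to carry a TTL, so getInt is assumed safe
        NewRow(bucket, row.getUUID("id"), row.getString("data"), created,
               row.getInt("ttl"))
      }
      // the ttl field is consumed as the per-row TTL, following the pattern in (2)
      .saveToCassandra("ks", "table2",
        writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
  }
}

The gap mentioned above could then be covered by the double-write approach described elsewhere in this thread, so rows created after the batch copy starts land in table2 as well.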


On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande <lg...@hotmail.com>
wrote:

> Understand, 2nd table would be a better approach. So what would be the
> best way to copy 70M rows from current table to the 2nd table with ttl set
> on each record as the first table?
>
> ------------------------------
> *From:* Durity, Sean R <SE...@homedepot.com>
> *Sent:* Wednesday, March 13, 2019 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
>
> Correct, there is no current flag. I think there SHOULD be one.
>
>
>
>
>
> *From:* Dieudonné Madishon NGAYA <dm...@gmail.com>
> *Sent:* Tuesday, March 12, 2019 7:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Hi Sean, you can’t flag in Cassandra.yaml not allowing allow filtering ,
> the only thing you can do will be from your data model .
>
> Don’t ask Cassandra to query all data from table but the ideal query will
> be using single partition.
>
>
>
> On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Sean,
>
>
>
> for sure, the best approach would be to create another table which would
> treat just that specific query.
>
>
>
> How do I set the flag for not allowing allow filtering in cassandra.yaml?
> I read a doco and there seems to be nothing about that.
>
>
>
> Regards
>
>
>
> On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>
> wrote:
>
> If there are 2 access patterns, I would consider having 2 tables. The
> first one with the ID, which you say is the majority use case.  Then have a
> second table that uses a time-bucket approach as others have suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
>
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look-up the data in the primary table as
> needed.
>
>
>
> If you use allow filtering, you are setting yourself up for failure to
> scale. I tell my developers, “if you use allow filtering, you are doing it
> wrong.” In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
>
>
>
>
>
>
>
> *From:* Leena Ghatpande <lg...@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <st...@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Our data model cannot be like below as you have recommended as majority of
> the reads need to select the data by the partition key (id) only, not by
> date.
>
> You could remodel your data in such way that you would make primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
>
>
>
>
> By adding the date as clustering column, yes the idea was to use the Allow
> Filtering on the date and pull the records. Understand that it is not
> recommended to do this, but we have been doing this on another existing
> large table and have not run into any issue so far. But want to understand
> if there is a better approach to this?
>
>
>
> Thanks
>
>
> ------------------------------
>
> *From:* Stefan Miklosovic <st...@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
>
>
> The query which does not work should be like this, I made a mistake there
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
>
>
>
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Leena,
>
>
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
>
>
> If you want to select by some time range here, I am wondering how would
> making datetime a clustering column help you here? You still have to
> provide primary key, right?
>
>
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you got the idea)
>
>
>
> If you make my_date clustering column, you cant not do this below, because
> you still have to specify partition key fully and then clustering key
> (optionally) where you can further order and do ranges. But you cant do a
> query without specifying partition key. Well, you can use ALLOW FILTERING
> but you do not want to do this at all in your situation as it would scan
> everything.
>
>
>
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
>
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
>
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>
>
> You could remodel your data in such way that you would make primary key
> like this
>
>
>
> ((date), hour-minute, id)
>
>
>
> or
>
>
>
> ((date, hour-minute), id)
>
>
>
> I would prefer the second one because if you expect a lot of data per day,
> they would all end up on same set of replicas as hash of partition key
> would be same whole day if you have same date all day so I think you would
> end up with hotspots. You want to have your data spread more evenly so the
> second one seems to be better to me.
>
>
>
> You can also investigate how to do this with materialized view but I am
> not sure about the performance here.
>
>
>
> If you want to copy data you can do this e.g. by Cassandra Spark
> connector, you would just read table and as you read it you would write to
> another one. That is imho the fastest approach and the least error prone.
> You can do that on live production data and you can just make a "switch"
> afterwards. Not sure about ttls but that should be transparent while
> copying that.
>
>
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>
> wrote:
>
> We have a table with over 70M rows with a partition key that is unique.
> We have a  created datetime stamp on each record, and we have a need to
> select all rows created for a date range. Secondary index is not an option
> as its high cardinality and could slow performance doing a full scan on 70M
> rows.
>
>
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this real time without impacting our customers. COPY
> is not an option, as we have ttl's on each row on table1 that need to be
> applied to table2 as well.
>
>
>
> So what would be the best approach
>
>    1. To be able select data using date range without impacting
>    performance. This operation will be needed only on adhoc basis and it wont
>    be as frequent .
>    2. Best way to migrate large volume of data with ttl from one table to
>    another within the same cluster.
>
>
>
> Any other suggestions also will be greatly appreciated.
>
>
>
>
>
>
> Stefan Miklosovic
>
>
>
> Stefan Miklosovic
>
>
>
>
> Stefan Miklosovic
>
> --
>
>
>
> Best regards
>
> _____________________________________________________________
>
>
>
>
> *Dieudonne Madishon NGAYA*
> Datastax, Cassandra Architect
> *P: *7048580065
> *w: *www.dmnbigdata.com
> *E: *dmngaya@dmnbigdata.com
> *Private E: *dmngaya@gmail.com
> *A: *Charlotte,NC,28273, USA
>
>
>
>
>
>


Stefan Miklosovic

Re: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Leena Ghatpande <lg...@hotmail.com>.
Understood, a 2nd table would be a better approach. So what would be the best way to copy the 70M rows from the current table to the 2nd table, with the TTL set on each record as in the first table?

________________________________
From: Durity, Sean R <SE...@homedepot.com>
Sent: Wednesday, March 13, 2019 8:17 AM
To: user@cassandra.apache.org
Subject: RE: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.


Correct, there is no current flag. I think there SHOULD be one.





From: Dieudonné Madishon NGAYA <dm...@gmail.com>
Sent: Tuesday, March 12, 2019 7:17 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



Hi Sean, you can’t flag in Cassandra.yaml not allowing allow filtering , the only thing you can do will be from your data model .

Don’t ask Cassandra to query all data from table but the ideal query will be using single partition.



On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <st...@instaclustr.com>> wrote:

Hi Sean,



for sure, the best approach would be to create another table which would treat just that specific query.



How do I set the flag for not allowing allow filtering in cassandra.yaml? I read a doco and there seems to be nothing about that.



Regards



On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>> wrote:

If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case.  Then have a second table that uses a time-bucket approach as others have suggested:

(time bucket, id) as primary key

Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.



You could include all relevant data in the second table to meet your query. OR, if that data seems too large or too volatile to duplicate, just include your primary key and look-up the data in the primary table as needed.



If you use allow filtering, you are setting yourself up for failure to scale. I tell my developers, “if you use allow filtering, you are doing it wrong.” In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to not allow filtering at all. The cluster should be able to protect itself from bad queries.







From: Leena Ghatpande <lg...@hotmail.com>>
Sent: Tuesday, March 12, 2019 9:02 AM
To: Stefan Miklosovic <st...@instaclustr.com>>; user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



Our data model cannot be like below as you have recommended as majority of the reads need to select the data by the partition key (id) only, not by date.

You could remodel your data in such way that you would make primary key like this

((date), hour-minute, id)

or

((date, hour-minute), id)





By adding the date as clustering column, yes the idea was to use the Allow Filtering on the date and pull the records. Understand that it is not recommended to do this, but we have been doing this on another existing large table and have not run into any issue so far. But want to understand if there is a better approach to this?



Thanks



________________________________

From: Stefan Miklosovic <st...@instaclustr.com>>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.



The query which does not work should be like this, I made a mistake there



cqlsh> SELECT * from my_keyspace.my_table where  number > 2;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"





On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <st...@instaclustr.com>> wrote:

Hi Leena,



"We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"



If you want to select by some time range here, I am wondering how would making datetime a clustering column help you here? You still have to provide primary key, right?



E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date < tomorrow (you got the idea)



If you make my_date clustering column, you cant not do this below, because you still have to specify partition key fully and then clustering key (optionally) where you can further order and do ranges. But you cant do a query without specifying partition key. Well, you can use ALLOW FILTERING but you do not want to do this at all in your situation as it would scan everything.



select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow



cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};

cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));



cqlsh> SELECT * from my_keyspace.my_table ;



 id                                   | number

--------------------------------------+--------

 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1

 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4

 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2



cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"



cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;



 id                                   | number

--------------------------------------+--------

 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3



You could remodel your data in such way that you would make primary key like this



((date), hour-minute, id)



or



((date, hour-minute), id)



I would prefer the second one: if you expect a lot of data per day and the partition key is only the date, all of that day's rows hash to the same partition and therefore land on the same set of replicas, so I think you would end up with hotspots. You want to have your data spread more evenly, so the second option seems better to me.



You can also investigate how to do this with a materialized view, but I am not sure about the performance there.



If you want to copy data you can do this e.g. with the Cassandra Spark connector: you would just read the table and, as you read it, write to the other one. That is imho the fastest and least error-prone approach. You can do that on live production data and just make a "switch" afterwards. Not sure about TTLs, but handling them while copying should be doable.



On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We have a  created datetime stamp on each record, and we have a need to select all rows created for a date range. Secondary index is not an option as its high cardinality and could slow performance doing a full scan on 70M rows.



We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this real time without impacting our customers. COPY is not an option, as we have ttl's on each row on table1 that need to be applied to table2 as well.



So what would be the best approach

  1.  To be able select data using date range without impacting performance. This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to another within the same cluster.



Any other suggestions also will be greatly appreciated.






Stefan Miklosovic



Stefan Miklosovic



________________________________



Stefan Miklosovic

--



Best regards

_____________________________________________________________






Dieudonne Madishon NGAYA
Datastax, Cassandra Architect
P: 7048580065<tel:7048580065>
w: www.dmnbigdata.com
E: dmngaya@dmnbigdata.com<ma...@dmnbigdata.com>
Private E: dmngaya@gmail.com<ma...@gmail.com>
A: Charlotte,NC,28273, USA









________________________________


RE: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by "Durity, Sean R" <SE...@homedepot.com>.
Correct, there is no current flag. I think there SHOULD be one.


From: Dieudonné Madishon NGAYA <dm...@gmail.com>
Sent: Tuesday, March 12, 2019 7:17 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Hi Sean, you can't set a flag in cassandra.yaml to disallow ALLOW FILTERING; the only thing you can do is address it in your data model.
Don't ask Cassandra to query all data from a table; the ideal query hits a single partition.

On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <st...@instaclustr.com>> wrote:
Hi Sean,

for sure, the best approach would be to create another table which would serve just that specific query.

How do I set the flag to disallow ALLOW FILTERING in cassandra.yaml? I read the docs and there seems to be nothing about that.

Regards

On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>> wrote:
If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case.  Then have a second table that uses a time-bucket approach as others have suggested:
(time bucket, id) as primary key
Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.

You could include all relevant data in the second table to meet your query. OR, if that data seems too large or too volatile to duplicate, just include your primary key and look-up the data in the primary table as needed.

If you use allow filtering, you are setting yourself up for failure to scale. I tell my developers, “if you use allow filtering, you are doing it wrong.” In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to not allow filtering at all. The cluster should be able to protect itself from bad queries.



From: Leena Ghatpande <lg...@hotmail.com>>
Sent: Tuesday, March 12, 2019 9:02 AM
To: Stefan Miklosovic <st...@instaclustr.com>>; user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Our data model cannot be like below as you have recommended as majority of the reads need to select the data by the partition key (id) only, not by date.
You could remodel your data in such way that you would make primary key like this
((date), hour-minute, id)
or
((date, hour-minute), id)


By adding the date as clustering column, yes the idea was to use the Allow Filtering on the date and pull the records. Understand that it is not recommended to do this, but we have been doing this on another existing large table and have not run into any issue so far. But want to understand if there is a better approach to this?

Thanks

________________________________
From: Stefan Miklosovic <st...@instaclustr.com>>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

The query which does not work should be like this, I made a mistake there

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <st...@instaclustr.com>> wrote:
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"

If you want to select by some time range here, I am wondering how would making datetime a clustering column help you here? You still have to provide primary key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date < tomorrow (you got the idea)

If you make my_date clustering column, you cant not do this below, because you still have to specify partition key fully and then clustering key (optionally) where you can further order and do ranges. But you cant do a query without specifying partition key. Well, you can use ALLOW FILTERING but you do not want to do this at all in your situation as it would scan everything.

select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such way that you would make primary key like this

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one because if you expect a lot of data per day, they would all end up on same set of replicas as hash of partition key would be same whole day if you have same date all day so I think you would end up with hotspots. You want to have your data spread more evenly so the second one seems to be better to me.

You can also investigate how to do this with materialized view but I am not sure about the performance here.

If you want to copy data you can do this e.g. by Cassandra Spark connector, you would just read table and as you read it you would write to another one. That is imho the fastest approach and the least error prone. You can do that on live production data and you can just make a "switch" afterwards. Not sure about ttls but that should be transparent while copying that.

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We have a  created datetime stamp on each record, and we have a need to select all rows created for a date range. Secondary index is not an option as its high cardinality and could slow performance doing a full scan on 70M rows.



We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this real time without impacting our customers. COPY is not an option, as we have ttl's on each row on table1 that need to be applied to table2 as well.



So what would be the best approach

  1.  To be able select data using date range without impacting performance. This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to another within the same cluster.



Any other suggestions also will be greatly appreciated.



Stefan Miklosovic

Stefan Miklosovic

________________________________


Stefan Miklosovic
--

Best regards
_____________________________________________________________




Dieudonne Madishon NGAYA
Datastax, Cassandra Architect
P: 7048580065<tel:7048580065>
w: www.dmnbigdata.com
E: dmngaya@dmnbigdata.com<ma...@dmnbigdata.com>
Private E: dmngaya@gmail.com<ma...@gmail.com>
A: Charlotte,NC,28273, USA






________________________________


Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Dieudonné Madishon NGAYA <dm...@gmail.com>.
Hi Sean, you can't set a flag in cassandra.yaml to disallow ALLOW
FILTERING; the only thing you can do is address it in your data model.
Don't ask Cassandra to query all data from a table; the ideal query hits
a single partition.

On Tue, Mar 12, 2019 at 6:46 PM Stefan Miklosovic <
stefan.miklosovic@instaclustr.com> wrote:

> Hi Sean,
>
> for sure, the best approach would be to create another table which would
> treat just that specific query.
>
> How do I set the flag for not allowing allow filtering in cassandra.yaml?
> I read a doco and there seems to be nothing about that.
>
> Regards
>
> On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>
> wrote:
>
>> If there are 2 access patterns, I would consider having 2 tables. The
>> first one with the ID, which you say is the majority use case.  Then have a
>> second table that uses a time-bucket approach as others have suggested:
>>
>> (time bucket, id) as primary key
>>
>> Choose a time bucket (day, week, hour, month, whatever) that would hold
>> less than 100 MB of data in the time-bucket partition.
>>
>>
>>
>> You could include all relevant data in the second table to meet your
>> query. OR, if that data seems too large or too volatile to duplicate, just
>> include your primary key and look-up the data in the primary table as
>> needed.
>>
>>
>>
>> If you use allow filtering, you are setting yourself up for failure to
>> scale. I tell my developers, “if you use allow filtering, you are doing it
>> wrong.” In fact, I think the Cassandra admin should be able to set a flag
>> in cassandra.yaml to not allow filtering at all. The cluster should be able
>> to protect itself from bad queries.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Leena Ghatpande <lg...@hotmail.com>
>> *Sent:* Tuesday, March 12, 2019 9:02 AM
>> *To:* Stefan Miklosovic <st...@instaclustr.com>;
>> user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
>> another table within the same cluster when COPY is not an option.
>>
>>
>>
>> Our data model cannot be like below as you have recommended as majority
>> of the reads need to select the data by the partition key (id) only, not by
>> date.
>>
>> You could remodel your data in such way that you would make primary key
>> like this
>>
>> ((date), hour-minute, id)
>>
>> or
>>
>> ((date, hour-minute), id)
>>
>>
>>
>>
>>
>> By adding the date as clustering column, yes the idea was to use the
>> Allow Filtering on the date and pull the records. Understand that it is not
>> recommended to do this, but we have been doing this on another existing
>> large table and have not run into any issue so far. But want to understand
>> if there is a better approach to this?
>>
>>
>>
>> Thanks
>>
>>
>> ------------------------------
>>
>> *From:* Stefan Miklosovic <st...@instaclustr.com>
>> *Sent:* Monday, March 11, 2019 7:12 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Migrate large volume of data from one table to another
>> table within the same cluster when COPY is not an option.
>>
>>
>>
>> The query which does not work should be like this, I made a mistake there
>>
>>
>>
>> cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
>>
>> InvalidRequest: Error from server: code=2200 [Invalid query]
>> message="Cannot execute this query as it might involve data filtering and
>> thus may have unpredictable performance. If you want to execute this query
>> despite the performance unpredictability, use ALLOW FILTERING"
>>
>>
>>
>>
>>
>> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
>> stefan.miklosovic@instaclustr.com> wrote:
>>
>> Hi Leena,
>>
>>
>>
>> "We are thinking of creating a new table with a date field as a
>> clustering column to be able to query for date ranges, but partition key to
>> clustering key will be 1-1. Is this a good approach?"
>>
>>
>>
>> If you want to select by some time range here, I am wondering how would
>> making datetime a clustering column help you here? You still have to
>> provide primary key, right?
>>
>>
>>
>> E.g. select * from your_keyspace.your_table where id=123 and my_date >
>> yesterday and my_date < tomorrow (you got the idea)
>>
>>
>>
>> If you make my_date clustering column, you cant not do this below,
>> because you still have to specify partition key fully and then clustering
>> key (optionally) where you can further order and do ranges. But you cant do
>> a query without specifying partition key. Well, you can use ALLOW FILTERING
>> but you do not want to do this at all in your situation as it would scan
>> everything.
>>
>>
>>
>> select * from your_keyspace.your_table where my_date > yesterday and
>> my_date < tomorrow
>>
>>
>>
>> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
>> 'NetworkTopologyStrategy', 'dc1': '1'};
>>
>> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY
>> KEY ((id), number));
>>
>>
>>
>> cqlsh> SELECT * from my_keyspace.my_table ;
>>
>>
>>
>>  id                                   | number
>>
>> --------------------------------------+--------
>>
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>
>>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>>
>>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>>
>>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>>
>>
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id =
>> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
>>
>> InvalidRequest: Error from server: code=2200 [Invalid query]
>> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
>> "id" of type uuid"
>>
>>
>>
>> cqlsh> SELECT * from my_keyspace.my_table where id =
>> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;
>>
>>
>>
>>  id                                   | number
>>
>> --------------------------------------+--------
>>
>>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>>
>>
>>
>> You could remodel your data in such way that you would make primary key
>> like this
>>
>>
>>
>> ((date), hour-minute, id)
>>
>>
>>
>> or
>>
>>
>>
>> ((date, hour-minute), id)
>>
>>
>>
>> I would prefer the second one because if you expect a lot of data per
>> day, they would all end up on same set of replicas as hash of partition key
>> would be same whole day if you have same date all day so I think you would
>> end up with hotspots. You want to have your data spread more evenly so the
>> second one seems to be better to me.
>>
>>
>>
>> You can also investigate how to do this with materialized view but I am
>> not sure about the performance here.
>>
>>
>>
>> If you want to copy data you can do this e.g. by Cassandra Spark
>> connector, you would just read table and as you read it you would write to
>> another one. That is imho the fastest approach and the least error prone.
>> You can do that on live production data and you can just make a "switch"
>> afterwards. Not sure about ttls but that should be transparent while
>> copying that.
>>
>>
>>
>> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>
>> wrote:
>>
>> We have a table with over 70M rows with a partition key that is unique.
>> We have a  created datetime stamp on each record, and we have a need to
>> select all rows created for a date range. Secondary index is not an option
>> as its high cardinality and could slow performance doing a full scan on 70M
>> rows.
>>
>>
>>
>> We are thinking of creating a new table with a date field as a clustering
>> column to be able to query for date ranges, but partition key to clustering
>> key will be 1-1. Is this a good approach?
>>
>> To do this, we need to copy this large volume of data from table1 to
>> table2 within the same cluster, while updates are still happening to
>> table1. We need to do this real time without impacting our customers. COPY
>> is not an option, as we have ttl's on each row on table1 that need to be
>> applied to table2 as well.
>>
>>
>>
>> So what would be the best approach
>>
>>    1. To be able select data using date range without impacting
>>    performance. This operation will be needed only on adhoc basis and it wont
>>    be as frequent .
>>    2. Best way to migrate large volume of data with ttl from one table
>>    to another within the same cluster.
>>
>>
>>
>> Any other suggestions also will be greatly appreciated.
>>
>>
>>
>>
>>
>>
>> Stefan Miklosovic
>>
>>
>>
>> Stefan Miklosovic
>>
>> ------------------------------
>>
>>
>
> Stefan Miklosovic
>
> --

Best regards
_____________________________________________________________


*Dieudonne Madishon NGAYA*
Datastax, Cassandra Architect
*P: *7048580065
*w: *www.dmnbigdata.com
*E: *dmngaya@dmnbigdata.com
*Private E: *dmngaya@gmail.com
*A: *Charlotte,NC,28273, USA

Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Stefan Miklosovic <st...@instaclustr.com>.
Hi Sean,

for sure, the best approach would be to create another table which would
serve just that specific query.

How do I set the flag to disallow ALLOW FILTERING in cassandra.yaml? I
read the docs and there seems to be nothing about that.

Regards

On Wed, 13 Mar 2019 at 06:57, Durity, Sean R <SE...@homedepot.com>
wrote:

> If there are 2 access patterns, I would consider having 2 tables. The
> first one with the ID, which you say is the majority use case.  Then have a
> second table that uses a time-bucket approach as others have suggested:
>
> (time bucket, id) as primary key
>
> Choose a time bucket (day, week, hour, month, whatever) that would hold
> less than 100 MB of data in the time-bucket partition.
>
>
>
> You could include all relevant data in the second table to meet your
> query. OR, if that data seems too large or too volatile to duplicate, just
> include your primary key and look-up the data in the primary table as
> needed.
>
>
>
> If you use allow filtering, you are setting yourself up for failure to
> scale. I tell my developers, “if you use allow filtering, you are doing it
> wrong.” In fact, I think the Cassandra admin should be able to set a flag
> in cassandra.yaml to not allow filtering at all. The cluster should be able
> to protect itself from bad queries.
>
>
>
>
>
>
>
> *From:* Leena Ghatpande <lg...@hotmail.com>
> *Sent:* Tuesday, March 12, 2019 9:02 AM
> *To:* Stefan Miklosovic <st...@instaclustr.com>;
> user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Our data model cannot be like below as you have recommended as majority of
> the reads need to select the data by the partition key (id) only, not by
> date.
>
> You could remodel your data in such way that you would make primary key
> like this
>
> ((date), hour-minute, id)
>
> or
>
> ((date, hour-minute), id)
>
>
>
>
>
> By adding the date as clustering column, yes the idea was to use the Allow
> Filtering on the date and pull the records. Understand that it is not
> recommended to do this, but we have been doing this on another existing
> large table and have not run into any issue so far. But want to understand
> if there is a better approach to this?
>
>
>
> Thanks
>
>
> ------------------------------
>
> *From:* Stefan Miklosovic <st...@instaclustr.com>
> *Sent:* Monday, March 11, 2019 7:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Migrate large volume of data from one table to another
> table within the same cluster when COPY is not an option.
>
>
>
> The query which does not work should be like this, I made a mistake there
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot execute this query as it might involve data filtering and
> thus may have unpredictable performance. If you want to execute this query
> despite the performance unpredictability, use ALLOW FILTERING"
>
>
>
>
>
> On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
> stefan.miklosovic@instaclustr.com> wrote:
>
> Hi Leena,
>
>
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
>
>
> If you want to select by some time range here, I am wondering how would
> making datetime a clustering column help you here? You still have to
> provide primary key, right?
>
>
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you got the idea)
>
>
>
> If you make my_date clustering column, you cant not do this below, because
> you still have to specify partition key fully and then clustering key
> (optionally) where you can further order and do ranges. But you cant do a
> query without specifying partition key. Well, you can use ALLOW FILTERING
> but you do not want to do this at all in your situation as it would scan
> everything.
>
>
>
> select * from your_keyspace.your_table where my_date > yesterday and
> my_date < tomorrow
>
>
>
> cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
> 'NetworkTopologyStrategy', 'dc1': '1'};
>
> cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
> ((id), number));
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table ;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>  abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
>
>  90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
>
>  8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
>
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
> "id" of type uuid"
>
>
>
> cqlsh> SELECT * from my_keyspace.my_table where id =
> 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;
>
>
>
>  id                                   | number
>
> --------------------------------------+--------
>
>  6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
>
>
>
> You could remodel your data in such way that you would make primary key
> like this
>
>
>
> ((date), hour-minute, id)
>
>
>
> or
>
>
>
> ((date, hour-minute), id)
>
>
>
> I would prefer the second one because if you expect a lot of data per day,
> they would all end up on same set of replicas as hash of partition key
> would be same whole day if you have same date all day so I think you would
> end up with hotspots. You want to have your data spread more evenly so the
> second one seems to be better to me.
>
>
>
> You can also investigate how to do this with materialized view but I am
> not sure about the performance here.
>
>
>
> If you want to copy data you can do this e.g. by Cassandra Spark
> connector, you would just read table and as you read it you would write to
> another one. That is imho the fastest approach and the least error prone.
> You can do that on live production data and you can just make a "switch"
> afterwards. Not sure about ttls but that should be transparent while
> copying that.
>
>
>
> On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>
> wrote:
>
> We have a table with over 70M rows with a partition key that is unique.
> We have a  created datetime stamp on each record, and we have a need to
> select all rows created for a date range. Secondary index is not an option
> as its high cardinality and could slow performance doing a full scan on 70M
> rows.
>
>
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this real time without impacting our customers. COPY
> is not an option, as we have ttl's on each row on table1 that need to be
> applied to table2 as well.
>
>
>
> So what would be the best approach
>
>    1. To be able select data using date range without impacting
>    performance. This operation will be needed only on adhoc basis and it wont
>    be as frequent .
>    2. Best way to migrate large volume of data with ttl from one table to
>    another within the same cluster.
>
>
>
> Any other suggestions also will be greatly appreciated.
>
>
>
>
>
>
> Stefan Miklosovic
>
>
>
> Stefan Miklosovic
>
> ------------------------------
>
>

Stefan Miklosovic

RE: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by "Durity, Sean R" <SE...@homedepot.com>.
If there are 2 access patterns, I would consider having 2 tables. The first one with the ID, which you say is the majority use case.  Then have a second table that uses a time-bucket approach as others have suggested:
(time bucket, id) as primary key
Choose a time bucket (day, week, hour, month, whatever) that would hold less than 100 MB of data in the time-bucket partition.

You could include all relevant data in the second table to meet your query. OR, if that data seems too large or too volatile to duplicate, just include your primary key and look up the data in the primary table as needed.
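
For illustration only, a minimal sketch of such a second table, using a day as the time bucket (the names table1_by_day, bucket and created below are placeholders, not taken from this thread):

cqlsh> CREATE TABLE my_keyspace.table1_by_day (bucket date, id uuid, created timestamp, PRIMARY KEY ((bucket), id));

cqlsh> SELECT id, created FROM my_keyspace.table1_by_day WHERE bucket = '2019-03-12';

A date range then becomes one single-partition query per bucket (one per day here) instead of a filtered scan, and the returned ids can be looked up in the main table when more columns are needed.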

If you use allow filtering, you are setting yourself up for failure to scale. I tell my developers, "if you use allow filtering, you are doing it wrong." In fact, I think the Cassandra admin should be able to set a flag in cassandra.yaml to not allow filtering at all. The cluster should be able to protect itself from bad queries.



From: Leena Ghatpande <lg...@hotmail.com>
Sent: Tuesday, March 12, 2019 9:02 AM
To: Stefan Miklosovic <st...@instaclustr.com>; user@cassandra.apache.org
Subject: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Our data model cannot be like below as you have recommended as majority of the reads need to select the data by the partition key (id) only, not by date.
You could remodel your data in such way that you would make primary key like this
((date), hour-minute, id)
or
((date, hour-minute), id)


By adding the date as clustering column, yes the idea was to use the Allow Filtering on the date and pull the records. Understand that it is not recommended to do this, but we have been doing this on another existing large table and have not run into any issue so far. But want to understand if there is a better approach to this?

Thanks

________________________________
From: Stefan Miklosovic <st...@instaclustr.com>>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

The query which does not work should be like this, I made a mistake there

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <st...@instaclustr.com>> wrote:
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"

If you want to select by some time range here, I am wondering how would making datetime a clustering column help you here? You still have to provide primary key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date < tomorrow (you got the idea)

If you make my_date clustering column, you cant not do this below, because you still have to specify partition key fully and then clustering key (optionally) where you can further order and do ranges. But you cant do a query without specifying partition key. Well, you can use ALLOW FILTERING but you do not want to do this at all in your situation as it would scan everything.

select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such way that you would make primary key like this

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one because if you expect a lot of data per day, they would all end up on same set of replicas as hash of partition key would be same whole day if you have same date all day so I think you would end up with hotspots. You want to have your data spread more evenly so the second one seems to be better to me.

You can also investigate how to do this with materialized view but I am not sure about the performance here.

If you want to copy data you can do this e.g. by Cassandra Spark connector, you would just read table and as you read it you would write to another one. That is imho the fastest approach and the least error prone. You can do that on live production data and you can just make a "switch" afterwards. Not sure about ttls but that should be transparent while copying that.

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We have a  created datetime stamp on each record, and we have a need to select all rows created for a date range. Secondary index is not an option as its high cardinality and could slow performance doing a full scan on 70M rows.



We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this real time without impacting our customers. COPY is not an option, as we have ttl's on each row on table1 that need to be applied to table2 as well.



So what would be the best approach

  1.  To be able select data using date range without impacting performance. This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to another within the same cluster.



Any other suggestions also will be greatly appreciated.



Stefan Miklosovic

Stefan Miklosovic

________________________________


Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Leena Ghatpande <lg...@hotmail.com>.
Our data model cannot be like the one you have recommended below, as the majority of the reads need to select the data by the partition key (id) only, not by date.
You could remodel your data in such way that you would make primary key like this
((date), hour-minute, id)
or
((date, hour-minute), id)


By adding the date as a clustering column, yes, the idea was to use ALLOW FILTERING on the date and pull the records. We understand that this is not recommended, but we have been doing it on another existing large table and have not run into any issues so far. But we want to understand whether there is a better approach to this?

Thanks

________________________________
From: Stefan Miklosovic <st...@instaclustr.com>
Sent: Monday, March 11, 2019 7:12 PM
To: user@cassandra.apache.org
Subject: Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

The query which does not work should be like this, I made a mistake there

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <st...@instaclustr.com>> wrote:
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?"

If you want to select by some time range here, I am wondering how would making datetime a clustering column help you here? You still have to provide primary key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date > yesterday and my_date < tomorrow (you got the idea)

If you make my_date clustering column, you cant not do this below, because you still have to specify partition key fully and then clustering key (optionally) where you can further order and do ranges. But you cant do a query without specifying partition key. Well, you can use ALLOW FILTERING but you do not want to do this at all in your situation as it would scan everything.

select * from your_keyspace.your_table where my_date > yesterday and my_date < tomorrow

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY ((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id = '6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for "id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id = 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such way that you would make primary key like this

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one because if you expect a lot of data per day, they would all end up on same set of replicas as hash of partition key would be same whole day if you have same date all day so I think you would end up with hotspots. You want to have your data spread more evenly so the second one seems to be better to me.

You can also investigate how to do this with materialized view but I am not sure about the performance here.

If you want to copy data you can do this e.g. by Cassandra Spark connector, you would just read table and as you read it you would write to another one. That is imho the fastest approach and the least error prone. You can do that on live production data and you can just make a "switch" afterwards. Not sure about ttls but that should be transparent while copying that.

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>> wrote:

We have a table with over 70M rows with a partition key that is unique.  We have a  created datetime stamp on each record, and we have a need to select all rows created for a date range. Secondary index is not an option as its high cardinality and could slow performance doing a full scan on 70M rows.


We are thinking of creating a new table with a date field as a clustering column to be able to query for date ranges, but partition key to clustering key will be 1-1. Is this a good approach?

To do this, we need to copy this large volume of data from table1 to table2 within the same cluster, while updates are still happening to table1. We need to do this real time without impacting our customers. COPY is not an option, as we have ttl's on each row on table1 that need to be applied to table2 as well.


So what would be the best approach

  1.  To be able select data using date range without impacting performance. This operation will be needed only on adhoc basis and it wont be as frequent .
  2.  Best way to migrate large volume of data with ttl from one table to another within the same cluster.


Any other suggestions also will be greatly appreciated.



Stefan Miklosovic

Stefan Miklosovic

Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Stefan Miklosovic <st...@instaclustr.com>.
The query which does not work should be like this; I made a mistake there:

cqlsh> SELECT * from my_keyspace.my_table where  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data filtering and
thus may have unpredictable performance. If you want to execute this query
despite the performance unpredictability, use ALLOW FILTERING"


On Tue, 12 Mar 2019 at 10:10, Stefan Miklosovic <
stefan.miklosovic@instaclustr.com> wrote:

> Hi Leena,
>
> "We are thinking of creating a new table with a date field as a
> clustering column to be able to query for date ranges, but partition key to
> clustering key will be 1-1. Is this a good approach?"
>
> If you want to select by some time range here, I am wondering how would
> making datetime a clustering column help you here? You still have to
> provide primary key, right?
>
> E.g. select * from your_keyspace.your_table where id=123 and my_date >
> yesterday and my_date < tomorrow (you got the idea)
>

Stefan Miklosovic

Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

Posted by Stefan Miklosovic <st...@instaclustr.com>.
Hi Leena,

"We are thinking of creating a new table with a date field as a clustering
column to be able to query for date ranges, but partition key to clustering
key will be 1-1. Is this a good approach?"

If you want to select by some time range here, I am wondering how making
the datetime a clustering column would help you. You still have to
provide the partition key, right?

E.g. select * from your_keyspace.your_table where id=123 and my_date >
yesterday and my_date < tomorrow (you get the idea)

If you make my_date a clustering column, you cannot do the query below,
because you still have to specify the partition key fully and only then
the clustering key (optionally), where you can further order and do
ranges. You cannot run a query without specifying the partition key.
Well, you can use ALLOW FILTERING, but you do not want to do that at all
in your situation, as it would scan everything:

select * from your_keyspace.your_table where my_date > yesterday and
my_date < tomorrow

cqlsh> create KEYSPACE my_keyspace WITH replication = {'class':
'NetworkTopologyStrategy', 'dc1': '1'};
cqlsh> CREATE TABLE my_keyspace.my_table (id uuid, number int, PRIMARY KEY
((id), number));

cqlsh> SELECT * from my_keyspace.my_table ;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3
 abdc0184-a695-427d-b63b-57cdf7a45f00 |      1
 90fe112e-0f74-4cbc-8767-67bdc9c8c3b0 |      4
 8cff3eb7-1aff-4dc7-9969-60190c7e4675 |      2

cqlsh> SELECT * from my_keyspace.my_table where id =
'6e23f79a-8b67-47e0-b8e0-50be78bb1c7f' and  number > 2;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid STRING constant (6e23f79a-8b67-47e0-b8e0-50be78bb1c7f) for
"id" of type uuid"

cqlsh> SELECT * from my_keyspace.my_table where id =
6e23f79a-8b67-47e0-b8e0-50be78bb1c7f and  number > 2;

 id                                   | number
--------------------------------------+--------
 6e23f79a-8b67-47e0-b8e0-50be78bb1c7f |      3

You could remodel your data in such a way that you make the primary key
look like this:

((date), hour-minute, id)

or

((date, hour-minute), id)

I would prefer the second one. If you expect a lot of data per day, then
with the first model it would all end up on the same set of replicas,
because the hash of the partition key would be the same for the whole
day, so I think you would end up with hotspots. You want to have your
data spread more evenly, so the second one seems better to me.
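
For illustration only, a minimal sketch of what the second model could
look like (the column names and the hour-minute text bucket are my
assumptions, not your real schema):

CREATE TABLE my_keyspace.my_table_by_date (
    day         date,
    hour_minute text,   -- e.g. '15:30'; any bucketing convention works
    id          uuid,
    -- ... plus the other columns carried over from the original table
    PRIMARY KEY ((day, hour_minute), id)
);

-- the application issues one such query per bucket in the requested
-- date range (or runs them in parallel):
SELECT * FROM my_keyspace.my_table_by_date
WHERE day = '2019-03-12' AND hour_minute = '15:30';

How fine the bucket is (whole day, hour, hour-minute) is a trade-off
between partition size and how many queries an ad-hoc date range fans
out into.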

You can also investigate how to do this with a materialized view, but I
am not sure about the performance there.
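
If you do look into that, a sketch of what the view could look like,
assuming (my assumption, not your schema) the current table is
my_keyspace.table1 with partition key id and a separate created_date
column of type date (a view cannot derive the day from a timestamp).
Note it is partitioned by the whole date, so it shares the hotspot and
per-day-query limitations of the first model above:

CREATE MATERIALIZED VIEW my_keyspace.table1_by_created_date AS
    SELECT * FROM my_keyspace.table1
    WHERE created_date IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY ((created_date), id);

-- one partition per day, equality only (no range across days):
SELECT * FROM my_keyspace.table1_by_created_date
WHERE created_date = '2019-03-12';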

If you want to copy the data, you can do this e.g. with the Cassandra
Spark connector: you would just read the first table and, as you read
it, write to the other one. That is imho the fastest and least
error-prone approach. You can do that on live production data and just
make a "switch" afterwards. The TTLs are not carried over automatically,
but the connector can read them and re-apply them per row while copying.
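
A rough sketch of that per-row TTL/timestamp handling with the Spark
Cassandra connector (Scala; keyspace, table and column names are
placeholders, and "data" stands in for one regular column of table1):

// Sketch only: my_keyspace, table1, table2, data, created_at are
// placeholders for the real schema.
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, TimestampOption, WriteConf}
import org.apache.spark.{SparkConf, SparkContext}

object MigrateTable1ToTable2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("table1-to-table2")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
    val sc = new SparkContext(conf)

    // Read table1 together with the remaining TTL and the write timestamp
    // of one of its regular columns.
    val rows = sc.cassandraTable("my_keyspace", "table1")
      .select("id", "data", "created_at",
        "data".ttl as "data_ttl",
        "data".writeTime as "data_wt")

    // Write into table2, re-applying the original TTL and timestamp per
    // row, so copied rows expire when they would have expired in table1.
    rows.saveToCassandra("my_keyspace", "table2",
      SomeColumns("id", "data", "created_at"),
      writeConf = WriteConf(
        ttl = TTLOption.perRow("data_ttl"),
        timestamp = TimestampOption.perRow("data_wt")))

    sc.stop()
  }
}

Columns written at different times or with different TTLs, or rows with
no TTL at all, may need extra handling, so it is worth spot-checking a
few migrated rows with ttl() and writetime() in table2 before making the
switch.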

On Tue, 12 Mar 2019 at 03:04, Leena Ghatpande <lg...@hotmail.com>
wrote:

> We have a table with over 70M rows with a partition key that is unique.  We
> have a  created datetime stamp on each record, and we have a need to
> select all rows created for a date range. Secondary index is not an option
> as its high cardinality and could slow performance doing a full scan on 70M
> rows.
>
>
> We are thinking of creating a new table with a date field as a clustering
> column to be able to query for date ranges, but partition key to clustering
> key will be 1-1. Is this a good approach?
>
> To do this, we need to copy this large volume of data from table1 to
> table2 within the same cluster, while updates are still happening to
> table1. We need to do this real time without impacting our customers. COPY
> is not an option, as we have ttl's on each row on table1 that need to be
> applied to table2 as well.
>
>
> So what would be the best approach
>
>    1. To be able select data using date range without impacting
>    performance. This operation will be needed only on adhoc basis and it wont
>    be as frequent .
>    2. Best way to migrate large volume of data with ttl from one table to
>    another within the same cluster.
>
>
> Any other suggestions also will be greatly appreciated.
>
>
>

Stefan Miklosovic