You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Richard Xin <ri...@yahoo.com.INVALID> on 2018/10/01 23:17:58 UTC

Cassandra loading data from another table

I have a tableA with about a few ten millions record, and I have tableB with a few thousands record,TableA and TableB have exact same schema (except that tableB doesnt have TTL)
I want to load all data to tableB from tableA EXCEPT for those already on tableB (we don't want data on tableB to be overwritten)
What's the best to way accomplish this?  
Thanks,

Re: Cassandra loading data from another table

Posted by Christophe Schmitz <ch...@instaclustr.com>.
Have a look at using Spark on Cassandra. It's commonly used for data
movement / data migration / reconciliation (on top of analytics). You will
get much better performances.

Christophe Schmitz - Instaclustr <https://www.instaclustr.com/> - Cassandra
| Kafka | Spark Consulting





On Tue, 2 Oct 2018 at 09:58 Richard Xin <ri...@yahoo.com.invalid>
wrote:

> Christophe, thanks for your insights,
> Sorry, I forgot to mention that currently both tableA and tableB are being
> updated by application (all newly inserted/updated records should be
> identical on A and B), exporting from tableB and COPY it back later on will
> result in older data overwrites newly updated data.
>
> I can only thinking about using COPY tableA to a csv, and then iterate the
> csv line by line to insert to tableB using "if not exists" clause to avoid
> down-time , but it's error-prone and slow. Not sure whether there is a
> better way.
> Best,
> Richard
>
> On Monday, October 1, 2018, 4:34:38 PM PDT, Christophe Schmitz <
> christophe@instaclustr.com> wrote:
>
>
> Hi Richard,
>
> You could consider exporting your few thousands record of Table B in a
> file, with *COPY TO*. Then *TRUNCATE* Table B, copy the SSTable files of
> TableA to the data directory of Table A (make sure you *flush* the
> memtables first), then run nodetool *refresh*. Final step is to load the
> few thousands record on Table B with *COPY FROM*. This will overwrite the
> data you loaded from the SSTables of Table A.
> Overall, there is no downtime on your cluster, there is no downtime on
> Table A, yet you need to think about the consequences on Table B if your
> application is writing on Table A or Table B during this process.
> Please test first :)
>
> Cheers,
> Christophe
>
> Christophe Schmitz - Instaclustr <https://www.instaclustr.com/> -
> Cassandra | Kafka | Spark Consulting
>
>
>
>
> On Tue, 2 Oct 2018 at 09:18 Richard Xin <ri...@yahoo.com.invalid>
> wrote:
>
> I have a tableA with about a few ten millions record, and I have tableB
> with a few thousands record,
> TableA and TableB have exact same schema (except that tableB doesnt have
> TTL)
>
> I want to load all data to tableB from tableA EXCEPT for those already on
> tableB (we don't want data on tableB to be overwritten)
>
> What's the best to way accomplish this?
>
> Thanks,
>
>

Re: Cassandra loading data from another table

Posted by Richard Xin <ri...@yahoo.com.INVALID>.
  Christophe, thanks for your insights,Sorry, I forgot to mention that currently both tableA and tableB are being updated by application (all newly inserted/updated records should be identical on A and B), exporting from tableB and COPY it back later on will result in older data overwrites newly updated data.
I can only thinking about using COPY tableA to a csv, and then iterate the csv line by line to insert to tableB using "if not exists" clause to avoid down-time , but it's error-prone and slow. Not sure whether there is a better way. Best,Richard
    On Monday, October 1, 2018, 4:34:38 PM PDT, Christophe Schmitz <ch...@instaclustr.com> wrote:  
 
 Hi Richard,
You could consider exporting your few thousands record of Table B in a file, with COPY TO. Then TRUNCATE Table B, copy the SSTable files of TableA to the data directory of Table A (make sure you flush the memtables first), then run nodetool refresh. Final step is to load the few thousands record on Table B with COPY FROM. This will overwrite the data you loaded from the SSTables of Table A.Overall, there is no downtime on your cluster, there is no downtime on Table A, yet you need to think about the consequences on Table B if your application is writing on Table A or Table B during this process.Please test first :)
Cheers,Christophe

Christophe Schmitz - Instaclustr - Cassandra | Kafka | Spark Consulting




On Tue, 2 Oct 2018 at 09:18 Richard Xin <ri...@yahoo.com.invalid> wrote:

I have a tableA with about a few ten millions record, and I have tableB with a few thousands record,TableA and TableB have exact same schema (except that tableB doesnt have TTL)
I want to load all data to tableB from tableA EXCEPT for those already on tableB (we don't want data on tableB to be overwritten)
What's the best to way accomplish this?  
Thanks,
  

Re: Cassandra loading data from another table

Posted by Christophe Schmitz <ch...@instaclustr.com>.
Hi Richard,

You could consider exporting your few thousands record of Table B in a
file, with *COPY TO*. Then *TRUNCATE* Table B, copy the SSTable files of
TableA to the data directory of Table A (make sure you *flush* the
memtables first), then run nodetool *refresh*. Final step is to load the
few thousands record on Table B with *COPY FROM*. This will overwrite the
data you loaded from the SSTables of Table A.
Overall, there is no downtime on your cluster, there is no downtime on
Table A, yet you need to think about the consequences on Table B if your
application is writing on Table A or Table B during this process.
Please test first :)

Cheers,
Christophe

Christophe Schmitz - Instaclustr <https://www.instaclustr.com/> - Cassandra
| Kafka | Spark Consulting




On Tue, 2 Oct 2018 at 09:18 Richard Xin <ri...@yahoo.com.invalid>
wrote:

> I have a tableA with about a few ten millions record, and I have tableB
> with a few thousands record,
> TableA and TableB have exact same schema (except that tableB doesnt have
> TTL)
>
> I want to load all data to tableB from tableA EXCEPT for those already on
> tableB (we don't want data on tableB to be overwritten)
>
> What's the best to way accomplish this?
>
> Thanks,
>