Posted to user@cassandra.apache.org by "cko2223@gmail.com" <ck...@gmail.com> on 2012/12/14 04:19:15 UTC

ETL Tools to transfer data from Cassandra into other relational databases

We will use Cassandra as logging storage in one of our web applications. The application only inserts rows into Cassandra and never updates or deletes any rows. The CF is expected to grow by about 0.5 million rows per day.
 
We need to transfer the data in Cassandra to another relational database daily. Due to the large size of the CF, instead of truncating the relational table and reloading all rows into it each time, we plan to run a job to select the "delta" rows since the last run and insert them into the relational database.
 
We know we can use Java, Pig or Hive to extract the delta rows to a flat file and load the data into the target relational table. We are particularly interested in a process that can extract delta rows without scanning the entire CF.
 
Has anyone used any other ETL tools to do this kind of delta extraction from Cassandra? We appreciate any comments and experience.
 
Thanks,
Chin

Re: ETL Tools to transfer data from Cassandra into other relational databases

Posted by Milind Parikh <mi...@gmail.com>.
Why would you use Cassandra as the primary store for logging information?
Have you considered Kafka?

You could, of course, then fan out the logs to both Cassandra (on a near
real-time basis) and, on a daily basis (if you wish), extract the
"deltas" from Kafka into an RDBMS, with no Pig/Hive etc.
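
The daily "delta" extraction suggested above amounts to remembering how far into an append-only log the previous run got. Below is a minimal sketch of that idea, not a Kafka client: a plain Python list stands in for the log, SQLite (from the standard library) stands in for the RDBMS, and the names load_delta, app_log and checkpoint are hypothetical.

```python
import sqlite3

def load_delta(log, conn, checkpoint_name="daily_load"):
    """Copy log entries past the stored offset into SQLite; return the count loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_log "
        "(log_offset INTEGER PRIMARY KEY, message TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint "
        "(name TEXT PRIMARY KEY, next_offset INTEGER)"
    )
    row = conn.execute(
        "SELECT next_offset FROM checkpoint WHERE name = ?", (checkpoint_name,)
    ).fetchone()
    start = row[0] if row else 0  # first run starts at the beginning of the log

    # Only entries at or after the stored offset are read: this is the delta.
    delta = list(enumerate(log[start:], start=start))

    with conn:  # one transaction: the rows and the new checkpoint commit together
        conn.executemany(
            "INSERT INTO app_log (log_offset, message) VALUES (?, ?)", delta
        )
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (name, next_offset) VALUES (?, ?)",
            (checkpoint_name, start + len(delta)),
        )
    return len(delta)

# In-memory stand-ins (assumptions): a list for the log, SQLite for the RDBMS.
log = ["login alice", "view /home", "logout alice"]
conn = sqlite3.connect(":memory:")

first_run = load_delta(log, conn)   # loads everything seen so far
log.append("login bob")             # the application keeps appending
second_run = load_delta(log, conn)  # loads only the new entry
total_rows = conn.execute("SELECT COUNT(*) FROM app_log").fetchone()[0]
```

Storing the checkpoint in the same transaction as the loaded rows means a failed run can simply be retried without loading rows twice.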


Regards
Milind






Re: ETL Tools to transfer data from Cassandra into other relational databases

Posted by Шамим <sr...@yandex.ru>.
Hello Chin,
 You can extract the delta using a Pig script and save it in another CF in Cassandra. Using Pentaho Kettle, you can then load the data from that CF into the RDBMS. Pentaho Kettle is an open source project. You can automate the whole process with Azkaban or Oozie.
Kafka is also an alternative, as mentioned above.
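
The extract, stage, load flow described above can be sketched in a few lines. This is only an illustration in plain Python, not Pig or Kettle: the filter plays the role of the Pig script, a CSV string plays the role of the staging area, and SQLite stands in for the RDBMS. The row layout (key, ts, message) and all function names are assumptions. Note that the filter here still looks at every input row; it shows the watermark logic, not a way to avoid the scan.

```python
import csv
import io
import sqlite3

def extract_delta(rows, last_run_ts):
    """Keep only rows written strictly after the previous run's watermark."""
    return [r for r in rows if r["ts"] > last_run_ts]

def stage_as_csv(rows):
    """Write the delta to CSV text (stand-in for a staging CF or flat file)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for r in rows:
        writer.writerow([r["key"], r["ts"], r["message"]])
    return buf.getvalue()

def load_csv(conn, csv_text):
    """Load staged CSV rows into the target table; return the number loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_log "
        "(row_key TEXT PRIMARY KEY, ts INTEGER, message TEXT)"
    )
    records = [
        (key, int(ts), message)
        for key, ts, message in csv.reader(io.StringIO(csv_text))
    ]
    with conn:  # commit all staged rows in one transaction
        conn.executemany(
            "INSERT INTO app_log (row_key, ts, message) VALUES (?, ?, ?)", records
        )
    return len(records)

# Hypothetical sample data; ts is an arbitrary increasing write timestamp.
rows = [
    {"key": "r1", "ts": 100, "message": "login"},
    {"key": "r2", "ts": 200, "message": "view"},
    {"key": "r3", "ts": 300, "message": "logout"},
]
last_run_ts = 100  # watermark saved by the previous run

delta = extract_delta(rows, last_run_ts)
staged = stage_as_csv(delta)
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, staged)
new_watermark = max(r["ts"] for r in delta)  # saved for the next run
loaded_keys = [k for (k,) in conn.execute("SELECT row_key FROM app_log ORDER BY row_key")]
```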
Regards
  Shamim 


Re: ETL Tools to transfer data from Cassandra into other relational databases

Posted by Carlos <ca...@gmail.com>.
I wrote an ETL tool for Cassandra that scans the binary commit log of each
node, extracts the keys that have received inserts, and filters them by
column timestamp to select only the mutations from the last X minutes. It
then issues a multiget to Cassandra to fetch the freshest version of those
rows (if you can, or want to, partially update rows in your target DB, you
could skip this step).

With this data it can then connect to various databases (PostgreSQL hstore,
Oracle and plain CSV in our case) and issue the appropriate "upsert"
calls. It is also parallel and quite fast (150,000 filtered rows in less
than 1 minute; insert speed depends on your target DB). We use it in
production to provide real-time join/search capabilities with PostgreSQL,
with a delay of only 1 to 5 minutes.
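
For readers who want to see the shape of that pipeline, here is a minimal sketch of the three steps (filter mutations by timestamp, fetch the freshest rows, upsert). It is not the tool's actual code and does not parse a real commit log: the mutation list, the dict used in place of Cassandra, the SQLite target and every function name are assumptions made for illustration. Timestamps are taken to be microseconds since some epoch.

```python
import sqlite3

MINUTE_US = 60 * 1_000_000  # one minute in microseconds

def recent_keys(mutations, now_us, window_minutes):
    """Return the distinct keys mutated within the last window_minutes."""
    cutoff = now_us - window_minutes * MINUTE_US
    return sorted({key for key, ts_us in mutations if ts_us >= cutoff})

def multiget(store, keys):
    """Fetch the current version of each key (stand-in for a Cassandra multiget)."""
    return {k: store[k] for k in keys if k in store}

def upsert(conn, rows):
    """Insert new rows and overwrite existing ones in the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target (row_key TEXT PRIMARY KEY, payload TEXT)"
    )
    with conn:  # commit the whole batch together
        conn.executemany(
            "INSERT OR REPLACE INTO target (row_key, payload) VALUES (?, ?)",
            sorted(rows.items()),
        )

# Hypothetical inputs: (key, timestamp) pairs as they might come out of a log scan.
now_us = 100 * MINUTE_US
mutations = [
    ("a", 10 * MINUTE_US),  # too old, filtered out
    ("b", 97 * MINUTE_US),
    ("b", 99 * MINUTE_US),  # same key twice, fetched only once
    ("c", 98 * MINUTE_US),
]
store = {"a": "old", "b": "b-v2", "c": "c-v1"}  # stand-in for the live cluster

conn = sqlite3.connect(":memory:")
upsert(conn, {"b": "b-v1"})  # the target already holds a stale copy of "b"

keys = recent_keys(mutations, now_us, window_minutes=5)
upsert(conn, multiget(store, keys))
result = conn.execute(
    "SELECT row_key, payload FROM target ORDER BY row_key"
).fetchall()
```

Deduplicating keys before the fetch keeps the multiget small when the same row is written many times inside the window.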

I was able to open source an early, rough version on my GitHub:

https://github.com/carloscm/cassandra-commitlog-extract

It needs some work to make it into a proper Java project; feel free to fork
and play with it.
