Posted to user@cassandra.apache.org by Tobias Eriksson <to...@qvantel.com> on 2017/04/26 19:49:33 UTC

How can I efficiently export the content of my table to KAFKA

Hi
I would like to make a dump of the database, in JSON format, to KAFKA
The database contains a lot of data: millions, and in some cases billions, of “rows”
I will provide the customer with an export of the data, which they can read off of a KAFKA topic

My thinking was to make it scalable by distributing the token ranges of all available partition-keys across a number of (N) processes (JSON-Producers)
First, one process reads through the available tokens and publishes them on a KAFKA “Coordinator” Topic
Then I can create 1, 10, 20 or N processes that act as Producers to the real KAFKA topic and pick available tokens/partition-keys off of the “Coordinator” Topic,
one by one, until all the “rows” have been processed.
So a JSON-Producer will take e.g. a range of 1000 “rows”, convert them into my own JSON format and post them to KAFKA
Then it takes another 1000 “rows”, and another 1000 “rows”, and so on, until it is done (a rough sketch of such a worker follows below).

I base my idea on how I believe the Apache Spark Connector accomplishes data locality, i.e. by being aware of where tokens reside. Since that is possible, I figured it should also be possible to create a job list in a KAFKA topic, have each Producer pick jobs from there, read the data from Cassandra based on the partition key (token), and then post the JSON on the export KAFKA topic.
https://dzone.com/articles/data-locality-w-cassandra-how
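
To make the idea concrete, here is a minimal sketch (in Scala) of what one of the N JSON-Producer workers could look like. Everything in it is an assumption for illustration only: the coordinator topic is called "coordinator" and carries token ranges encoded as "start:end" strings, the table is ks.individuals with partition key id, the export topic is "individuals-json", and the host names are placeholders. It shows the mechanics only, with no error handling and ignoring the wrap-around of the ring's last token range.

import java.util.{Collections, Properties}
import com.datastax.driver.core.Cluster
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.collection.JavaConverters._

object TokenRangeExporter {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
    val session = cluster.connect()
    // Cassandra 2.2+ can render rows as JSON on the server side via SELECT JSON.
    val select = session.prepare(
      "SELECT JSON * FROM ks.individuals WHERE token(id) > ? AND token(id) <= ?")

    // Consumer for the "coordinator" topic that holds the token-range jobs.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "kafka-host:9092")
    consumerProps.put("group.id", "json-producers")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val coordinator = new KafkaConsumer[String, String](consumerProps)
    coordinator.subscribe(Collections.singletonList("coordinator"))

    // Producer for the actual export topic.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "kafka-host:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)

    while (true) {
      // Each record on the coordinator topic describes one token range to export.
      for (record <- coordinator.poll(1000L).asScala) {
        val Array(start, end) = record.value().split(":").map(_.toLong)
        val rows = session.execute(select.bind(
          java.lang.Long.valueOf(start), java.lang.Long.valueOf(end)))
        for (row <- rows.asScala) {
          // Column 0 of a SELECT JSON result is the whole row as a JSON string.
          producer.send(new ProducerRecord[String, String]("individuals-json", row.getString(0)))
        }
      }
    }
  }
}

Because all workers share the same consumer group, KAFKA itself spreads the token-range jobs across however many worker processes are started (the coordinator topic then needs at least as many partitions as workers).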


Would you consider this a good idea?
Or is there in fact a better approach, and if so, what would that be?

-Tobias


Re: How can I efficiently export the content of my table to KAFKA

Posted by Justin Cameron <ju...@instaclustr.com>.
You can run multiple applications in parallel in Standalone mode - you just
need to configure Spark to allocate resources between your jobs the way you
want (by default it assigns all resources to the first application you run,
so they won't be freed up until it has finished).

You can use Spark's web UI to check the resources that are available and
those allocated to each job. See
http://spark.apache.org/docs/latest/job-scheduling.html for more details.
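
For example (just a sketch; spark.cores.max and spark.executor.memory are standard Spark settings, the application name and the values are placeholders), each export application can be capped so it leaves room for the others:

import org.apache.spark.sql.SparkSession

object CappedExportApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("export-individuals")
      .config("spark.cores.max", "8")        // cap the total cores this application may take
      .config("spark.executor.memory", "4g") // memory per executor
      .getOrCreate()
    // ... run the export of one table here ...
    spark.stop()                             // frees the cores for the next application
  }
}

The same values can also be passed on the command line via spark-submit's --conf option instead of being hard-coded.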


Re: How can I efficiently export the content of my table to KAFKA

Posted by Tobias Eriksson <to...@qvantel.com>.
Well, I have been working a bit with Spark, and the biggest hurdle is that Spark does not allow me to run multiple jobs in parallel
i.e. once I start the job that processes the “Individuals” table, I have to wait until all that processing is done before I can start an additional one
so I will need to start additional jobs on demand for “Addresses”, “Invoices”, … and so on
I know I could increase the number of Workers/Executors and use Mesos for the scheduling and resource management, but we have so far not been able to get it dynamic/flexible enough
Although I admit that this could still be a way forward; we have not evaluated it 100% yet, so I have not completely given up on that thought

-Tobias



Re: How can I efficiently export the content of my table to KAFKA

Posted by Justin Cameron <ju...@instaclustr.com>.
You could probably save yourself a lot of hassle by just writing a Spark
job that scans through the entire table, converts each row to JSON and
dumps the output into a Kafka topic. It should be fairly straightforward to
implement.

Spark will manage the partitioning of "Producer" processes for you - no
need for a "Coordinator" topic.
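
A rough sketch of such a job (Scala, using the spark-cassandra-connector's Spark SQL data source; the keyspace/table ks.individuals, the topic "individuals-json" and the host names are placeholders, not anything from this thread):

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TableToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-table-to-kafka")
      .config("spark.cassandra.connection.host", "cassandra-host")
      .getOrCreate()

    // The connector splits the scan into token-range partitions and schedules
    // them near the replicas, which is what makes an external coordinator
    // topic unnecessary.
    val json = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "individuals"))
      .load()
      .toJSON          // one JSON string per row

    json.rdd.foreachPartition { rows =>
      // One Kafka producer per Spark partition; the executors run these in parallel.
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-host:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      rows.foreach(r => producer.send(new ProducerRecord[String, String]("individuals-json", r)))
      producer.close()
    }

    spark.stop()
  }
}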


Re: How can I efficiently export the content of my table to KAFKA

Posted by Tobias Eriksson <to...@qvantel.com>.
Hi Chris,
Well, that seemed like a good idea at first; I would like to read from Cassandra and post to KAFKA
But the KAFKA Connect Cassandra Source connector requires that the table has a time-series ordering, and not all of my tables do
So thanks for the tip, but it did not work ☹
-Tobias

From: Chris Stromberger <ch...@gmail.com>
Date: Thursday, 27 April 2017 at 15:50
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: How can I efficiently export the content of my table to KAFKA

Maybe https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/






Re: How can I efficiently export the content of my table to KAFKA

Posted by Chris Stromberger <ch...@gmail.com>.
Maybe
https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/


