Posted to user@spark.apache.org by ayan guha <gu...@gmail.com> on 2015/07/05 01:58:51 UTC

JDBC Streams

Hi All

I have a requirement to connect to a DB every few minutes and bring data to
HBase. Can anyone suggest if Spark Streaming would be appropriate for this
scenario, or should I look into jobserver?

Thanks in advance

-- 
Best Regards,
Ayan Guha

Re: JDBC Streams

Posted by Cody Koeninger <co...@koeninger.org>.
Yes

On Wed, Aug 26, 2015 at 10:23 AM, Chen Song <ch...@gmail.com> wrote:

> Thanks Cody.
>
> Are you suggesting putting the cache in a global context in each executor
> JVM, in a Scala object for example, and then having a scheduled task refresh
> the cache (or having it refreshed by Guava's expiry)?
>
> Chen

Re: JDBC Streams

Posted by Chen Song <ch...@gmail.com>.
Thanks Cody.

Are you suggesting putting the cache in a global context in each executor JVM,
in a Scala object for example, and then having a scheduled task refresh the
cache (or having it refreshed by Guava's expiry)?

Chen

-- 
Chen Song
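[A minimal sketch of the pattern Chen describes and Cody confirms below: each
executor JVM lazily initialises a singleton Scala object whose Guava supplier
re-reads MySQL only when the cached copy expires. The connection string,
credentials, table, and schema are all placeholders:]

import java.sql.DriverManager
import java.util.concurrent.TimeUnit
import com.google.common.base.{Supplier, Suppliers}

// One copy per executor JVM. Guava re-runs the supplier at most once per
// hour; every other call returns the cached map, so executors do not hit
// MySQL on every batch.
object ReferenceData {
  private val cached: Supplier[Map[String, String]] =
    Suppliers.memoizeWithExpiration(
      new Supplier[Map[String, String]] {
        override def get(): Map[String, String] = {
          val conn = DriverManager.getConnection(
            "jdbc:mysql://db-host:3306/refdb", "etl", "secret")  // placeholder DSN
          try {
            val rs = conn.createStatement()
              .executeQuery("SELECT k, v FROM reference_table")  // placeholder schema
            val m = Map.newBuilder[String, String]
            while (rs.next()) m += rs.getString("k") -> rs.getString("v")
            m.result()
          } finally conn.close()
        }
      },
      1, TimeUnit.HOURS)

  def lookup(key: String): Option[String] = cached.get().get(key)
}

[Tasks then call ReferenceData.lookup(...) inside map or mapPartitions; the
object is initialised independently in each executor JVM on first use, which
is exactly the "global context" Chen asks about.]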

Re: JDBC Streams

Posted by Cody Koeninger <co...@koeninger.org>.
If your data only changes every few days, why not restart the job every few
days, and just broadcast the data?

Or you can keep a local per-JVM cache with an expiry (e.g. a Guava cache) to
avoid many MySQL reads.
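[A sketch of the first option, under the assumption that an external scheduler
restarts the job every few days; loadFromMySql and keyOf are hypothetical
helpers standing in for whatever JDBC read builds the reference map and
however messages are keyed:]

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Loaded once at startup and broadcast to every executor; the periodic
// restart is what picks up new reference data.
val ssc = new StreamingContext(new SparkConf().setAppName("enrich"), Seconds(60))
val refBc = ssc.sparkContext.broadcast(loadFromMySql())  // hypothetical JDBC read

// Executors read the broadcast copy; MySQL is touched only at startup,
// e.g. stream.map(msg => (msg, refBc.value.get(keyOf(msg))))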


Re: JDBC Streams

Posted by Chen Song <ch...@gmail.com>.
Piggybacking on this question:

I have a similar use case, but a bit different. My job consumes a stream
from Kafka, and I need to join the Kafka stream with a reference table from
MySQL (a kind of data validation and enrichment). I need to process this
stream every 1 min. The data in MySQL does not change very often, maybe once
every few days.

So my requirements are:

* I cannot easily use a broadcast variable, because the data does change,
although not very often.
* I am not sure it is good practice to read data from MySQL in every batch
(in my case, every 1 min).

If anyone has done this before, any suggestions and feedback are appreciated.

Chen



-- 
Chen Song
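[A middle ground sometimes used for exactly this situation is to keep the
broadcast but refresh it from the driver when it goes stale, instead of
restarting the job. A hedged sketch follows; loadFromMySql and keyOf are
hypothetical helpers, and the 6-hour TTL is arbitrary:]

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object RefTableHolder {
  private val ttlMs = 6 * 60 * 60 * 1000L  // refresh at most every 6 hours
  @volatile private var bc: Broadcast[Map[String, String]] = null
  @volatile private var loadedAt = 0L

  // Called on the driver; swaps in a fresh broadcast once the old one ages out.
  def get(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
    if (bc == null || System.currentTimeMillis() - loadedAt > ttlMs) {
      if (bc != null) bc.unpersist(blocking = false)
      bc = sc.broadcast(loadFromMySql())  // same hypothetical JDBC read as above
      loadedAt = System.currentTimeMillis()
    }
    bc
  }
}

// transform's body runs on the driver once per batch, so the staleness check
// is cheap and a new broadcast can be swapped in between batches.
kafkaStream.transform { rdd =>
  val table = RefTableHolder.get(rdd.sparkContext)
  rdd.map(msg => (msg, table.value.get(keyOf(msg))))
}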

RE: JDBC Streams

Posted by Ashic Mahtab <as...@live.com>.
If it is indeed a reactive use case, then Spark Streaming would be a good choice.

One approach worth considering: is it possible to receive a message via kafka (or some other queue)? That'd not need any polling, and you could use standard consumers. If polling isn't an issue, then writing a custom receiver will work fine. The way a receiver works is this:

* Your receiver has a receive() function, where you'd typically start a loop. In your loop, you'd fetch items, and call store(entry).
* You control everything in the receiver. If you're listening on a queue, you receive messages, store() them, and ack your queue. If you're polling, it's up to you to ensure delays between db calls.
* The things you store() go on to make up the RDDs in your DStream. So intervals, windowing, etc. apply to those. The receiver is the boundary between your data source and the DStream RDDs. In other words, if your interval is 15 seconds with no windowing, then the things that went to store() every 15 seconds are bunched up into an RDD of your DStream. That's kind of a simplification, but it should give you the idea that your "db polling" interval and streaming interval are not tied together.

-Ashic.
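[For reference, the actual Receiver API differs slightly from the description
above: the loop lives in a thread started from onStart() rather than a literal
receive() method. A minimal polling JDBC receiver might look like this sketch;
the URL, query, and poll interval are placeholders:]

import java.sql.DriverManager
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JdbcPollingReceiver(url: String, query: String, pollMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // The receiver owns its own thread; Spark only asks it to start and stop.
    new Thread("jdbc-poller") {
      override def run(): Unit = {
        val conn = DriverManager.getConnection(url)
        try {
          while (!isStopped()) {
            val rs = conn.createStatement().executeQuery(query)
            while (rs.next()) store(rs.getString(1)) // each row becomes an element of the DStream
            Thread.sleep(pollMs)                     // the polling delay is entirely the receiver's business
          }
        } finally conn.close()
      }
    }.start()
  }

  override def onStop(): Unit = () // the isStopped() check ends the loop
}

[It would be registered with ssc.receiverStream(new JdbcPollingReceiver(...)),
and batching/windowing then apply to whatever was store()d, exactly as
described above.]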


Re: JDBC Streams

Posted by ayan guha <gu...@gmail.com>.
Hi

Thanks for the reply. Here is my situation: I have a DB which enables
synchronous CDC; think of this as a DB trigger which writes to a table with
"changed" values as soon as something changes in the production table. My job
will need to pick up the data "as soon as it arrives", which can be every 1
min interval. Ideally it will pick up the changes, transform them into JSON,
and put them to Kinesis. In short, I am emulating a Kinesis producer with a
DB source (don't even ask why, let's say these are the constraints :) )

Please advise: (a) is Spark a good choice here, and (b) what's your
suggestion either way?

I understand I can easily do it using a simple Java/Python app, but I am a
little worried about managing scaling/fault tolerance, and that's where my
concern is.

TIA
Ayan


-- 
Best Regards,
Ayan Guha
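[Whichever host ends up running this (a custom receiver, a periodic Spark job,
or a plain app), the producing side reduces to a small poll-transform-put loop.
A hedged sketch with the AWS Java SDK follows; the table, the "picked" flag,
and the JSON shape are all placeholders, and a real job would also advance a
high-water mark so rows are not re-sent:]

import java.nio.ByteBuffer
import java.sql.DriverManager
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.PutRecordRequest

val kinesis = new AmazonKinesisClient() // credentials from the default provider chain
val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/proddb") // placeholder DSN
val rs = conn.createStatement()
  .executeQuery("SELECT id, payload FROM cdc_table WHERE picked = 0") // placeholder CDC table
while (rs.next()) {
  // One Kinesis record per changed row, partitioned by row id.
  val json = s"""{"id":${rs.getLong("id")},"payload":"${rs.getString("payload")}"}"""
  kinesis.putRecord(new PutRecordRequest()
    .withStreamName("cdc-stream")
    .withPartitionKey(rs.getLong("id").toString)
    .withData(ByteBuffer.wrap(json.getBytes("UTF-8"))))
}
conn.close()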

RE: JDBC Streams

Posted by Ashic Mahtab <as...@live.com>.
Hi Ayan,

How "continuous" is your workload? As Akhil points out, with streaming you'll
give up at least one core for receiving, and will need at most one more core
for processing. Unless you're running on something like Mesos, this means
that those cores are dedicated to your app, and can't be leveraged by other
apps / jobs.

If it's something periodic (once an hour, once every 15 minutes, etc.), then
I'd simply write a "normal" Spark application, and trigger it periodically.
There are many things that can take care of that; sometimes a simple cronjob
is enough!
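[For the periodic route, the whole scheduling story can be as small as one
crontab line; the paths, class name, and schedule here are placeholders, and
the job itself would be an ordinary batch Spark application:]

# Run the JDBC-to-HBase Spark job every 15 minutes.
*/15 * * * * /opt/spark/bin/spark-submit --master yarn --class com.example.DbToHBaseJob /opt/jobs/db-to-hbase.jar >> /var/log/db-to-hbase.log 2>&1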


Re: JDBC Streams

Posted by ayan guha <gu...@gmail.com>.
Thanks Akhil. In case I go with Spark Streaming, I guess I have to implement
a custom receiver, and Spark Streaming will call this receiver every batch
interval; is that correct? Any gotchas you see in this plan? TIA... Best, Ayan


-- 
Best Regards,
Ayan Guha

Re: JDBC Streams

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
If you want a long-running application, then go with Spark Streaming (which
kind of blocks your resources). On the other hand, if you use the job server,
then you can actually use the resources (CPUs) for other jobs when your DB
job is not using them.

Thanks
Best Regards
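[For concreteness, the job server route means packaging the DB-to-HBase logic
as a job and triggering runs over REST. A sketch assuming the endpoints
documented in the spark-jobserver README; host, app name, context, and class
are placeholders:]

# Upload the jar once, then have cron (or anything else) trigger runs on a shared context.
curl --data-binary @db-to-hbase.jar localhost:8090/jars/dbjob
curl -d "" 'localhost:8090/jobs?appName=dbjob&classPath=com.example.DbToHBaseJob&context=shared'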


Re: JDBC Streams

Posted by Jörn Franke <jo...@gmail.com>.
I would use Sqoop. It has been designed exactly for these types of
scenarios. Spark Streaming does not make sense here.
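[A hedged sketch of what that would look like: an incremental Sqoop import
straight from the JDBC source into HBase, run every few minutes by cron or
Oozie. The host, tables, columns, and last-value are placeholders:]

sqoop import \
  --connect jdbc:mysql://db-host:3306/proddb \
  --username etl --password-file /user/etl/.pw \
  --table change_log \
  --hbase-table change_log --column-family d --hbase-row-key id \
  --incremental lastmodified --check-column updated_at \
  --last-value "2015-07-05 00:00:00"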
