Posted to user@spark.apache.org by Chen Song <ch...@gmail.com> on 2015/08/26 16:46:07 UTC

Re: JDBC Streams

Piggyback on this question.

I have a similar use case, but a bit different. My job consumes a stream
from Kafka, and I need to join the Kafka stream with a reference table
from MySQL (for data validation and enrichment). I need to process this
stream every 1 min. The data in MySQL does not change very often, maybe once
every few days.

So my concerns are:

* I cannot easily use a broadcast variable, because the data does change,
although not very often.
* I am not sure whether it is good practice to read data from MySQL in every
batch (in my case, every 1 min).

Has anyone done this before? Any suggestions and feedback are appreciated.

Chen
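
(For context, the naive per-batch approach being asked about might look
roughly like the Scala sketch below; the JDBC URL, table, and column names
are made up for illustration, and the replies further down suggest caching
alternatives.)

    import java.sql.DriverManager

    import org.apache.spark.streaming.dstream.DStream

    // Minimal sketch: reload the (small) reference table once per batch and
    // join it against the Kafka stream. Names below are illustrative only.
    def enrich(kafkaStream: DStream[(String, String)],
               jdbcUrl: String): DStream[(String, String, Option[String])] = {
      kafkaStream.transform { rdd =>
        // The transform function runs on the driver for every batch (every
        // 1 min here), so this issues one MySQL query per batch; the resulting
        // Map is captured in the closure and shipped to executors with the tasks.
        val conn = DriverManager.getConnection(jdbcUrl, "user", "password")
        val ref: Map[String, String] =
          try {
            val rs = conn.createStatement()
              .executeQuery("SELECT id, label FROM ref_table")
            val m = scala.collection.mutable.Map[String, String]()
            while (rs.next()) m += rs.getString("id") -> rs.getString("label")
            m.toMap
          } finally conn.close()
        rdd.map { case (key, value) => (key, value, ref.get(key)) }
      }
    }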


On Sun, Jul 5, 2015 at 11:50 AM, Ashic Mahtab <as...@live.com> wrote:

> If it is indeed a reactive use case, then Spark Streaming would be a good
> choice.
>
> One approach worth considering: is it possible to receive a message via
> Kafka (or some other queue)? That would not need any polling, and you could
> use standard consumers. If polling isn't an issue, then writing a custom
> receiver will work fine. The way a receiver works is this:
>
> * Your receiver has a receive() function, where you'd typically start a
> loop. In your loop, you'd fetch items and call store(entry).
> * You control everything in the receiver. If you're listening on a queue,
> you receive messages, store() them, and ack your queue. If you're polling,
> it's up to you to ensure delays between DB calls.
> * The things you store() go on to make up the RDDs in your DStream. So
> intervals, windowing, etc. apply to those. The receiver is the boundary
> between your data source and the DStream RDDs. In other words, if your
> interval is 15 seconds with no windowing, then the things that went to
> store() every 15 seconds are bunched up into an RDD of your DStream. That's
> kind of a simplification, but it should give you the idea that your "db
> polling" interval and streaming interval are not tied together.
>
> -Ashic.
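
(For illustration, a minimal Scala sketch of such a polling receiver might
look like the following; the class name is made up and fetchBatch() is a
placeholder for the actual JDBC query.)

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Minimal polling receiver sketch; fetchBatch() is a placeholder for the
    // real JDBC query, and the poll interval is entirely under your control.
    class JdbcPollingReceiver(jdbcUrl: String, pollIntervalMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        // onStart() must not block, so do the polling on its own thread.
        new Thread("jdbc-polling-receiver") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = {
        // Nothing to do: the polling loop checks isStopped() and exits.
      }

      private def poll(): Unit = {
        while (!isStopped()) {
          fetchBatch(jdbcUrl).foreach(record => store(record)) // hand rows to Spark
          Thread.sleep(pollIntervalMs)                         // delay between DB calls
        }
      }

      // Placeholder: run the JDBC query and return the new rows as strings.
      private def fetchBatch(url: String): Seq[String] = Seq.empty
    }

(It would be wired in with something like
ssc.receiverStream(new JdbcPollingReceiver(jdbcUrl, 60000)), with the
streaming batch interval chosen independently of pollIntervalMs.)
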
>
> ------------------------------
> Date: Mon, 6 Jul 2015 01:12:34 +1000
> Subject: Re: JDBC Streams
> From: guha.ayan@gmail.com
> To: ashic@live.com
> CC: akhil@sigmoidanalytics.com; user@spark.apache.org
>
>
> Hi
>
> Thanks for the reply. Here is my situation: I have a DB which enables
> synchronous CDC; think of this as a DB trigger which writes to a table with
> "changed" values as soon as something changes in the production table. My
> job will need to pick up the data "as soon as it arrives", which can be at a
> 1 min interval. Ideally it will pick up the changes, transform them into
> JSON, and put them to Kinesis. In short, I am emulating a Kinesis producer
> with a DB source (don't even ask why, let's say these are the constraints :) )
>
> Please advise: (a) is Spark a good choice here? (b) what's your suggestion
> either way?
>
> I understand I can easily do it using a simple Java/Python app, but I am a
> little worried about managing scaling/fault tolerance, and that's where my
> concern is.
>
> TIA
> Ayan
>
> On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab <as...@live.com> wrote:
>
> Hi Ayan,
> How "continuous" is your workload? As Akhil points out, with streaming,
> you'll give up at least one core for receiving, will need at most one more
> core for processing. Unless you're running on something like Mesos, this
> means that those cores are dedicated to your app, and can't be leveraged by
> other apps / jobs.
>
> If it's something periodic (once an hour, once every 15 minutes, etc.),
> then I'd simply write a "normal" Spark application and trigger it
> periodically. There are many things that can take care of that; sometimes
> a simple cron job is enough!
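
(As a rough illustration of such a "normal" Spark application for the
CDC-to-Kinesis case described above: the table name, connection details, and
the Kinesis call below are placeholders, and watermark handling for reading
only new changes is omitted.)

    import java.util.Properties

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch of a periodically triggered batch job: read the CDC table
    // over JDBC, turn rows into JSON, and hand them to a Kinesis producer.
    object CdcToKinesis {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cdc-to-kinesis"))
        val sqlContext = new SQLContext(sc)

        val props = new Properties()
        props.setProperty("user", "user")
        props.setProperty("password", "password")

        // Hypothetical CDC table written by the DB trigger.
        val changes = sqlContext.read.jdbc("jdbc:mysql://dbhost/db", "cdc_changes", props)

        changes.toJSON.foreachPartition { rows =>
          // Placeholder producer: wire up the AWS Kinesis SDK (or KPL) here,
          // creating the client once per partition.
          def sendToKinesis(json: String): Unit = ()
          rows.foreach(sendToKinesis)
        }

        sc.stop()
      }
    }
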
>
> ------------------------------
> Date: Sun, 5 Jul 2015 22:48:37 +1000
> Subject: Re: JDBC Streams
> From: guha.ayan@gmail.com
> To: akhil@sigmoidanalytics.com
> CC: user@spark.apache.org
>
>
> Thanks Akhil. In case I go with Spark Streaming, I guess I have to
> implement a custom receiver, and Spark Streaming will call this receiver
> every batch interval, is that correct? Any gotchas you see in this plan?
> TIA... Best, Ayan
>
> On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
> If you want a long-running application, then go with Spark Streaming
> (which kind of blocks your resources). On the other hand, if you use the job
> server, then you can actually use the resources (CPUs) for other jobs as
> well when your DB job is not using them.
>
> Thanks
> Best Regards
>
> On Sun, Jul 5, 2015 at 5:28 AM, ayan guha <gu...@gmail.com> wrote:
>
> Hi All
>
> I have a requirement to connect to a DB every few minutes and bring data to
> HBase. Can anyone suggest whether Spark Streaming would be appropriate for
> this scenario, or should I look into jobserver?
>
> Thanks in advance
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Chen Song

Re: JDBC Streams

Posted by Cody Koeninger <co...@koeninger.org>.
Yes

On Wed, Aug 26, 2015 at 10:23 AM, Chen Song <ch...@gmail.com> wrote:

> Thanks Cody.
>
> Are you suggesting putting the cache in a global context in each executor
> JVM, in a Scala object for example, and then having a scheduled task refresh
> the cache (or having it triggered by the expiry if using Guava)?
>
> Chen

Re: JDBC Streams

Posted by Chen Song <ch...@gmail.com>.
Thanks Cody.

Are you suggesting putting the cache in a global context in each executor JVM,
in a Scala object for example, and then having a scheduled task refresh the
cache (or having it triggered by the expiry if using Guava)?

Chen
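
(For what it's worth, a minimal sketch of that pattern might look like the
Scala object below; the connection details, table, and 6-hour refresh
interval are made up, and each executor JVM initializes the object lazily on
first use.)

    import java.sql.DriverManager
    import java.util.concurrent.{Executors, TimeUnit}

    // Per-JVM singleton holding the reference table, refreshed on a schedule.
    // All names and intervals here are illustrative. In practice you'd use a
    // daemon thread factory so the refresher doesn't block executor shutdown.
    object ReferenceData {

      @volatile private var table: Map[String, String] = load()

      private val refresher = Executors.newSingleThreadScheduledExecutor()
      refresher.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = { table = load() }
      }, 6, 6, TimeUnit.HOURS)

      private def load(): Map[String, String] = {
        val conn = DriverManager.getConnection("jdbc:mysql://dbhost/db", "user", "password")
        try {
          val rs = conn.createStatement()
            .executeQuery("SELECT id, label FROM ref_table")
          val m = scala.collection.mutable.Map[String, String]()
          while (rs.next()) m += rs.getString("id") -> rs.getString("label")
          m.toMap
        } finally conn.close()
      }

      def lookup(key: String): Option[String] = table.get(key)
    }

(Tasks would then call ReferenceData.lookup(...) from map or mapPartitions, so
each executor pays one MySQL read per refresh interval and nothing needs to be
re-broadcast.)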



-- 
Chen Song

Re: JDBC Streams

Posted by Cody Koeninger <co...@koeninger.org>.
If your data only changes every few days, why not restart the job every few
days and just broadcast the data?

Or you can keep a local per-JVM cache with an expiry (e.g. a Guava cache) to
avoid many MySQL reads.
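
(A minimal sketch of the second option, using a Guava LoadingCache where a
single key holds the whole reference table; the URL, table, and 6-hour expiry
are illustrative only.)

    import java.sql.DriverManager
    import java.util.concurrent.TimeUnit

    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // Per-JVM cache with an expiry: after 6 hours the next lookup reloads the
    // reference table from MySQL. Names and intervals are illustrative only.
    object RefTableCache {

      private val cache: LoadingCache[String, Map[String, String]] =
        CacheBuilder.newBuilder()
          .expireAfterWrite(6, TimeUnit.HOURS)
          .build(new CacheLoader[String, Map[String, String]] {
            override def load(key: String): Map[String, String] = {
              val conn = DriverManager.getConnection("jdbc:mysql://dbhost/db", "user", "password")
              try {
                val rs = conn.createStatement()
                  .executeQuery("SELECT id, label FROM ref_table")
                val m = scala.collection.mutable.Map[String, String]()
                while (rs.next()) m += rs.getString("id") -> rs.getString("label")
                m.toMap
              } finally conn.close()
            }
          })

      // Executors call this from map/mapPartitions; most calls hit the cache.
      def get(): Map[String, String] = cache.get("ref")
    }

(After the expiry, the next get() on that executor reloads the table from
MySQL; all other calls are served from memory.)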
