Posted to user@spark.apache.org by Uthayan Suthakar <ut...@gmail.com> on 2015/10/27 20:02:35 UTC

[Spark Streaming] Connect to Database only once at the start of Streaming job

Hello all,

What I want to do is configure the Spark Streaming job to read the
database using JdbcRDD and cache the results. This should happen only once,
at the start of the job; it should not make any further connections to the
DB afterwards. Is it possible to do that?

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

Posted by Tathagata Das <td...@databricks.com>.
However, if an executor dies, Spark may reconnect to JDBC to
reconstruct the RDD partitions that were lost. To prevent that, you can
checkpoint the RDD to an HDFS-like filesystem (using rdd.checkpoint()). Then
you are safe; it won't reconnect to JDBC.
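Putting the two replies together, a minimal sketch might look like the following. The connection URL, credentials, table name, and key bounds are all hypothetical placeholders, not anything from the thread; this needs a running Spark cluster and database to actually execute.

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object CachedJdbcAtStartup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachedJdbcAtStartup"))
    // Needed so rdd.checkpoint() has an HDFS-like directory to write to.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Hypothetical connection details and table; replace with your own.
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass"),
      // JdbcRDD requires exactly two '?' placeholders for the partition bounds.
      "SELECT id, value FROM lookup WHERE id >= ? AND id <= ?",
      1, 1000, 3, // lowerBound, upperBound, numPartitions
      (rs: ResultSet) => (rs.getInt("id"), rs.getString("value"))
    )

    rdd.cache()      // keep the rows in executor memory
    rdd.checkpoint() // persist to HDFS so lost partitions are not re-read over JDBC
    rdd.count()      // force evaluation once, at the start of the job
  }
}
```

Note that checkpoint() is called before the first action, so the count() both populates the cache and writes the checkpoint; after that, neither executor loss nor reuse of the RDD should touch the database again.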


On Tue, Oct 27, 2015 at 11:17 PM, Tathagata Das <td...@databricks.com> wrote:

> Yeah, of course. Just create an RDD from jdbc, call cache()/persist(),
> then force it to be evaluated using something like count(). Once it is
> cached, you can use it in a StreamingContext. Because of the cache it
> should not access JDBC any more.
>
> On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru <
> diplomaticguru@gmail.com> wrote:
>
>> I know it uses lazy model, which is why I was wondering.
>>
>> On 27 October 2015 at 19:02, Uthayan Suthakar <uthayan.suthakar@gmail.com
>> > wrote:
>>
>>> Hello all,
>>>
>>> What I wanted to do is configure the spark streaming job to read the
>>> database using JdbcRDD and cache the results. This should occur only once
>>> at the start of the job. It should not make any further connection to DB
>>>  afterwards. Is it possible to do that?
>>>
>>
>>
>

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

Posted by Tathagata Das <td...@databricks.com>.
Yeah, of course. Just create an RDD from JDBC, call cache()/persist(), then
force it to be evaluated using something like count(). Once it is cached,
you can use it in a StreamingContext. Because of the cache, it should not
access JDBC any more.
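To "use it in a StreamingContext" as described above, one common pattern is to join each batch against the cached RDD via transform. This is a sketch under assumptions: the socket source, port, and keying scheme are hypothetical, and `staticRdd` stands for the cached (Int, String) RDD built from JdbcRDD at startup.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes `sc` is an existing SparkContext and `staticRdd` is the
// already-cached lookup RDD created once at job start.
def enrichStream(sc: SparkContext, staticRdd: RDD[(Int, String)]): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  // Hypothetical source; any DStream of (id, payload) pairs would do.
  val events = ssc.socketTextStream("localhost", 9999)
    .map(line => (line.trim.toInt, line))
  // Each batch is joined against the in-memory RDD; no JDBC access happens here.
  val enriched = events.transform(batch => batch.join(staticRdd))
  enriched.print()
  ssc
}
```

The join runs entirely against the cached partitions, which is what makes the "connect only once" behavior hold across batches.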

On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru <di...@gmail.com>
wrote:

> I know it uses lazy model, which is why I was wondering.
>
> On 27 October 2015 at 19:02, Uthayan Suthakar <ut...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> What I wanted to do is configure the spark streaming job to read the
>> database using JdbcRDD and cache the results. This should occur only once
>> at the start of the job. It should not make any further connection to DB
>>  afterwards. Is it possible to do that?
>>
>
>

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

Posted by diplomatic Guru <di...@gmail.com>.
I know it uses a lazy evaluation model, which is why I was wondering.

On 27 October 2015 at 19:02, Uthayan Suthakar <ut...@gmail.com>
wrote:

> Hello all,
>
> What I wanted to do is configure the spark streaming job to read the
> database using JdbcRDD and cache the results. This should occur only once
> at the start of the job. It should not make any further connection to DB
>  afterwards. Is it possible to do that?
>