Posted to user@spark.apache.org by Hemanth Gudela <he...@qvantel.com> on 2017/04/20 20:08:57 UTC

Spark structured streaming: Is it possible to periodically refresh static data frame?

Hello,

I am working on a use case where there is a need to join a streaming data frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas the static data frame fetches data from a database table.

However, as the underlying database table is updated often, I must somehow manage to refresh my static data frame periodically to get the latest information from the underlying database table.

My questions:

1.       Is it possible to periodically refresh a static data frame?

2.       If refreshing the static data frame is not possible, is there a mechanism to automatically stop & restart the spark structured streaming job, so that every time the job restarts, the static data frame gets updated with the latest information from the underlying database table?

3.       If 1) and 2) are not possible, please suggest alternatives to achieve my requirement described above.

Thanks,
Hemanth

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Hemanth Gudela <he...@qvantel.com>.
Thank you Georg, Gene for your ideas.
For now, I am using Scala “Futures” to asynchronously run a background thread that periodically creates a new dataframe fetching the latest data from the underlying table, and re-registers the temp view with the same name as the one used by the main thread’s static dataframe.
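
Roughly, the sketch looks like the following (the JDBC URL, table, view name and interval below are just placeholders; the JDBC driver jar must be on the classpath):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("stream-with-refreshed-view").getOrCreate()

// Placeholder connection details.
val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"
val dbTable = "lookup_table"

def loadSnapshot(): DataFrame =
  spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", dbTable)
    .load()

// Register the initial snapshot under the fixed view name used by the streaming query.
loadSnapshot().createOrReplaceTempView("static_lookup")

// Background refresh: every 5 minutes, re-read the table and re-register the view
// under the same name, in the same SparkSession as the streaming query.
Future {
  while (true) {
    Thread.sleep(5 * 60 * 1000L)
    loadSnapshot().createOrReplaceTempView("static_lookup")
  }
}

// The streaming side keeps referring to spark.table("static_lookup") in its join,
// so later micro-batches see the refreshed snapshot.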

This looks to be working for me now, but if this solution leads to other problems, I will look into persisted views in Hive / Alluxio.

Regards,
Hemanth

From: Gene Pang <ge...@gmail.com>
Date: Saturday, 22 April 2017 at 0.30
To: Georg Heiler <ge...@gmail.com>
Cc: Hemanth Gudela <he...@qvantel.com>, Tathagata Das <ta...@gmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Hi Georg,

Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio.

This article on how Alluxio is used for a Spark streaming use case<https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio> may be helpful.

Thanks,
Gene

On Fri, Apr 21, 2017 at 8:22 AM, Georg Heiler <ge...@gmail.com>> wrote:
You could write your views to hive or maybe tachyon.

Is the periodically updated data big?

Hemanth Gudela <he...@qvantel.com>> schrieb am Fr. 21. Apr. 2017 um 16:55:
Being new to spark, I think I need your suggestion again.

#2 you can always define a batch Dataframe and register it as view, and then run a background then periodically creates a new Dataframe with updated data and re-registers it as a view with the same name

I seem to have misunderstood your statement and tried registering static dataframe as a temp view (“myTempView”) using createOrReplaceView in one spark session, and tried re-registering another refreshed dataframe as temp view with same name (“myTempView”) in another session. However, with this approach, I have failed to achieve what I’m aiming for, because views are local to one spark session.
From spark 2.1.0 onwards, Global view is a nice feature, but still would not solve my problem, because global view cannot be updated.

So after much thinking, I understood that you would have meant to use a background running process in the same spark job that would periodically create a new dataframe and re-register temp view with same name, within the same spark session.
Could you please give me some pointers to documentation on how to create such asynchronous background process in spark streaming? Is Scala’s “Futures” the way to achieve this?

Thanks,
Hemanth


From: Tathagata Das <ta...@gmail.com>>

Date: Friday, 21 April 2017 at 0.03
To: Hemanth Gudela <he...@qvantel.com>>
Cc: Georg Heiler <ge...@gmail.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>

Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Here are couple of ideas.
1. You can set up a Structured Streaming query to update in-memory table.
Look at the memory sink in the programming guide - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
So you can query the latest table using a specified table name, and also join that table with another stream. However, note that this in-memory table is maintained in the driver, and so you have be careful about the size of the table.

2. If you cannot define a streaming query in the slow moving due to unavailability of connector for your streaming data source, then you can always define a batch Dataframe and register it as view, and then run a background then periodically creates a new Dataframe with updated data and re-registers it as a view with the same name. Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated.

Hope this helps.


On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <he...@qvantel.com>> wrote:
Thanks Georg for your reply.
But I’m not sure if I fully understood your answer.

If you meant to join two streams (one reading Kafka, and another reading database table), then I think it’s not possible, because

1.       According to documentation<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>, Structured streaming does not support database as a streaming source

2.       Joining between two streams is not possible yet.

Regards,
Hemanth

From: Georg Heiler <ge...@gmail.com>>
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela <he...@qvantel.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

What about treating the static data as a (slow) stream as well?

Hemanth Gudela <he...@qvantel.com>> schrieb am Do., 20. Apr. 2017 um 22:09 Uhr:
Hello,

I am working on a use case where there is a need to join streaming data frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas static data frame fetches data from a database table.

However, as the underlying database table is getting updated often, I must somehow manage to refresh my static data frame periodically to get the latest information from underlying database table.

My questions:

1.       Is it possible to periodically refresh static data frame?

2.       If refreshing static data frame is not possible, is there a mechanism to automatically stop & restarting spark structured streaming job, so that every time the job restarts, the static data frame gets updated with latest information from underlying database table.

3.       If 1) and 2) are not possible, please suggest alternatives to achieve my requirement described above.

Thanks,
Hemanth



Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Gene Pang <ge...@gmail.com>.
Hi Georg,

Yes, that should be possible with Alluxio. Tachyon was renamed to Alluxio.

This article on how Alluxio is used for a Spark streaming use case
<https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>
may be helpful.
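
As an illustration (paths, master address and table names below are placeholders, and the Alluxio client jar needs to be on the Spark classpath), the refresh job can overwrite a Parquet snapshot on an Alluxio path, and the streaming application (even a different SparkSession) can re-read and re-register it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("alluxio-snapshot").getOrCreate()

// Placeholder Alluxio path (the default master port is 19998).
val snapshotPath = "alluxio://alluxio-master:19998/snapshots/lookup_table"

// Writer side: periodically overwrite the snapshot with fresh data from the database.
val latestDf = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // placeholder
  .option("dbtable", "lookup_table")                      // placeholder
  .load()
latestDf.write.mode("overwrite").parquet(snapshotPath)

// Reader side (can be another Spark application): load the snapshot and register it
// under the view name the streaming query joins against.
val staticDf = spark.read.parquet(snapshotPath)
staticDf.createOrReplaceTempView("static_lookup")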

Thanks,
Gene

On Fri, Apr 21, 2017 at 8:22 AM, Georg Heiler <ge...@gmail.com>
wrote:

> You could write your views to hive or maybe tachyon.
>
> Is the periodically updated data big?
>
> Hemanth Gudela <he...@qvantel.com> schrieb am Fr. 21. Apr. 2017
> um 16:55:
>
>> Being new to spark, I think I need your suggestion again.
>>
>>
>>
>> #2 you can always define a batch Dataframe and register it as view, and
>> then run a background then periodically creates a new Dataframe with
>> updated data and re-registers it as a view with the same name
>>
>>
>>
>> I seem to have misunderstood your statement and tried registering static
>> dataframe as a temp view (“myTempView”) using createOrReplaceView in one
>> spark session, and tried re-registering another refreshed dataframe as temp
>> view with same name (“myTempView”) in another session. However, with this
>> approach, I have failed to achieve what I’m aiming for, because views are
>> local to one spark session.
>>
>> From spark 2.1.0 onwards, Global view is a nice feature, but still would
>> not solve my problem, because global view cannot be updated.
>>
>>
>>
>> So after much thinking, I understood that you would have meant to use a
>> background running process in the same spark job that would periodically
>> create a new dataframe and re-register temp view with same name, within the
>> same spark session.
>>
>> Could you please give me some pointers to documentation on how to create
>> such asynchronous background process in spark streaming? Is Scala’s
>> “Futures” the way to achieve this?
>>
>>
>>
>> Thanks,
>>
>> Hemanth
>>
>>
>>
>>
>>
>> *From: *Tathagata Das <ta...@gmail.com>
>>
>>
>> *Date: *Friday, 21 April 2017 at 0.03
>> *To: *Hemanth Gudela <he...@qvantel.com>
>>
>> *Cc: *Georg Heiler <ge...@gmail.com>, "user@spark.apache.org" <
>> user@spark.apache.org>
>>
>>
>> *Subject: *Re: Spark structured streaming: Is it possible to
>> periodically refresh static data frame?
>>
>>
>>
>> Here are couple of ideas.
>>
>> 1. You can set up a Structured Streaming query to update in-memory table.
>>
>> Look at the memory sink in the programming guide -
>> http://spark.apache.org/docs/latest/structured-
>> streaming-programming-guide.html#output-sinks
>>
>> So you can query the latest table using a specified table name, and also
>> join that table with another stream. However, note that this in-memory
>> table is maintained in the driver, and so you have be careful about the
>> size of the table.
>>
>>
>>
>> 2. If you cannot define a streaming query in the slow moving due to
>> unavailability of connector for your streaming data source, then you can
>> always define a batch Dataframe and register it as view, and then run a
>> background then periodically creates a new Dataframe with updated data and
>> re-registers it as a view with the same name. Any streaming query that
>> joins a streaming dataframe with the view will automatically start using
>> the most updated data as soon as the view is updated.
>>
>>
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <
>> hemanth.gudela@qvantel.com> wrote:
>>
>> Thanks Georg for your reply.
>>
>> But I’m not sure if I fully understood your answer.
>>
>>
>>
>> If you meant to join two streams (one reading Kafka, and another reading
>> database table), then I think it’s not possible, because
>>
>> 1.       According to documentation
>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
>> Structured streaming does not support database as a streaming source
>>
>> 2.       Joining between two streams is not possible yet.
>>
>>
>>
>> Regards,
>>
>> Hemanth
>>
>>
>>
>> *From: *Georg Heiler <ge...@gmail.com>
>> *Date: *Thursday, 20 April 2017 at 23.11
>> *To: *Hemanth Gudela <he...@qvantel.com>, "user@spark.apache.org"
>> <us...@spark.apache.org>
>> *Subject: *Re: Spark structured streaming: Is it possible to
>> periodically refresh static data frame?
>>
>>
>>
>> What about treating the static data as a (slow) stream as well?
>>
>>
>>
>> Hemanth Gudela <he...@qvantel.com> schrieb am Do., 20. Apr.
>> 2017 um 22:09 Uhr:
>>
>> Hello,
>>
>>
>>
>> I am working on a use case where there is a need to join streaming data
>> frame with a static data frame.
>>
>> The streaming data frame continuously gets data from Kafka topics,
>> whereas static data frame fetches data from a database table.
>>
>>
>>
>> However, as the underlying database table is getting updated often, I
>> must somehow manage to refresh my static data frame periodically to get the
>> latest information from underlying database table.
>>
>>
>>
>> My questions:
>>
>> 1.       Is it possible to periodically refresh static data frame?
>>
>> 2.       If refreshing static data frame is not possible, is there a
>> mechanism to automatically stop & restarting spark structured streaming
>> job, so that every time the job restarts, the static data frame gets
>> updated with latest information from underlying database table.
>>
>> 3.       If 1) and 2) are not possible, please suggest alternatives to
>> achieve my requirement described above.
>>
>>
>>
>> Thanks,
>>
>> Hemanth
>>
>>
>>
>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Georg Heiler <ge...@gmail.com>.
You could write your views to Hive or maybe Tachyon.

Is the periodically updated data big?
Hemanth Gudela <he...@qvantel.com> schrieb am Fr. 21. Apr. 2017 um
16:55:

> Being new to spark, I think I need your suggestion again.
>
>
>
> #2 you can always define a batch Dataframe and register it as view, and
> then run a background then periodically creates a new Dataframe with
> updated data and re-registers it as a view with the same name
>
>
>
> I seem to have misunderstood your statement and tried registering static
> dataframe as a temp view (“myTempView”) using createOrReplaceView in one
> spark session, and tried re-registering another refreshed dataframe as temp
> view with same name (“myTempView”) in another session. However, with this
> approach, I have failed to achieve what I’m aiming for, because views are
> local to one spark session.
>
> From spark 2.1.0 onwards, Global view is a nice feature, but still would
> not solve my problem, because global view cannot be updated.
>
>
>
> So after much thinking, I understood that you would have meant to use a
> background running process in the same spark job that would periodically
> create a new dataframe and re-register temp view with same name, within the
> same spark session.
>
> Could you please give me some pointers to documentation on how to create
> such asynchronous background process in spark streaming? Is Scala’s
> “Futures” the way to achieve this?
>
>
>
> Thanks,
>
> Hemanth
>
>
>
>
>
> *From: *Tathagata Das <ta...@gmail.com>
>
>
> *Date: *Friday, 21 April 2017 at 0.03
> *To: *Hemanth Gudela <he...@qvantel.com>
>
> *Cc: *Georg Heiler <ge...@gmail.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> Here are couple of ideas.
>
> 1. You can set up a Structured Streaming query to update in-memory table.
>
> Look at the memory sink in the programming guide -
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
>
> So you can query the latest table using a specified table name, and also
> join that table with another stream. However, note that this in-memory
> table is maintained in the driver, and so you have be careful about the
> size of the table.
>
>
>
> 2. If you cannot define a streaming query in the slow moving due to
> unavailability of connector for your streaming data source, then you can
> always define a batch Dataframe and register it as view, and then run a
> background then periodically creates a new Dataframe with updated data and
> re-registers it as a view with the same name. Any streaming query that
> joins a streaming dataframe with the view will automatically start using
> the most updated data as soon as the view is updated.
>
>
>
> Hope this helps.
>
>
>
>
>
> On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <
> hemanth.gudela@qvantel.com> wrote:
>
> Thanks Georg for your reply.
>
> But I’m not sure if I fully understood your answer.
>
>
>
> If you meant to join two streams (one reading Kafka, and another reading
> database table), then I think it’s not possible, because
>
> 1.       According to documentation
> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
> Structured streaming does not support database as a streaming source
>
> 2.       Joining between two streams is not possible yet.
>
>
>
> Regards,
>
> Hemanth
>
>
>
> *From: *Georg Heiler <ge...@gmail.com>
> *Date: *Thursday, 20 April 2017 at 23.11
> *To: *Hemanth Gudela <he...@qvantel.com>, "user@spark.apache.org"
> <us...@spark.apache.org>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> What about treating the static data as a (slow) stream as well?
>
>
>
> Hemanth Gudela <he...@qvantel.com> schrieb am Do., 20. Apr. 2017
> um 22:09 Uhr:
>
> Hello,
>
>
>
> I am working on a use case where there is a need to join streaming data
> frame with a static data frame.
>
> The streaming data frame continuously gets data from Kafka topics, whereas
> static data frame fetches data from a database table.
>
>
>
> However, as the underlying database table is getting updated often, I must
> somehow manage to refresh my static data frame periodically to get the
> latest information from underlying database table.
>
>
>
> My questions:
>
> 1.       Is it possible to periodically refresh static data frame?
>
> 2.       If refreshing static data frame is not possible, is there a
> mechanism to automatically stop & restarting spark structured streaming
> job, so that every time the job restarts, the static data frame gets
> updated with latest information from underlying database table.
>
> 3.       If 1) and 2) are not possible, please suggest alternatives to
> achieve my requirement described above.
>
>
>
> Thanks,
>
> Hemanth
>
>
>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Hemanth Gudela <he...@qvantel.com>.
Being new to Spark, I think I need your suggestion again.

#2: you can always define a batch Dataframe and register it as a view, and then run a background thread that periodically creates a new Dataframe with updated data and re-registers it as a view with the same name

I seem to have misunderstood your statement and tried registering the static dataframe as a temp view (“myTempView”) using createOrReplaceTempView in one spark session, and re-registering another, refreshed dataframe as a temp view with the same name (“myTempView”) in another session. However, with this approach I failed to achieve what I am aiming for, because temp views are local to one spark session.
From Spark 2.1.0 onwards, the global view is a nice feature, but it still would not solve my problem, because a global view cannot be updated.

So after much thinking, I understood that you probably meant running a background process in the same spark job that periodically creates a new dataframe and re-registers the temp view with the same name, within the same spark session.
Could you please give me some pointers to documentation on how to create such an asynchronous background process in spark streaming? Is Scala’s “Futures” the way to achieve this?
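
(The kind of thing I am picturing is a small scheduled task along the following lines, with placeholder names; please correct me if this is the wrong direction:)

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("scheduled-view-refresh").getOrCreate()

// Placeholder connection settings.
val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"
val sourceTable = "lookup_table"

val scheduler = Executors.newSingleThreadScheduledExecutor()

val refreshTask = new Runnable {
  override def run(): Unit =
    spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", sourceTable)
      .load()
      .createOrReplaceTempView("myTempView")   // same name the streaming join uses
}

// Refresh immediately, then every 10 minutes, in the same SparkSession.
scheduler.scheduleAtFixedRate(refreshTask, 0, 10, TimeUnit.MINUTES)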

Thanks,
Hemanth


From: Tathagata Das <ta...@gmail.com>
Date: Friday, 21 April 2017 at 0.03
To: Hemanth Gudela <he...@qvantel.com>
Cc: Georg Heiler <ge...@gmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Here are couple of ideas.
1. You can set up a Structured Streaming query to update in-memory table.
Look at the memory sink in the programming guide - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
So you can query the latest table using a specified table name, and also join that table with another stream. However, note that this in-memory table is maintained in the driver, and so you have be careful about the size of the table.

2. If you cannot define a streaming query in the slow moving due to unavailability of connector for your streaming data source, then you can always define a batch Dataframe and register it as view, and then run a background then periodically creates a new Dataframe with updated data and re-registers it as a view with the same name. Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated.

Hope this helps.


On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <he...@qvantel.com>> wrote:
Thanks Georg for your reply.
But I’m not sure if I fully understood your answer.

If you meant to join two streams (one reading Kafka, and another reading database table), then I think it’s not possible, because

1.       According to documentation<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>, Structured streaming does not support database as a streaming source

2.       Joining between two streams is not possible yet.

Regards,
Hemanth

From: Georg Heiler <ge...@gmail.com>>
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela <he...@qvantel.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

What about treating the static data as a (slow) stream as well?

Hemanth Gudela <he...@qvantel.com>> schrieb am Do., 20. Apr. 2017 um 22:09 Uhr:
Hello,

I am working on a use case where there is a need to join streaming data frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas static data frame fetches data from a database table.

However, as the underlying database table is getting updated often, I must somehow manage to refresh my static data frame periodically to get the latest information from underlying database table.

My questions:

1.       Is it possible to periodically refresh static data frame?

2.       If refreshing static data frame is not possible, is there a mechanism to automatically stop & restarting spark structured streaming job, so that every time the job restarts, the static data frame gets updated with latest information from underlying database table.

3.       If 1) and 2) are not possible, please suggest alternatives to achieve my requirement described above.

Thanks,
Hemanth


Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Georg Heiler <ge...@gmail.com>.
Unfortunately, I think this currently might require the old API.
Hemanth Gudela <he...@qvantel.com> schrieb am Fr. 21. Apr. 2017 um
05:58:

> Idea #2 probably suits my needs better, because
>
> -          Streaming query does not have a source database connector yet
>
> -          My source database table is big, so in-memory table could be
> huge for driver to handle.
>
>
>
> Thanks for cool ideas, TD!
>
>
>
> Regards,
>
> Hemanth
>
>
>
> *From: *Tathagata Das <ta...@gmail.com>
> *Date: *Friday, 21 April 2017 at 0.03
> *To: *Hemanth Gudela <he...@qvantel.com>
> *Cc: *Georg Heiler <ge...@gmail.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> Here are couple of ideas.
>
> 1. You can set up a Structured Streaming query to update in-memory table.
>
> Look at the memory sink in the programming guide -
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
>
> So you can query the latest table using a specified table name, and also
> join that table with another stream. However, note that this in-memory
> table is maintained in the driver, and so you have be careful about the
> size of the table.
>
>
>
> 2. If you cannot define a streaming query in the slow moving due to
> unavailability of connector for your streaming data source, then you can
> always define a batch Dataframe and register it as view, and then run a
> background then periodically creates a new Dataframe with updated data and
> re-registers it as a view with the same name. Any streaming query that
> joins a streaming dataframe with the view will automatically start using
> the most updated data as soon as the view is updated.
>
>
>
> Hope this helps.
>
>
>
>
>
> On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <
> hemanth.gudela@qvantel.com> wrote:
>
> Thanks Georg for your reply.
>
> But I’m not sure if I fully understood your answer.
>
>
>
> If you meant to join two streams (one reading Kafka, and another reading
> database table), then I think it’s not possible, because
>
> 1.       According to documentation
> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
> Structured streaming does not support database as a streaming source
>
> 2.       Joining between two streams is not possible yet.
>
>
>
> Regards,
>
> Hemanth
>
>
>
> *From: *Georg Heiler <ge...@gmail.com>
> *Date: *Thursday, 20 April 2017 at 23.11
> *To: *Hemanth Gudela <he...@qvantel.com>, "user@spark.apache.org"
> <us...@spark.apache.org>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> What about treating the static data as a (slow) stream as well?
>
>
>
> Hemanth Gudela <he...@qvantel.com> schrieb am Do., 20. Apr. 2017
> um 22:09 Uhr:
>
> Hello,
>
>
>
> I am working on a use case where there is a need to join streaming data
> frame with a static data frame.
>
> The streaming data frame continuously gets data from Kafka topics, whereas
> static data frame fetches data from a database table.
>
>
>
> However, as the underlying database table is getting updated often, I must
> somehow manage to refresh my static data frame periodically to get the
> latest information from underlying database table.
>
>
>
> My questions:
>
> 1.       Is it possible to periodically refresh static data frame?
>
> 2.       If refreshing static data frame is not possible, is there a
> mechanism to automatically stop & restarting spark structured streaming
> job, so that every time the job restarts, the static data frame gets
> updated with latest information from underlying database table.
>
> 3.       If 1) and 2) are not possible, please suggest alternatives to
> achieve my requirement described above.
>
>
>
> Thanks,
>
> Hemanth
>
>
>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Hemanth Gudela <he...@qvantel.com>.
Idea #2 probably suits my needs better, because

-          The streaming query does not have a source database connector yet

-          My source database table is big, so an in-memory table could be too big for the driver to handle.

Thanks for cool ideas, TD!

Regards,
Hemanth

From: Tathagata Das <ta...@gmail.com>
Date: Friday, 21 April 2017 at 0.03
To: Hemanth Gudela <he...@qvantel.com>
Cc: Georg Heiler <ge...@gmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Here are couple of ideas.
1. You can set up a Structured Streaming query to update in-memory table.
Look at the memory sink in the programming guide - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
So you can query the latest table using a specified table name, and also join that table with another stream. However, note that this in-memory table is maintained in the driver, and so you have be careful about the size of the table.

2. If you cannot define a streaming query in the slow moving due to unavailability of connector for your streaming data source, then you can always define a batch Dataframe and register it as view, and then run a background then periodically creates a new Dataframe with updated data and re-registers it as a view with the same name. Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated.

Hope this helps.


On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <he...@qvantel.com>> wrote:
Thanks Georg for your reply.
But I’m not sure if I fully understood your answer.

If you meant to join two streams (one reading Kafka, and another reading database table), then I think it’s not possible, because

1.       According to documentation<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>, Structured streaming does not support database as a streaming source

2.       Joining between two streams is not possible yet.

Regards,
Hemanth

From: Georg Heiler <ge...@gmail.com>>
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela <he...@qvantel.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

What about treating the static data as a (slow) stream as well?

Hemanth Gudela <he...@qvantel.com>> schrieb am Do., 20. Apr. 2017 um 22:09 Uhr:
Hello,

I am working on a use case where there is a need to join streaming data frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas static data frame fetches data from a database table.

However, as the underlying database table is getting updated often, I must somehow manage to refresh my static data frame periodically to get the latest information from underlying database table.

My questions:

1.       Is it possible to periodically refresh static data frame?

2.       If refreshing static data frame is not possible, is there a mechanism to automatically stop & restarting spark structured streaming job, so that every time the job restarts, the static data frame gets updated with latest information from underlying database table.

3.       If 1) and 2) are not possible, please suggest alternatives to achieve my requirement described above.

Thanks,
Hemanth


Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Tathagata Das <ta...@gmail.com>.
Here are a couple of ideas.
1. You can set up a Structured Streaming query to update an in-memory table.
Look at the memory sink in the programming guide - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
So you can query the latest table using a specified table name, and also join that table with another stream. However, note that this in-memory table is maintained in the driver, so you have to be careful about the size of the table.
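
A rough sketch of this idea (the slow source, Kafka broker, topic and column names are placeholders, and the Kafka source needs the spark-sql-kafka package):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("memory-sink-sketch").getOrCreate()

// Hypothetical slow-moving source: JSON files dropped into a directory.
// Any streaming source works; the point here is the memory sink.
val lookupSchema = new StructType()
  .add("key", StringType)
  .add("attributes", StringType)

val slowStream = spark.readStream
  .schema(lookupSchema)
  .json("/data/lookup_updates")

// The memory sink keeps the streamed rows as an in-memory table on the driver,
// registered under the queryName, so keep an eye on its size.
val lookupQuery = slowStream.writeStream
  .format("memory")
  .queryName("lookup")
  .outputMode("append")
  .start()

// The fast stream from Kafka (broker and topic are placeholders).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")

// Join the stream against the in-memory table by name; `enriched` still needs
// its own writeStream ... start() to run.
val enriched = events.join(spark.table("lookup"), Seq("key"))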

2. If you cannot define a streaming query on the slow-moving data due to the unavailability of a connector for your streaming data source, then you can always define a batch Dataframe and register it as a view, and then run a background thread that periodically creates a new Dataframe with updated data and re-registers it as a view with the same name. Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated.
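
And a compressed sketch of this second approach (again with placeholder JDBC and Kafka settings; the periodic re-run of refreshView below can be driven by a simple background thread or Future):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("view-refresh-sketch").getOrCreate()

// Batch snapshot of the slow-moving table, registered under a fixed view name.
// A background thread re-runs refreshView() periodically, in the same SparkSession
// as the streaming query.
def refreshView(): Unit =
  spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "lookup_table")
    .load()
    .createOrReplaceTempView("lookup")

refreshView()

// Streaming side: Kafka stream joined with the view (broker/topic are placeholders).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS payload")

val enriched = events.join(spark.table("lookup"), Seq("key"))

enriched.writeStream
  .format("parquet")
  .option("path", "/out/enriched")
  .option("checkpointLocation", "/chk/enriched")
  .start()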

Hope this helps.


On Thu, Apr 20, 2017 at 1:30 PM, Hemanth Gudela <he...@qvantel.com>
wrote:

> Thanks Georg for your reply.
>
> But I’m not sure if I fully understood your answer.
>
>
>
> If you meant to join two streams (one reading Kafka, and another reading
> database table), then I think it’s not possible, because
>
> 1.       According to documentation
> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>,
> Structured streaming does not support database as a streaming source
>
> 2.       Joining between two streams is not possible yet.
>
>
>
> Regards,
>
> Hemanth
>
>
>
> *From: *Georg Heiler <ge...@gmail.com>
> *Date: *Thursday, 20 April 2017 at 23.11
> *To: *Hemanth Gudela <he...@qvantel.com>, "user@spark.apache.org"
> <us...@spark.apache.org>
> *Subject: *Re: Spark structured streaming: Is it possible to periodically
> refresh static data frame?
>
>
>
> What about treating the static data as a (slow) stream as well?
>
>
>
> Hemanth Gudela <he...@qvantel.com> schrieb am Do., 20. Apr. 2017
> um 22:09 Uhr:
>
> Hello,
>
>
>
> I am working on a use case where there is a need to join streaming data
> frame with a static data frame.
>
> The streaming data frame continuously gets data from Kafka topics, whereas
> static data frame fetches data from a database table.
>
>
>
> However, as the underlying database table is getting updated often, I must
> somehow manage to refresh my static data frame periodically to get the
> latest information from underlying database table.
>
>
>
> My questions:
>
> 1.       Is it possible to periodically refresh static data frame?
>
> 2.       If refreshing static data frame is not possible, is there a
> mechanism to automatically stop & restarting spark structured streaming
> job, so that every time the job restarts, the static data frame gets
> updated with latest information from underlying database table.
>
> 3.       If 1) and 2) are not possible, please suggest alternatives to
> achieve my requirement described above.
>
>
>
> Thanks,
>
> Hemanth
>
>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Hemanth Gudela <he...@qvantel.com>.
Thanks Georg for your reply.
But I’m not sure if I fully understood your answer.

If you meant to join two streams (one reading Kafka, and another reading a database table), then I think it’s not possible, because:

1.       According to the documentation<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#data-sources>, Structured Streaming does not support a database as a streaming source

2.       Joining two streams is not possible yet.

Regards,
Hemanth

From: Georg Heiler <ge...@gmail.com>
Date: Thursday, 20 April 2017 at 23.11
To: Hemanth Gudela <he...@qvantel.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

What about treating the static data as a (slow) stream as well?

Hemanth Gudela <he...@qvantel.com>> schrieb am Do., 20. Apr. 2017 um 22:09 Uhr:
Hello,

I am working on a use case where there is a need to join streaming data frame with a static data frame.
The streaming data frame continuously gets data from Kafka topics, whereas static data frame fetches data from a database table.

However, as the underlying database table is getting updated often, I must somehow manage to refresh my static data frame periodically to get the latest information from underlying database table.

My questions:

1.       Is it possible to periodically refresh static data frame?

2.       If refreshing static data frame is not possible, is there a mechanism to automatically stop & restarting spark structured streaming job, so that every time the job restarts, the static data frame gets updated with latest information from underlying database table.

3.       If 1) and 2) are not possible, please suggest alternatives to achieve my requirement described above.

Thanks,
Hemanth

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

Posted by Georg Heiler <ge...@gmail.com>.
What about treating the static data as a (slow) stream as well?

Hemanth Gudela <he...@qvantel.com> schrieb am Do., 20. Apr. 2017
um 22:09 Uhr:

> Hello,
>
>
>
> I am working on a use case where there is a need to join streaming data
> frame with a static data frame.
>
> The streaming data frame continuously gets data from Kafka topics, whereas
> static data frame fetches data from a database table.
>
>
>
> However, as the underlying database table is getting updated often, I must
> somehow manage to refresh my static data frame periodically to get the
> latest information from underlying database table.
>
>
>
> My questions:
>
> 1.       Is it possible to periodically refresh static data frame?
>
> 2.       If refreshing static data frame is not possible, is there a
> mechanism to automatically stop & restarting spark structured streaming
> job, so that every time the job restarts, the static data frame gets
> updated with latest information from underlying database table.
>
> 3.       If 1) and 2) are not possible, please suggest alternatives to
> achieve my requirement described above.
>
>
>
> Thanks,
>
> Hemanth
>