Posted to user@spark.apache.org by satyajit vegesna <sa...@gmail.com> on 2017/12/12 00:59:10 UTC

Joining streaming data with static table data.

Hi All,

I am working on a real-time reporting project and I have a question about a
structured streaming job that is going to stream records from a particular
table and will have to join them to an existing table.

Stream ----> query/join to another DF/DS ---> update the Stream data record.

Now I am not sure how to approach the mid layer (the query/join to another
DF/DS): should I create a DF from spark.read.format("jdbc"), or stream the
reference table and maintain the data in a memory sink, or is there a
better way to do it?

I would like to know if anyone has faced a similar scenario and has any
suggestions on how to go ahead.

Regards,
Satyajit.

Re: Joining streaming data with static table data.

Posted by Vikash Pareek <vi...@gmail.com>.
Hi Satyajit,

For the query/join part there are a couple of approaches (a sketch of the
second follows this list):
1. Create a DataFrame from each incoming streaming batch (which is actually
an RDD) and join it with your reference data (coming from the existing
table).
2. Use Structured Streaming, where every batch carries a schema (you can
think of it as a stream of DataFrames).
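
Roughly, approach 2 can look like this in Scala. The Kafka source, JDBC
connection options, table names, and the customer_id join key are all
illustrative assumptions, not details from the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StreamStaticJoin")
  .getOrCreate()

// Reference data: loaded once as a static DataFrame (placeholder options).
val referenceDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/reporting")
  .option("dbtable", "dim_customer")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// Streaming source: a Kafka topic is assumed here.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "orders")
  .load()
  .selectExpr("CAST(value AS STRING) AS customer_id") // simplified parsing

// Stream-static join: every micro-batch is joined against referenceDf.
val enriched = streamDf.join(referenceDf, Seq("customer_id"))

enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()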

While joining with the reference data: if it is static, load it once and
persist it; if it is dynamic, keep refreshing it at a regular interval (see
the sketch below).
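
If the reference data changes over time, one pattern (again just a sketch
with placeholder names, assuming the SparkSession `spark` from the previous
sketch) is to hold the cached copy in a variable and swap in a fresh load
on a schedule; this fits approach 1, where the join plan is rebuilt for
every batch:

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.sql.DataFrame

def loadReference(): DataFrame = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/reporting") // placeholder
  .option("dbtable", "dim_customer")                        // placeholder
  .load()

@volatile var referenceDf = loadReference().cache()
referenceDf.count() // materialize the cache up front

// Swap in a fresh copy every hour and drop the stale one.
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  def run(): Unit = {
    val fresh = loadReference().cache()
    fresh.count()          // warm the new copy before publishing it
    val stale = referenceDf
    referenceDf = fresh    // batches built after this see the new data
    stale.unpersist()
  }
}, 1, 1, TimeUnit.HOURS)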


Best Regards,
Vikash Pareek



Re: Joining streaming data with static table data.

Posted by Rishi Mishra <rm...@snappydata.io>.
You can do a join between a streaming dataset and a static dataset. I would
prefer your first approach, but the problem with that approach is
performance: unless you cache the dataset, every join query you fire will
fetch the latest records from the table over JDBC.
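
For example, caching the static side once (connection details below are
placeholders) keeps each micro-batch join on the in-memory copy instead of
re-querying the database:

val referenceDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/reporting")
  .option("dbtable", "dim_customer")
  .load()
  .cache()

// Force materialization once; without cache(), every batch's join
// would trigger a fresh JDBC scan of the table.
referenceDf.count()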



Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Tue, Dec 12, 2017 at 6:29 AM, satyajit vegesna <
satyajit.apasprk@gmail.com> wrote:

> Hi All,
>
> I am working on a real-time reporting project and I have a question about
> a structured streaming job that is going to stream records from a
> particular table and will have to join them to an existing table.
>
> Stream ----> query/join to another DF/DS ---> update the Stream data
> record.
>
> Now I am not sure how to approach the mid layer (the query/join to
> another DF/DS): should I create a DF from spark.read.format("jdbc"), or
> stream the reference table and maintain the data in a memory sink, or is
> there a better way to do it?
>
> I would like to know if anyone has faced a similar scenario and has any
> suggestions on how to go ahead.
>
> Regards,
> Satyajit.
>
