Posted to user@spark.apache.org by Bryan Jeffrey <br...@gmail.com> on 2019/11/18 14:20:46 UTC

Structured Streaming & Enrichment Broadcasts

Hello.

We're running applications on Spark Streaming and are beginning work to
move to Structured Streaming.  One of our key scenarios is to look up
values from an external data source for each record in an incoming
stream.  In Spark Streaming we currently read the external data,
broadcast it, and then look up values from the broadcast.  The broadcast
is refreshed periodically, with the need to refresh evaluated on each
batch (in a foreachRDD).  The broadcasts are somewhat large (~1M
records), and each stream we do the lookup(s) for runs at ~6M records /
second.
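
For concreteness, a minimal Scala sketch of this pattern; the
EnrichmentCache helper, the 15-minute TTL, and the JDBC source are
illustrative assumptions, not our actual code:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.sql.SparkSession

    // Illustrative enrichment record; the real schema is not shown here.
    case class Enrichment(key: String, value: String)

    object EnrichmentCache {
      @volatile private var bc: Broadcast[Map[String, Enrichment]] = _
      @volatile private var loadedAt = 0L
      private val ttlMs = 15 * 60 * 1000L  // assumed refresh interval

      // Called once per batch on the driver; re-reads and re-broadcasts
      // the external data only when the TTL has expired.
      def get(spark: SparkSession): Broadcast[Map[String, Enrichment]] =
        synchronized {
          val now = System.currentTimeMillis()
          if (bc == null || now - loadedAt > ttlMs) {
            if (bc != null) bc.unpersist(blocking = false)
            val rows = spark.read.format("jdbc")     // assumed source
              .option("url", "jdbc:postgresql://host/db")
              .option("dbtable", "enrichment")
              .load()
              .collect()
            val data = rows
              .map(r => r.getString(0) -> Enrichment(r.getString(0), r.getString(1)))
              .toMap
            bc = spark.sparkContext.broadcast(data)
            loadedAt = now
          }
          bc
        }
    }

    // In the DStream job: the refresh decision runs on the driver each
    // batch; the lookup reads the broadcast value on the executors.
    // stream.foreachRDD { rdd =>
    //   val b = EnrichmentCache.get(spark)
    //   rdd.map { case (k, v) => (k, v, b.value.get(k)) }
    // }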

While we could conceivably continue this pattern in Structured
Streaming on Spark 2.4.x with 'foreachBatch', based on my reading of
the documentation this seems like a bit of an anti-pattern in
Structured Streaming.
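
Carrying the pattern over via foreachBatch would look roughly like this
sketch, reusing the hypothetical EnrichmentCache above; inputStream and
the parquet sink are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    inputStream.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Per-batch refresh check on the driver, as with foreachRDD.
        val b = EnrichmentCache.get(batch.sparkSession)
        val lookupUdf = udf { key: String =>
          b.value.get(key).map(_.value).orNull
        }
        batch.withColumn("enriched", lookupUdf(col("key")))
          .write.mode("append").parquet("/tmp/enriched")  // stand-in sink
      }
      .start()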

So I am looking for advice: what mechanism would you suggest for
periodically reading an external data source and doing a fast lookup
against a streaming input?  One option appears to be a broadcast left
outer join, but in the past this mechanism has been harder to
performance-tune than an explicit broadcast and lookup.
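
For reference, a sketch of what that join option would look like; the
Kafka source and parquet path are placeholders, and how often the
static side would be re-read is exactly the open question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder.getOrCreate()

    // Streaming side; source and topic are placeholders.
    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key",
                  "CAST(value AS STRING) AS value")

    // Static lookup side; path is a placeholder.
    val lookup = spark.read.parquet("/data/enrichment")

    // Broadcast hint on the static side; left outer keeps unmatched
    // stream rows.
    val enriched = stream.join(broadcast(lookup), Seq("key"), "left_outer")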

Regards,

Bryan Jeffrey

Re: Structured Streaming & Enrichment Broadcasts

Posted by hahaha sc <sh...@gmail.com>.
I have a scenario similar to yours, and we use a UDF to do exactly
that.  However, you need to get the value of a broadcast variable from
inside the UDF, and it's not clear how to achieve that.  Does anyone
know?
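
What that might look like in Scala (a sketch; the broadcast data is
illustrative): the Broadcast handle is captured in the UDF's closure,
and .value is read on the executors:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Illustrative lookup data; a real job would load this externally.
    val bc = spark.sparkContext.broadcast(Map("a" -> "alpha"))

    // The Broadcast handle (not its value) is serialized with the
    // closure; bc.value reads the broadcast data on the executors.
    val lookupUdf = udf { key: String => bc.value.getOrElse(key, null) }

    val df = Seq("a", "b").toDF("key")
    df.withColumn("enriched", lookupUdf(col("key"))).show()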

On Tue, Nov 19, 2019 at 12:23 PM, Burak Yavuz <br...@gmail.com> wrote:

> If you store the data that you're going to broadcast as a Delta table (see
> delta.io) and perform a stream-batch join (where your Delta table is the
> batch side), it will auto-update once the table receives any updates.
>
> Best,
> Burak

Re: Structured Streaming & Enrichment Broadcasts

Posted by Burak Yavuz <br...@gmail.com>.
If you store the data that you're going to broadcast as a Delta table (see
delta.io) and perform a stream-batch join (where your Delta table is the
batch side), it will auto-update once the table receives any updates.

Best,
Burak
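
A minimal sketch of that suggestion, assuming the delta-core package is
on the classpath; the Kafka source and table path are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    // Streaming side; source and topic are placeholders.
    val stream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key",
                  "CAST(value AS STRING) AS value")

    // Delta table as the batch side of the stream-batch join; per the
    // message above, updates to the table are picked up as the stream runs.
    val lookup = spark.read.format("delta").load("/data/enrichment")

    stream.join(lookup, Seq("key"), "left_outer")
      .writeStream.format("console").start().awaitTermination()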
