Posted to dev@spark.apache.org by kant kodali <ka...@gmail.com> on 2022/08/04 10:35:29 UTC

Re: structured streaming join of streaming dataframe with static dataframe performance

I suspect this is because the incoming rows, when joined with the static frame, can have a varying degree of skew over time, in which case it is probably better to choose different join strategies at run time. But if you know your Dataset, I believe you can just do a broadcast join in your case!

It's been a while since I used Spark, so you might want to wait for a more authoritative response.

> On Jul 17, 2022, at 5:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> 
> i was surprised to find out that if a streaming dataframe is joined with a static dataframe, the static dataframe is re-shuffled for every microbatch, which adds considerable overhead.
> 
> wouldn't it make more sense to re-use the shuffle files?
> 
> or, if that is not possible, load the static dataframe into the state store? this would turn the join into a lookup (in RocksDB).
> 
> 
> CONFIDENTIALITY NOTICE: This electronic communication and any files transmitted with it are confidential, privileged and intended solely for the use of the individual or entity to whom they are addressed. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution (electronic or otherwise) or forwarding of, or the taking of any action in reliance on the contents of this transmission is strictly prohibited. Please notify the sender immediately by e-mail if you have received this email by mistake and delete this email from your system.
> 
> Is it necessary to print this email? If you care about the environment like we do, please refrain from printing emails. It helps to keep the environment forested and litter-free.
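
The lookup idea in the quoted question can be illustrated outside Spark. What follows is a minimal plain-Python sketch, not Spark's state store API: the static side is indexed once by join key, and each microbatch is then joined by key lookup instead of re-shuffling the static side. The function names (build_index, join_microbatch) and the sample rows are hypothetical, and an in-memory dict stands in for a RocksDB-backed store.

```python
from collections import defaultdict


def build_index(static_rows, key):
    """Index the static side once: join key -> list of rows with that key."""
    index = defaultdict(list)
    for row in static_rows:
        index[row[key]].append(row)
    return index


def join_microbatch(batch_rows, index, key):
    """Inner-join one microbatch against the prebuilt index by key lookup."""
    joined = []
    for row in batch_rows:
        # .get() avoids creating empty entries for keys with no match.
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical data, for illustration only.
static_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
index = build_index(static_rows, "id")  # built once, reused for every batch

microbatches = [
    [{"id": 1, "v": 10}],
    [{"id": 2, "v": 20}, {"id": 3, "v": 30}],
]
for batch in microbatches:
    print(join_microbatch(batch, index, "id"))
```

The cost of preparing the static side is paid once in build_index; each microbatch then only pays for its own lookups, which is the behaviour the question is asking for.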

Re: structured streaming join of streaming dataframe with static dataframe performance

Posted by Koert Kuipers <ko...@tresata.com>.

that's a good point about skewness and potential join optimizations. i will
try turning off all skew optimizations and forcing a sort-merge join, and
see if it then re-uses shuffle files on the static side.
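
For anyone wanting to repeat that experiment, here is a hedged sketch of the settings typically involved. The configuration keys are existing Spark SQL options; the helper apply_conf and the choice of exactly these four keys are assumptions for illustration, and nothing here is claimed to make Spark re-use shuffle files.

```python
# Spark SQL settings that disable skew handling and automatic broadcast,
# nudging the planner toward a sort-merge join.
SORT_MERGE_CONF = {
    "spark.sql.adaptive.enabled": "false",           # adaptive query execution off
    "spark.sql.adaptive.skewJoin.enabled": "false",  # no skew-join splitting
    "spark.sql.autoBroadcastJoinThreshold": "-1",    # never auto-broadcast
    "spark.sql.join.preferSortMergeJoin": "true",    # prefer sort-merge join
}


def apply_conf(spark, conf=SORT_MERGE_CONF):
    """Apply each setting to a session via spark.conf.set (hypothetical helper)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

With a live SparkSession named spark, this would be called as apply_conf(spark) before starting the streaming query; the physical plan of a microbatch can then be checked to confirm that a sort-merge join was actually chosen.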

unfortunately my static side is too large to broadcast. the streaming side
can indeed be broadcast, and this is faster, but it still forces spark to
re-read (but not shuffle) the static side on every join, which in my case
still adds minutes to every microbatch.

