
Posted to dev@drill.apache.org by Timothy Chen <tn...@gmail.com> on 2013/11/15 19:04:40 UTC

Broadcast exchange


Sent from my iPhone

Begin forwarded message:

> From: Timothy Chen <tn...@gmail.com>
> Date: November 15, 2013 at 1:32:26 AM HST
> To: Jacques Nadeau <jn...@maprtech.com>
> Subject: Broadcast exchange
> 
> Hey Jacques,
> 
> I'm working on the broadcast sender, and I remember we had some chat about the scenario and you mentioned buffering.
> 
> Right now I have the simplest version: a single broadcast exchange in which a single broadcast sender sends every incoming batch to all the random receivers.
> 
> Couple questions I have now:
> 
> 1) the scenario where I thought broadcast is very useful is when we join two data sources and one of them is a small fact table, so we copy that table to all join nodes for fast joining. However, I don't remember seeing any test where we have multiple scans joining?
> 
> The physical plan for that would be two scan nodes, one using a broadcast exchange and the other perhaps a hash-to-random exchange, and a join over the two exchanges that sends to screen.
> 
> Does that plan work? I feel like I might be missing something.
> 
> 2) my impl doesn't do any buffering and just creates a fragment writable batch for each destination from the incoming batch buffers and sends it in a for loop. I was wondering what buffering you thought was necessary when you mentioned it earlier?
> 
> Btw I'll soon experience the same hangout time pain as the China folks, as I am going to be in Taiwan for 2 weeks from this Sunday :)
> 
> Tim
> 

Re: Broadcast exchange

Posted by Jacques Nadeau <ja...@apache.org>.

> > 1) the scenario where I thought broadcast is very useful is when we join
> two data sources and one of them is a small fact table, so we copy that
> table to all join nodes for fast joining. However, I don't remember seeing
> any test where we have multiple scans joining?
>

I believe the join test tests this.  Your understanding is correct.


> >
> > The physical plan for that would be two scan nodes, one using a broadcast
> exchange and the other perhaps a hash-to-random exchange, and a join over
> the two exchanges that sends to screen.
> >
> > Does that plan work? I feel like I might be missing something.
>

Yes, that plan is possible. However, a simpler plan would be:

     store
       |
     join
     /  \
     |  scan
     |
random-receiver
     |
broadcast-sender
     |
   scan

This plan avoids transmitting the larger table across the wire
unnecessarily.  (In this example, a more likely but more complex plan might
have an aggregation after the join.)

> 2) my impl doesn't do any buffering and just creates a fragment writable
> batch for each destination from the incoming batch buffers and sends it in
> a for loop. I was wondering what buffering you thought was necessary
> when you mentioned it earlier?
>

You don't need to do buffering.  That is handled by the rest of the
execution engine and Steven's incoming buffer spooling feature.
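
As a rough illustration of the unbuffered approach discussed above, the
sender can simply loop over its destinations for each incoming batch.  This
is only a sketch: Receiver, RecordingReceiver, BroadcastSender and the use of
String as a stand-in for a record batch are all hypothetical names made up
for the example, not Drill's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

class BroadcastSketch {

    /** Hypothetical stand-in for a destination fragment's receiving endpoint. */
    interface Receiver {
        void accept(String batch);
    }

    /** Records what it receives so the fan-out can be inspected. */
    static final class RecordingReceiver implements Receiver {
        final List<String> received = new ArrayList<>();

        @Override
        public void accept(String batch) {
            received.add(batch);
        }
    }

    /** Sends every incoming batch to every receiver; keeps no buffer of its own. */
    static final class BroadcastSender {
        private final List<? extends Receiver> receivers;

        BroadcastSender(List<? extends Receiver> receivers) {
            this.receivers = receivers;
        }

        void send(String batch) {
            // One send per destination, in a plain for loop.
            for (Receiver receiver : receivers) {
                receiver.accept(batch);
            }
        }
    }

    /** Broadcasts the batches to receiverCount receivers and returns what each one saw. */
    static List<List<String>> broadcast(List<String> batches, int receiverCount) {
        List<RecordingReceiver> receivers = new ArrayList<>();
        for (int i = 0; i < receiverCount; i++) {
            receivers.add(new RecordingReceiver());
        }
        BroadcastSender sender = new BroadcastSender(receivers);
        for (String batch : batches) {
            sender.send(batch);
        }
        List<List<String>> result = new ArrayList<>();
        for (RecordingReceiver receiver : receivers) {
            result.add(receiver.received);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> batches = new ArrayList<>();
        batches.add("batch-1");
        batches.add("batch-2");
        // Each of the three receivers ends up with both batches, in order.
        System.out.println(broadcast(batches, 3));
    }
}
```

In this sketch any queuing or spooling would live on the receiving side, in
line with the point above that the sender itself does not need to buffer.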


> >
> > Btw I'll soon experience the same hangout time pain as the China folks, as
> I am going to be in Taiwan for 2 weeks from this Sunday :)
>

Good times!