You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Milinda Pathirage <mp...@umail.iu.edu> on 2016/01/06 18:02:38 UTC

What is the best way to determine input to a join is a relation

Hi Devs,

I need to figure out which input is the relation in a stream-to-relation
join. Is there a good way to do this rather than traversing the inputs
until a scan operator is found.

Thanks
Milinda

-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: What is the best way to determine input to a join is a relation

Posted by Milinda Pathirage <mp...@umail.iu.edu>.

Hi Julian,

I think "timeliness of sort" is a interesting concept that can use for
streaming query planning and optimizations generally. For example, in Samza
we are thinking of timeouts to handle out-of-order arrivals for hopping and
tumbling windows. My plan was to let user configure the timeout at Samza
level somehow . But having this information during query planning time will
be useful for generating physical plans.

On the other hand, my original question was motivated by how I am trying to
implement stream-to-relaiton joins in SamzaSQL. Samza supports a special
type of streams called 'bootstrapped streams' which can be used to populate
local state needed for  a streaming task before start performing actual
computation. Samza wait for stream task to finish consuming the bootstrap
stream before delivering other streams to the streaming task.

My plan was to utilize this capability to load the finite relation (only
relevant partitions assuming changelog of the relaiton is partitioned on
the same attributes as input streams) to the streaming task's local state
to perform a hash-join (in case of equi-join). For this purpose I need to
know exactly which stream corresponds to a relation changelog during query
planning to generate code to handle this situation.

So in this situation, I think there should be a way to exactly figure out
which input (input tree) is the relation. But timeliness of sorting will
not be 100% accurate always (in case of streams with zero sort-delay).

Thanks
Milinda

On Wed, Jan 6, 2016 at 1:31 PM, Julian Hyde <jh...@apache.org> wrote:

> I’m going to rephrase your question to “What is the best way to determine
> whether a relation is a table?” I like to use the term “relation” to cover
> both finite relations (tables) and infinite relations (streams).
>
> I think a useful trait would be “timeliness of sort”. It sounds very
> abstract but bear with me.
>
> Suppose I have a relation R and I want to do a stream join on its rowtime
> column. I might ask the following question:
>
>   What is the maximum amount that any row might be delayed when I execute
> ‘select * from R order by rowtime’, waiting for all applicable rows?
>
> Some cases:
> * If R is a table (i.e. a relation that does not have new data arriving
> during the execution of the query), we have all the data already, so the
> delay is zero. If R is a stream sorted by rowtime, the delay is zero (or
> perhaps a small value t that represents the network latency).
> * If R is a stream based on a log files, and the rowtime column is based
> on the wallclock time of those servers, and the log files are pushed every
> hour, then the delay is 1 hour.
> * If R is a stream sorted by some other column, then the delay is infinite.
>
> Measuring the sort-delay is more general than saying whether a relation is
> sorted. For a stream, the sort-delay is zero. If the sort-delay is
> infinity, or too high, we don’t consider it to be a viable plan.
>
> As I said, this is a very abstract concept. I’ve not fully thought through
> the idea, and it doesn’t directly answer your question, but I think we can
> develop it to achieve your goal, which is to optimize stream-table joins.
> Do you think that it is a useful concept worth developing?
>
> Julian
>
>
> > On Jan 6, 2016, at 9:55 AM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > It seems like it should be a trait. The one problem you'll hit though is
> > how to propagate that trait. Maybe Julian has some good ideas.
> > On Jan 6, 2016 9:02 AM, "Milinda Pathirage" <mp...@umail.iu.edu>
> wrote:
> >
> >> Hi Devs,
> >>
> >> I need to figure out which input is the relation in a stream-to-relation
> >> join. Is there a good way to do this rather than traversing the inputs
> >> until a scan operator is found.
> >>
> >> Thanks
> >> Milinda
> >>
> >> --
> >> Milinda Pathirage
> >>
> >> PhD Student | Research Assistant
> >> School of Informatics and Computing | Data to Insight Center
> >> Indiana University
> >>
> >> twitter: milindalakmal
> >> skype: milinda.pathirage
> >> blog: http://milinda.pathirage.org
> >>
>
>

-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: What is the best way to determine input to a join is a relation

Posted by Julian Hyde <jh...@apache.org>.

I’m going to rephrase your question to “What is the best way to determine whether a relation is a table?” I like to use the term “relation” to cover both finite relations (tables) and infinite relations (streams).

I think a useful trait would be “timeliness of sort”. It sounds very abstract but bear with me.

Suppose I have a relation R and I want to do a stream join on its rowtime column. I might ask the following question:

  What is the maximum amount that any row might be delayed when I execute ‘select * from R order by rowtime’, waiting for all applicable rows?

Some cases:
* If R is a table (i.e. a relation that does not have new data arriving during the execution of the query), we have all the data already, so the delay is zero. If R is a stream sorted by rowtime, the delay is zero (or perhaps a small value t that represents the network latency).
* If R is a stream based on a log files, and the rowtime column is based on the wallclock time of those servers, and the log files are pushed every hour, then the delay is 1 hour.
* If R is a stream sorted by some other column, then the delay is infinite.

Measuring the sort-delay is more general than saying whether a relation is sorted. For a stream, the sort-delay is zero. If the sort-delay is infinity, or too high, we don’t consider it to be a viable plan.

As I said, this is a very abstract concept. I’ve not fully thought through the idea, and it doesn’t directly answer your question, but I think we can develop it to achieve your goal, which is to optimize stream-table joins. Do you think that it is a useful concept worth developing?

Julian

> On Jan 6, 2016, at 9:55 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> It seems like it should be a trait. The one problem you'll hit though is
> how to propagate that trait. Maybe Julian has some good ideas.
> On Jan 6, 2016 9:02 AM, "Milinda Pathirage" <mp...@umail.iu.edu> wrote:
> 
>> Hi Devs,
>> 
>> I need to figure out which input is the relation in a stream-to-relation
>> join. Is there a good way to do this rather than traversing the inputs
>> until a scan operator is found.
>> 
>> Thanks
>> Milinda
>> 
>> --
>> Milinda Pathirage
>> 
>> PhD Student | Research Assistant
>> School of Informatics and Computing | Data to Insight Center
>> Indiana University
>> 
>> twitter: milindalakmal
>> skype: milinda.pathirage
>> blog: http://milinda.pathirage.org
>>

Re: What is the best way to determine input to a join is a relation

Posted by Jacques Nadeau <ja...@apache.org>.

It seems like it should be a trait. The one problem you'll hit though is
how to propagate that trait. Maybe Julian has some good ideas.
On Jan 6, 2016 9:02 AM, "Milinda Pathirage" <mp...@umail.iu.edu> wrote:

> Hi Devs,
>
> I need to figure out which input is the relation in a stream-to-relation
> join. Is there a good way to do this rather than traversing the inputs
> until a scan operator is found.
>
> Thanks
> Milinda
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
>