Posted to user@spark.apache.org by Christos Kozanitis <ko...@berkeley.edu> on 2014/07/26 12:32:17 UTC

SparkSQL extensions

Hello

I was wondering, could you point me to the modules I would need to
update in order to add extra functionality to Spark SQL?

I am thinking of implementing a region-join operator, and I guess the
implementation itself belongs in joins.scala, but what else do I need to
modify?

thanks
Christos

Re: SparkSQL extensions

Posted by Michael Armbrust <mi...@databricks.com>.
Ah, I understand now.  That sounds pretty useful and is something we would
currently plan very inefficiently.


Re: SparkSQL extensions

Posted by Christos Kozanitis <ko...@berkeley.edu>.
Thanks Michael for the recommendations. The region-join (or I could call
it a range-join or interval-join) that I have in mind should join the
entries of two tables on inequality predicates. For example, if table
A(col1 int, col2 int) contains the entries (1,4) and (10,12), and table
B(c1 int, c2 int) contains the entries (3,6) and (43,23), then the
region-join of A and B on (col1 < c2 and c1 < col2) should produce the
tuple (1,4,3,6).
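To pin down the semantics, here is a quick stand-alone Scala check of that
predicate over the sample rows (plain Scala, no Spark; the names here are
just mine for illustration):

object RegionJoinSemantics extends App {
  val tableA = Seq((1, 4), (10, 12))   // A(col1, col2)
  val tableB = Seq((3, 6), (43, 23))   // B(c1, c2)

  // two intervals overlap when each one starts before the other one ends
  val joined = for {
    (col1, col2) <- tableA
    (c1, c2)     <- tableB
    if col1 < c2 && c1 < col2
  } yield (col1, col2, c1, c2)

  println(joined)   // List((1,4,3,6))
}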

Does it make sense?

There is also a JIRA on a similar topic for Hive:
https://issues.apache.org/jira/browse/HIVE-556

ADAM also implements region-joins here:
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/RegionJoin.scala

I was thinking of providing an improved version of the "partitionAndJoin"
method from the ADAM implementation above.
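The core idea there, as I read it, is to bin intervals by position so that
only co-binned pairs are ever compared. A rough RDD sketch of that idea
(the bin size, names, and the dedup trick are placeholders of mine, not
ADAM's actual code):

import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark
import org.apache.spark.rdd.RDD

object BinnedRegionJoin {
  type Interval = (Int, Int)   // (start, end), nonnegative coordinates

  val binSize = 1000

  // every fixed-size bin that an interval touches
  def bins(iv: Interval): Seq[Int] = (iv._1 / binSize) to (iv._2 / binSize)

  def join(left: RDD[Interval], right: RDD[Interval]): RDD[(Interval, Interval)] = {
    val l = left.flatMap(iv => bins(iv).map(b => (b, iv)))
    val r = right.flatMap(iv => bins(iv).map(b => (b, iv)))
    l.join(r).flatMap { case (bin, (a, c)) =>
      // emit a pair only from the bin holding the later of the two starts,
      // so overlapping pairs that share several bins come out exactly once
      val owner = math.max(a._1, c._1) / binSize
      if (bin == owner && a._1 < c._2 && c._1 < a._2) Some((a, c)) else None
    }
  }
}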



Re: SparkSQL extensions

Posted by Michael Armbrust <mi...@databricks.com>.
A very simple example of adding a new operator to Spark SQL:
https://github.com/apache/spark/pull/1366
An example of adding a new type of join to Spark SQL:
https://github.com/apache/spark/pull/837

Basically, you will need to add a new physical operator that inherits from
SparkPlan and a Strategy that causes the query planner to select it.  Maybe
you can explain a little more what you mean by region-join?  If it's only a
different algorithm, and not a logically different type of join, then you
will not need to make some of the logical modifications that the second PR
did.
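Very roughly, the pieces fit together as below.  These are toy stand-ins,
not Spark's real classes (see the two PRs above for the actual code); the
real contract is just that a strategy maps a logical plan to a sequence of
physical plans, returning Nil when it does not apply:

sealed trait LogicalPlan
case object TableScan extends LogicalPlan
case class JoinNode(left: LogicalPlan, right: LogicalPlan,
                    condition: String) extends LogicalPlan

trait SparkPlan
case class RegionJoinExec(condition: String) extends SparkPlan
case class CartesianExec(condition: String) extends SparkPlan

// a strategy either claims a plan or lets the planner try the next one
trait Strategy extends (LogicalPlan => Seq[SparkPlan])

object RegionJoinStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case JoinNode(_, _, cond) if cond.contains("<") =>
      RegionJoinExec(cond) :: Nil   // a range predicate: claim it
    case _ => Nil                   // not ours; fall through
  }
}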

Often the hardest part here is going to be figuring out when to use one
join over another.  Right now the rules are pretty straightforward: the
joins that are picked first are the most efficient, but they only handle
certain cases (inner joins with equality predicates).  When that is not the
case, the planner falls back on slower but more general operators.  If
there are more subtle trade-offs involved, then we may need to wait until
we have more statistics to help us make the choice.
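Concretely, the planner just walks an ordered list of strategies and takes
the first non-empty answer, which is why the narrow-but-fast strategies sit
in front of the general fallbacks.  Continuing the toy sketch above:

object FallbackStrategy extends Strategy {
  // stands in for the slow, fully general operators
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case JoinNode(_, _, cond) => CartesianExec(cond) :: Nil
    case _                    => Nil
  }
}

object ToyPlanner {
  // ordered: specialized strategies first, general fallbacks last
  val strategies = List(RegionJoinStrategy, FallbackStrategy)

  def plan(p: LogicalPlan): SparkPlan =
    strategies.iterator.map(_(p)).find(_.nonEmpty).map(_.head)
      .getOrElse(sys.error("no strategy matched"))
}

// ToyPlanner.plan(JoinNode(TableScan, TableScan, "col1 < c2 and c1 < col2"))
// picks RegionJoinExec; an equality-only predicate would fall through to
// CartesianExec.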

I'd suggest opening a JIRA and proposing a design before going too far.

Michael

