Posted to dev@ignite.apache.org by Николай Ижиков <ni...@gmail.com> on 2017/11/28 17:54:19 UTC

Optimization of SQL queries from Spark Data Frame to Ignite

Hello, guys.

I have implemented basic support of the Spark Data Frame API [1], [2] for Ignite.
Spark provides an API for a custom strategy to optimize queries going from Spark to the underlying data source (Ignite).

The goals of the optimization (obvious, but just to be on the same page):
Minimize data transfer between Spark and Ignite.
Speed up query execution.

I see 3 ways to optimize queries:

	1. *Join Reduce* If one makes a query that joins two or more Ignite tables, we should pass the whole join to Ignite and transfer only the join result back to Spark.
	To implement this, we have to extend the current implementation with a new RelationProvider that can generate all kinds of joins for two or more tables.
	We should also add tests.
	The open question is: how should the join result be partitioned?


	2. *Order by* If one makes a query to an Ignite table with an ORDER BY clause, we can execute the sorting on the Ignite side.
	But it seems that Spark currently has no way to be told that partitions are already sorted.


	3. *Key filter* If one makes a query with `WHERE key = XXX` or `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions and query only the partitions that store the given key values (see the sketch after this list).
	Is this kind of optimization already built into Ignite, or should I implement it myself?
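
A minimal sketch of the idea in item 3, using Ignite's public affinity API; the cache name, config path and keys here are made up for illustration, and the real integration would do this while building Spark partitions:

    import org.apache.ignite.Ignition

    object KeyFilterSketch {
      def main(args: Array[String]): Unit = {
        // Assumes a node config file and an existing cache named "personCache".
        val ignite = Ignition.start("example-ignite.xml")
        val affinity = ignite.affinity[Integer]("personCache")

        // WHERE key IN (1, 2, 3): map each key to its partition and dedupe,
        // so only the partitions that can hold these keys are scanned.
        val keys = Seq(1, 2, 3).map(k => Integer.valueOf(k))
        val partitions = keys.map(k => affinity.partition(k)).distinct

        println(s"Scan only partitions: ${partitions.mkString(", ")}")
      }
    }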

Maybe there are other ways to make queries run faster?

[1] https://spark.apache.org/docs/latest/sql-programming-guide.html
[2] https://github.com/apache/ignite/pull/2742

Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Николай Ижиков <ni...@gmail.com>.
Valentin,

> process the AST generated by Spark and convert it to Ignite SQL...
> Does it make sense to you?

Yes.
I think it is a great approach.

Let's implement such a feature as the second step of the Data Frame integration.

-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Valentin Kulichenko <va...@gmail.com>.
Great! Let me know if you need any assistance and/or intermediate review.

-Val


Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Николай Ижиков <ni...@gmail.com>.
Valentin,

> Can you please create a separate ticket for the strategy implementation then?

Done.

https://issues.apache.org/jira/browse/IGNITE-7077

> Any idea how long it will take?

I think it will take 2-4 weeks to implement such a strategy.
I'll try my best to have a ready-to-review PR before the end of the year.



Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay,

Can you please create a separate ticket for the strategy implementation
then? Any idea how long it will take?

As for querying a partition, both SqlQuery and SqlFieldsQuery allow you to
specify the set of partitions to work with (see the setPartitions method). I
think that should be enough.
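
A short example of the API mentioned above; the cache name, table and partition numbers are made up for illustration:

    import scala.collection.JavaConverters._

    import org.apache.ignite.Ignition
    import org.apache.ignite.cache.query.SqlFieldsQuery

    object PartitionQueryExample {
      def main(args: Array[String]): Unit = {
        val ignite = Ignition.start() // node with default configuration

        // Run the query only on an explicit set of partitions.
        val qry = new SqlFieldsQuery("SELECT name FROM Person WHERE age > ?")
          .setArgs(Integer.valueOf(18))
          .setPartitions(0, 5, 7)

        val cache = ignite.cache[AnyRef, AnyRef]("personCache")
        cache.query(qry).getAll.asScala.foreach(println)
      }
    }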

-Val


Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Hi Nikolay,

No, it is not possible to get this info through the public API, nor do we plan
to expose it. See IGNITE-4509 and commit *fbf0e353* to get a better
understanding of how this was implemented.

Vladimir.


Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Vladimir.

> partition pruning is already implemented in Ignite, so there is no need
> to do this on your own.

Spark works with partitioned data sets.
A custom data source (Ignite) is required to provide partition information to
Spark.

Can I get information about pruned partitions through some public API?
Is there a plan or ticket to implement such an API?



-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Vladimir Ozerov <vo...@gridgain.com>.
Nikolay,

Regarding p.3: partition pruning is already implemented in Ignite, so
there is no need to do this on your own.


Re: Optimization of SQL queries from Spark Data Frame to Ignite

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay,

A custom strategy allows us to fully process the AST generated by Spark and
convert it to Ignite SQL, so there will be no execution on the Spark side at
all. This is what we are trying to achieve here. Basically, one will be
able to use the DataFrame API to execute queries directly on Ignite. Does it
make sense to you?

I would recommend taking a look at the MemSQL implementation, which does
similar stuff: https://github.com/memsql/memsql-spark-connector

Note that this approach will work only if all relations included in the AST
are Ignite tables. Otherwise, the strategy should return Nil (an empty plan
list) so that Spark falls back to its regular mode. Ignite will be used as a
regular data source in this case, and it's probably possible to implement
some optimizations there as well. However, I have never investigated this and
it seems like a separate discussion.
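
A minimal sketch of such a strategy, assuming hypothetical names (IgniteRelation, toIgnitePlan) for the connector pieces; the real implementation may differ:

    import org.apache.spark.sql.Strategy
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan
    import org.apache.spark.sql.execution.datasources.LogicalRelation

    object IgniteStrategySketch extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
        if (allRelationsAreIgnite(plan))
          toIgnitePlan(plan) // translate the whole AST to one Ignite SQL query
        else
          Nil // mixed plan: fall back to Spark's regular strategies
      }

      // True when every leaf of the logical plan is backed by an Ignite relation.
      private def allRelationsAreIgnite(plan: LogicalPlan): Boolean =
        plan.collectLeaves().forall {
          case lr: LogicalRelation => lr.relation.isInstanceOf[IgniteRelation]
          case _                   => false
        }

      // Placeholder: a real implementation would generate Ignite SQL here and
      // wrap it in a physical node that streams rows back from Ignite.
      private def toIgnitePlan(plan: LogicalPlan): Seq[SparkPlan] = Nil

      // Hypothetical marker trait implemented by the connector's relations.
      trait IgniteRelation
    }

One would register it through Spark's experimental hook, e.g.
spark.experimental.extraStrategies = Seq(IgniteStrategySketch).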

-Val
