Posted to dev@ignite.apache.org by Valentin Kulichenko <va...@gmail.com> on 2017/12/29 21:22:49 UTC

Spark data frames integration merged

Igniters,

Great news! We completed and merged the first part of the integration with Spark
data frames [1]. It contains an implementation of a Spark data source that
allows using the DataFrame API to query Ignite data, as well as joining it with
other data frames originating from different sources.

The next planned steps are the following:
- Implement a custom execution strategy to avoid transferring data from
Ignite to Spark when possible [2]. This should give a serious performance
improvement in cases when only Ignite tables participate in a query.
- Implement the ability to save a data frame into Ignite via the
DataFrameWriter API [3].

[1] https://issues.apache.org/jira/browse/IGNITE-3084
[2] https://issues.apache.org/jira/browse/IGNITE-7077
[3] https://issues.apache.org/jira/browse/IGNITE-7337

Nikolay Izhikov, thanks for the contribution and for all the hard work!

-Val

Re: Spark data frames integration merged

Posted by Nikolay Izhikov <ni...@gmail.com>.

Hello, Denis.

> Nikolay, could you document the feature before the release [1]?

Yes, I can.
I will document this feature in the next few days.


On Fri, 29/12/2017 at 13:37 -0800, Denis Magda wrote:
> Great news,
> 
> Thanks Nikolay and Val!
> 
> Nikolay, could you document the feature before the release [1]? I’ve
> granted you required permission.
> 
> More on the doc process can be found here [2].
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-7345
> [2] https://cwiki.apache.org/confluence/display/IGNITE/How+to+Document
> 
> —
> Denis
> 
> > On Dec 29, 2017, at 1:22 PM, Valentin Kulichenko <valentin.kulichenko@gmail.com> wrote:
> > 
> > Igniters,
> > 
> > Great news! We completed and merged first part of integration with
> > Spark data frames [1]. It contains implementation of Spark data
> > source which allows to use DataFrame API to query Ignite data, as
> > well as join it with other data frames originated from different
> > sources.
> > 
> > Next planned steps are the following:
> > - Implement custom execution strategy to avoid transferring data
> > from Ignite to Spark when possible [2]. This should give serious
> > performance improvement in cases when only Ignite tables
> > participate in a query.
> > - Implement ability to save a data frame into Ignite via
> > DataFrameWrite API [3].
> > 
> > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > [2] https://issues.apache.org/jira/browse/IGNITE-7077
> > [3] https://issues.apache.org/jira/browse/IGNITE-7337
> > 
> > Nikolay Izhikov, thanks for the contribution and for all the hard
> > work!
> > 
> > -Val
> 
>

Re: Spark data frames integration merged

Posted by Denis Magda <dm...@apache.org>.

Great news,

Thanks Nikolay and Val!

Nikolay, could you document the feature before the release [1]? I’ve granted you the required permission.

More on the doc process can be found here [2].

[1] https://issues.apache.org/jira/browse/IGNITE-7345
[2] https://cwiki.apache.org/confluence/display/IGNITE/How+to+Document

—
Denis

> On Dec 29, 2017, at 1:22 PM, Valentin Kulichenko <va...@gmail.com> wrote:
> 
> Igniters,
> 
> Great news! We completed and merged first part of integration with Spark data frames [1]. It contains implementation of Spark data source which allows to use DataFrame API to query Ignite data, as well as join it with other data frames originated from different sources.
> 
> Next planned steps are the following:
> - Implement custom execution strategy to avoid transferring data from Ignite to Spark when possible [2]. This should give serious performance improvement in cases when only Ignite tables participate in a query.
> - Implement ability to save a data frame into Ignite via DataFrameWrite API [3].
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-3084
> [2] https://issues.apache.org/jira/browse/IGNITE-7077
> [3] https://issues.apache.org/jira/browse/IGNITE-7337
> 
> Nikolay Izhikov, thanks for the contribution and for all the hard work!
> 
> -Val

Re: Spark data frames integration merged

Posted by Nikolay Izhikov <ni...@gmail.com>.

Hello, guys.

Currently `getPreferredLocations` is implemented in
`IgniteRDD -> IgniteAbstractRDD`.

But the DataFrame implementation uses
`IgniteSQLDataFrameRDD -> IgniteSqlRDD -> IgniteAbstractRDD`,

where `->` means "extends".

So, for now, `getPreferredLocations` isn't implemented for an Ignite
DataFrame.

Please take a look at [1] and [2].

I think it is a very good idea to implement `getPreferredLocations` inside
`IgniteSQLDataFrameRDD` or even inside `IgniteAbstractRDD`.

Can someone file a ticket? Or I can do it myself.


[1] - https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/ignite/spark/IgniteRDD.scala#L50

[2] - https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/ignite/spark/impl/IgniteSQLDataFrameRDD.scala#L40
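
To make the inheritance point in this message concrete, here is a toy model in plain Python. It is not the actual Scala code: the class names mirror the ones mentioned above, while the method bodies, the `partition_hosts` mapping, and the host names are hypothetical and only show where the locality override lives.

```python
class IgniteAbstractRDD:
    """Toy stand-in for the shared base class."""

    def __init__(self, partition_hosts):
        # Hypothetical mapping: partition index -> hosts holding that partition.
        self.partition_hosts = partition_hosts

    def get_preferred_locations(self, partition):
        # Default behaviour, modelled on Spark's RDD: no locality preference.
        return []


class IgniteRDD(IgniteAbstractRDD):
    def get_preferred_locations(self, partition):
        # Only this branch of the hierarchy reports affinity information.
        return self.partition_hosts.get(partition, [])


class IgniteSqlRDD(IgniteAbstractRDD):
    pass


class IgniteSQLDataFrameRDD(IgniteSqlRDD):
    # Inherits the empty default because nothing on this branch overrides it.
    pass


hosts = {0: ["node-a"], 1: ["node-b"]}
plain_rdd = IgniteRDD(hosts)
data_frame_rdd = IgniteSQLDataFrameRDD(hosts)

print(plain_rdd.get_preferred_locations(0))       # locality hint available
print(data_frame_rdd.get_preferred_locations(0))  # falls back to the default
```

In this model, moving the override from `IgniteRDD` into `IgniteAbstractRDD` would make both branches report hosts, which is the change proposed above.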


On Wed, 03/01/2018 at 15:35 -0800, Valentin Kulichenko wrote:
> Revin,
> 
> I doubt IgniteRDD#getPrefferredLocations has any affect on data
> frames, but this is an interesting point. Nikolay, as a developer of
> this functionality, can you please comment on this?
> 
> -Val
> 
> On Wed, Jan 3, 2018 at 1:22 PM, Revin Chalil <rc...@expedia.com>
> wrote:
> > Thanks Val for the info on indexes with DF. Do you know if adding
> > index / affinitykeys on the cache help with the join, when the
> > IgniteRDD is joined with a spark DF? The below from docs say that
> > 
> > “IgniteRDD also provides affinity information to Spark via
> > getPrefferredLocations method so that RDD computations use data
> > locality.”
> > 
> > I was wondering, if the affinitykey on the cache can be utilized in
> > the spark join?
> > 
> > 
> > On 1/3/18, 12:27 PM, "vkulichenko" <va...@gmail.com>
> > wrote:
> > 
> >     Indexes would not be used during joins, at least in current
> > implementation.
> >     Current integration is implemented as a regular Spark data
> > source which
> >     provides each relation separately. Spark then performs join by
> > itself, so
> >     Ignite indexes do not help.
> > 
> >     The easiest way to get binaries would be to use a nightly build
> > [1] , but it
> >     seems to be broken for some reason (latest is from May 31). I
> > guess the only
> >     option at the moment is to build from source.
> > 
> >     [1]
> >     https://builds.apache.org/view/H-L/view/Ignite/job/Ignite-nightly/lastSuccessfulBuild/
> > 
> >     -Val
> > 
> > 
> > 
> >     --
> >     Sent from: http://apache-ignite-users.70518.x6.nabble.com/
> > 
> > 
> 
>

Re: Spark data frames integration merged

Posted by Valentin Kulichenko <va...@gmail.com>.

Revin,

I doubt IgniteRDD#getPreferredLocations has any effect on data frames, but
this is an interesting point. Nikolay, as the developer of this
functionality, can you please comment on this?

-Val

On Wed, Jan 3, 2018 at 1:22 PM, Revin Chalil <rc...@expedia.com> wrote:

> Thanks Val for the info on indexes with DF. Do you know if adding index /
> affinitykeys on the cache help with the join, when the IgniteRDD is joined
> with a spark DF? The below from docs say that
>
> “IgniteRDD also provides affinity information to Spark via
> getPrefferredLocations method so that RDD computations use data locality.”
>
> I was wondering, if the affinitykey on the cache can be utilized in the
> spark join?
>
>
> On 1/3/18, 12:27 PM, "vkulichenko" <va...@gmail.com> wrote:
>
>     Indexes would not be used during joins, at least in current
> implementation.
>     Current integration is implemented as a regular Spark data source which
>     provides each relation separately. Spark then performs join by itself,
> so
>     Ignite indexes do not help.
>
>     The easiest way to get binaries would be to use a nightly build [1] ,
> but it
>     seems to be broken for some reason (latest is from May 31). I guess
> the only
>     option at the moment is to build from source.
>
>     [1]
>     https://builds.apache.org/view/H-L/view/Ignite/job/Ignite-nightly/lastSuccessfulBuild/
>
>     -Val
>
>
>
>     --
>     Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
>
>

Re: Spark data frames integration merged

Posted by Revin Chalil <rc...@expedia.com>.

Thanks, Val, for the info on indexes with DF. Do you know if adding indexes / affinity keys on the cache helps with the join when the IgniteRDD is joined with a Spark DF? The docs say:

“IgniteRDD also provides affinity information to Spark via the getPreferredLocations method so that RDD computations use data locality.”

I was wondering if the affinity key on the cache can be utilized in the Spark join?


On 1/3/18, 12:27 PM, "vkulichenko" <va...@gmail.com> wrote:

    Indexes would not be used during joins, at least in current implementation.
    Current integration is implemented as a regular Spark data source which
    provides each relation separately. Spark then performs join by itself, so
    Ignite indexes do not help.
    
    The easiest way to get binaries would be to use a nightly build [1] , but it
    seems to be broken for some reason (latest is from May 31). I guess the only
    option at the moment is to build from source.
    
    [1]
    https://builds.apache.org/view/H-L/view/Ignite/job/Ignite-nightly/lastSuccessfulBuild/
    
    -Val
    
    
    
    --
    Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Spark data frames integration merged

Posted by vkulichenko <va...@gmail.com>.

Indexes would not be used during joins, at least in the current implementation.
The current integration is implemented as a regular Spark data source which
provides each relation separately. Spark then performs the join by itself, so
Ignite indexes do not help.
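
A toy illustration of that point in plain Python, not Spark or Ignite code: each relation is fetched from the source on its own, and the join runs afterwards on the engine side, so the source never sees the join condition. The `Source` class, the table names, and the rows are made up for the sketch.

```python
class Source:
    """Hypothetical data source that can only hand out whole relations."""

    def __init__(self, tables):
        self.tables = tables
        self.rows_transferred = 0

    def scan(self, name):
        # Every row of the relation leaves the source; no join predicate
        # is passed in, so an index on the join column cannot be used.
        rows = list(self.tables[name])
        self.rows_transferred += len(rows)
        return rows


def engine_side_join(left, right, key):
    # Hash join performed by the query engine after both scans.
    by_key = {}
    for right_row in right:
        by_key.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in by_key.get(left_row[key], [])
    ]


source = Source({
    "person": [{"city_id": 1, "name": "Ann"}, {"city_id": 2, "name": "Bob"}],
    "city": [{"city_id": 1, "city": "Oslo"}, {"city_id": 3, "city": "Rome"}],
})

joined = engine_side_join(source.scan("person"), source.scan("city"), "city_id")
print(joined)
print(source.rows_transferred)
```

All rows of both toy tables are transferred even though only one pair matches, which is why an index on the join column inside the source makes no difference in this setup.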

The easiest way to get binaries would be to use a nightly build [1], but it
seems to be broken for some reason (the latest is from May 31). I guess the only
option at the moment is to build from source.

[1]
https://builds.apache.org/view/H-L/view/Ignite/job/Ignite-nightly/lastSuccessfulBuild/

-Val



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Spark data frames integration merged

Posted by Denis Magda <dm...@apache.org>.

Revin,

Excellent, please keep me in the loop and let me know once you reach the next milestone and are ready for production. Use cases like this help to spread the word about Ignite, which is really helpful!

—
Denis

> On Jan 5, 2018, at 12:27 AM, Revin Chalil <rc...@expedia.com> wrote:
> 
> Thanks Denis. I watched your recent 2 webinars and they were very helpful.
>  
> I can definitely create a page explaining how (currently three) ignite shared-rdd caches are shared across multiple spark streaming apps for data enrichment here at expedia, once the solution is stabilized. We are not in production yet. I have enabled native persistence and had some hiccups during our testing but is looking better today.
>  
> We are currently working to optimize the join between incremental data and shared-rdd dataframe in spark as there are several spark Apps and the total memory is limited. This part does not have much to do with Ignite but mostly spark optimization, I believe. We do load the entire ignite-cache (~50GB each) into spark executors and the cache is trimmed based on the business rules, daily.
>  
> We will keep in touch and thanks again for all the great work and help everyone.
>  
> Revin
>  
> From: Denis Magda <dm...@apache.org>
> Date: Thursday, January 4, 2018 at 12:34 PM
> To: Revin Chalil <rc...@expedia.com>
> Cc: "dev@ignite.apache.org" <de...@ignite.apache.org>
> Subject: Re: Spark data frames integration merged
>  
> Revin, 
>  
> As a side note, do you have a public article published or any other relevant material that explains how Ignite is used at Expedia?
>  
> You would help the community out a lot if such information is referenced from this page:
> https://ignite.apache.org/provenusecases.html
>  
> —
> Denis
>  
> On Jan 3, 2018, at 11:24 AM, Revin Chalil <rchalil@expedia.com> wrote:
>  
> Thank you and this is great news. 
> 
> We currently use the Ignite cache as a Reference dataset RDD in Spark, convert it into a spark DataFrame and then join this DF with the incoming-data DF. I hope we can change this 3 step process to a single step with the Spark DF integration. If so, would index / affinitykeys on the join columns help with performance? We currently do not have them defined on the Reference dataset. Are there examples available joining ignite DF with Spark DF? Also, what is the best way to get the latest executables with the IGNITE-3084 included? Thanks again. 
> 
> 
> On 12/29/17, 10:34 PM, "Nikolay Izhikov" <nizhikov.dev@gmail.com> wrote:
> 
>    Thank you, guys.
> 
>    Val, thanks for all reviews, advices and patience.
> 
>    Anton, thanks for ignite wisdom you share with me.
> 
>    Looking forward for next issues :)
> 
>    P.S Happy New Year for all Ignite community!
> 
>    On Fri, 29/12/2017 at 13:22 -0800, Valentin Kulichenko wrote:
> 
> Igniters,
> 
> Great news! We completed and merged first part of integration with
> Spark data frames [1]. It contains implementation of Spark data
> source which allows to use DataFrame API to query Ignite data, as
> well as join it with other data frames originated from different
> sources.
> 
> Next planned steps are the following:
> - Implement custom execution strategy to avoid transferring data from
> Ignite to Spark when possible [2]. This should give serious
> performance improvement in cases when only Ignite tables participate
> in a query.
> - Implement ability to save a data frame into Ignite via
> DataFrameWrite API [3].
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-3084
> [2] https://issues.apache.org/jira/browse/IGNITE-7077
> [3] https://issues.apache.org/jira/browse/IGNITE-7337
> 
> Nikolay Izhikov, thanks for the contribution and for all the hard
> work!
> 
> -Val
>  
> 
>

Re: Spark data frames integration merged

Posted by Revin Chalil <rc...@expedia.com>.

Thanks, Denis. I watched your two recent webinars and they were very helpful.

I can definitely create a page explaining how (currently three) Ignite shared-RDD caches are shared across multiple Spark streaming apps for data enrichment here at Expedia, once the solution is stabilized. We are not in production yet. I have enabled native persistence and had some hiccups during our testing, but it is looking better today.

We are currently working to optimize the join between incremental data and the shared-RDD dataframe in Spark, as there are several Spark apps and the total memory is limited. This part does not have much to do with Ignite; it is mostly Spark optimization, I believe. We do load the entire Ignite cache (~50GB each) into Spark executors, and the cache is trimmed daily based on the business rules.

We will keep in touch, and thanks again, everyone, for all the great work and help.

Revin

From: Denis Magda <dm...@apache.org>
Date: Thursday, January 4, 2018 at 12:34 PM
To: Revin Chalil <rc...@expedia.com>
Cc: "dev@ignite.apache.org" <de...@ignite.apache.org>
Subject: Re: Spark data frames integration merged

Revin,

As a side note, do you have a public article published or any other relevant material that explains how Ignite is used at Expedia?

You would help the community out a lot if such information is referenced from this page:
https://ignite.apache.org/provenusecases.html

—
Denis

On Jan 3, 2018, at 11:24 AM, Revin Chalil <rc...@expedia.com> wrote:

Thank you and this is great news.

We currently use the Ignite cache as a Reference dataset RDD in Spark, convert it into a spark DataFrame and then join this DF with the incoming-data DF. I hope we can change this 3 step process to a single step with the Spark DF integration. If so, would index / affinitykeys on the join columns help with performance? We currently do not have them defined on the Reference dataset. Are there examples available joining ignite DF with Spark DF? Also, what is the best way to get the latest executables with the IGNITE-3084 included? Thanks again.


On 12/29/17, 10:34 PM, "Nikolay Izhikov" <ni...@gmail.com> wrote:

   Thank you, guys.

   Val, thanks for all reviews, advices and patience.

   Anton, thanks for ignite wisdom you share with me.

   Looking forward for next issues :)

   P.S Happy New Year for all Ignite community!

   On Fri, 29/12/2017 at 13:22 -0800, Valentin Kulichenko wrote:

Igniters,

Great news! We completed and merged first part of integration with
Spark data frames [1]. It contains implementation of Spark data
source which allows to use DataFrame API to query Ignite data, as
well as join it with other data frames originated from different
sources.

Next planned steps are the following:
- Implement custom execution strategy to avoid transferring data from
Ignite to Spark when possible [2]. This should give serious
performance improvement in cases when only Ignite tables participate
in a query.
- Implement ability to save a data frame into Ignite via
DataFrameWrite API [3].

[1] https://issues.apache.org/jira/browse/IGNITE-3084
[2] https://issues.apache.org/jira/browse/IGNITE-7077
[3] https://issues.apache.org/jira/browse/IGNITE-7337

Nikolay Izhikov, thanks for the contribution and for all the hard
work!

-Val

Re: Spark data frames integration merged

Posted by Denis Magda <dm...@apache.org>.

Revin,

As a side note, do you have a public article published or any other relevant material that explains how Ignite is used at Expedia?

You would help the community out a lot if such information is referenced from this page:
https://ignite.apache.org/provenusecases.html

—
Denis

> On Jan 3, 2018, at 11:24 AM, Revin Chalil <rc...@expedia.com> wrote:
> 
> Thank you and this is great news. 
> 
> We currently use the Ignite cache as a Reference dataset RDD in Spark, convert it into a spark DataFrame and then join this DF with the incoming-data DF. I hope we can change this 3 step process to a single step with the Spark DF integration. If so, would index / affinitykeys on the join columns help with performance? We currently do not have them defined on the Reference dataset. Are there examples available joining ignite DF with Spark DF? Also, what is the best way to get the latest executables with the IGNITE-3084 included? Thanks again. 
> 
> 
> On 12/29/17, 10:34 PM, "Nikolay Izhikov" <ni...@gmail.com> wrote:
> 
>    Thank you, guys.
> 
>    Val, thanks for all reviews, advices and patience.
> 
>    Anton, thanks for ignite wisdom you share with me.
> 
>    Looking forward for next issues :)
> 
>    P.S Happy New Year for all Ignite community!
> 
>    On Fri, 29/12/2017 at 13:22 -0800, Valentin Kulichenko wrote:
>> Igniters,
>> 
>> Great news! We completed and merged first part of integration with
>> Spark data frames [1]. It contains implementation of Spark data
>> source which allows to use DataFrame API to query Ignite data, as
>> well as join it with other data frames originated from different
>> sources.
>> 
>> Next planned steps are the following:
>> - Implement custom execution strategy to avoid transferring data from
>> Ignite to Spark when possible [2]. This should give serious
>> performance improvement in cases when only Ignite tables participate
>> in a query.
>> - Implement ability to save a data frame into Ignite via
>> DataFrameWrite API [3].
>> 
>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>> [2] https://issues.apache.org/jira/browse/IGNITE-7077
>> [3] https://issues.apache.org/jira/browse/IGNITE-7337
>> 
>> Nikolay Izhikov, thanks for the contribution and for all the hard
>> work!
>> 
>> -Val
> 
>

Re: Spark data frames integration merged

Posted by Revin Chalil <rc...@expedia.com>.

Thank you, this is great news.

We currently use the Ignite cache as a reference dataset RDD in Spark, convert it into a Spark DataFrame, and then join this DF with the incoming-data DF. I hope we can change this three-step process to a single step with the Spark DF integration. If so, would indexes / affinity keys on the join columns help with performance? We currently do not have them defined on the reference dataset. Are there examples available of joining an Ignite DF with a Spark DF? Also, what is the best way to get the latest executables with IGNITE-3084 included? Thanks again.

On 12/29/17, 10:34 PM, "Nikolay Izhikov" <ni...@gmail.com> wrote:

    Thank you, guys.

    Val, thanks for all reviews, advices and patience.

    Anton, thanks for ignite wisdom you share with me.

    Looking forward for next issues :)

    P.S Happy New Year for all Ignite community!

    On Fri, 29/12/2017 at 13:22 -0800, Valentin Kulichenko wrote:
    > Igniters,
    > 
    > Great news! We completed and merged first part of integration with
    > Spark data frames [1]. It contains implementation of Spark data
    > source which allows to use DataFrame API to query Ignite data, as
    > well as join it with other data frames originated from different
    > sources.
    > 
    > Next planned steps are the following:
    > - Implement custom execution strategy to avoid transferring data from
    > Ignite to Spark when possible [2]. This should give serious
    > performance improvement in cases when only Ignite tables participate
    > in a query.
    > - Implement ability to save a data frame into Ignite via
    > DataFrameWrite API [3].
    > 
    > [1] https://issues.apache.org/jira/browse/IGNITE-3084
    > [2] https://issues.apache.org/jira/browse/IGNITE-7077
    > [3] https://issues.apache.org/jira/browse/IGNITE-7337
    > 
    > Nikolay Izhikov, thanks for the contribution and for all the hard
    > work!
    > 
    > -Val

Re: Spark data frames integration merged

Posted by Nikolay Izhikov <ni...@gmail.com>.

Thank you, guys.

Val, thanks for all the reviews, advice, and patience.

Anton, thanks for the Ignite wisdom you shared with me.

Looking forward to the next issues :)

P.S. Happy New Year to the whole Ignite community!

On Fri, 29/12/2017 at 13:22 -0800, Valentin Kulichenko wrote:
> Igniters,
> 
> Great news! We completed and merged first part of integration with
> Spark data frames [1]. It contains implementation of Spark data
> source which allows to use DataFrame API to query Ignite data, as
> well as join it with other data frames originated from different
> sources.
> 
> Next planned steps are the following:
> - Implement custom execution strategy to avoid transferring data from
> Ignite to Spark when possible [2]. This should give serious
> performance improvement in cases when only Ignite tables participate
> in a query.
> - Implement ability to save a data frame into Ignite via
> DataFrameWrite API [3].
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-3084
> [2] https://issues.apache.org/jira/browse/IGNITE-7077
> [3] https://issues.apache.org/jira/browse/IGNITE-7337
> 
> Nikolay Izhikov, thanks for the contribution and for all the hard
> work!
> 
> -Val
