Posted to dev@ignite.apache.org by Николай Ижиков <ni...@gmail.com> on 2017/09/25 18:14:37 UTC

Integration of Spark and Ignite. Prototype.

Hello, guys.

Currently, I’m working on integration between Spark and Ignite [1].

For now, I have implemented the following:
    * Ignite DataSource implementation (IgniteRelationProvider).
    * DataFrame support for Ignite SQL tables.
    * IgniteCatalog implementation for transparent resolution of Ignite
SQL tables.

The implementation can be found in PR [2].
It would be great if someone could provide feedback on the prototype.

I made some examples in the PR so you can see how the API is supposed to be
used [3], [4].
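
For illustration, reading an Ignite SQL table through the new DataSource
could look roughly like this (the format name and the option keys below are
illustrative for this sketch, not the final API):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Ignite DataFrame example")
  .master("local")
  .getOrCreate()

// "ignite", "config" and "table" are assumed names for this sketch.
val persons = spark.read
  .format("ignite")
  .option("config", "ignite-config.xml") // path to the Ignite configuration
  .option("table", "person")             // Ignite SQL table to read
  .load()

persons.printSchema()
persons.filter("id >= 2").show()
```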

I need some advice. Can you help me?

1. How should this PR be tested?

Of course, I need to provide some unit tests. But what about scalability
tests, etc.?
Maybe we need some Yardstick benchmark or similar?
What are your thoughts?
Which scenarios should I consider in the first place?

2. Should we provide the Spark Catalog implementation inside the Ignite
codebase?

The current Spark Catalog implementation is based on *internal Spark API*.
The Spark community seems uninterested in making the Catalog API public or
in including an Ignite Catalog in the Spark codebase [5], [6].

*Should we include a Spark internal API implementation inside the Ignite
codebase?*

Or should we consider including the Catalog implementation in some external
module that is created and released outside Ignite? (We could still support
and develop it inside the Ignite community.)

[1] https://issues.apache.org/jira/browse/IGNITE-3084
[2] https://github.com/apache/ignite/pull/2742
[3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
[4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
[5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
[6] https://issues.apache.org/jira/browse/SPARK-17767

--
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Val, thank you.

I fixed the issues and answered the questions from the comments.
Please take a look.

2017-12-13 3:28 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com>:

> Hi Nikolay,
>
> I reviewed the code and left several comments in the ticket [1]. Please
> take a look.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>
> -Val
>
> On Mon, Dec 4, 2017 at 3:03 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
>
> > Denis,
> >
> > Nikolay was doing final changes and TC stabilization. I'm planning to do
> > the final review this week, so hopefully we will merge the code soon.
> >
> > -Val
> >
> > On Mon, Dec 4, 2017 at 1:31 PM, Denis Magda <dm...@apache.org> wrote:
> >
> >> Nikolay, Val,
> >>
> >> Since we agreed to release the feature without the strategy support, can
> >> the current integration see the light of day in the 2.4 release? Please
> >> chime in on this conversation:
> >> http://apache-ignite-developers.2346864.n4.nabble.com/Time-and-scope-for-Apache-Ignite-2-4-td24987.html
> >>
> >> —
> >> Denis
> >>
> >> > On Nov 28, 2017, at 5:42 PM, Valentin Kulichenko <
> >> valentin.kulichenko@gmail.com> wrote:
> >> >
> >> > Denis,
> >> >
> >> > Agree. I will do the final review in the next few days and merge the code.
> >> >
> >> > -Val
> >> >
> >> > On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dm...@apache.org> wrote:
> >> >
> >> >> Guys,
> >> >>
> >> >> Looking into the parallel discussion about the strategy support, I
> >> >> would change my initial stance and support the idea of releasing the
> >> >> integration in its current state. Is the code ready to be merged into
> >> >> the master? Let’s concentrate on this first and handle the strategy
> >> >> support as a separate JIRA task. Agree?
> >> >>
> >> >> —
> >> >> Denis
> >> >>
> >> >>> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
> >> >>> valentin.kulichenko@gmail.com> wrote:
> >> >>>
> >> >>> Nikolay,
> >> >>>
> >> >>> Let's estimate the strategy implementation work, and then decide
> >> >>> whether to merge the code in its current state or not. If anything is
> >> >>> unclear, please start a separate discussion.
> >> >>>
> >> >>> -Val
> >> >>>
> >> >>> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <
> >> >>> nizhikov.dev@gmail.com> wrote:
> >> >>>
> >> >>>> Hello, Val, Denis.
> >> >>>>
> >> >>>>> Personally, I think that we should release the integration only
> >> >>>>> after the strategy is fully supported.
> >> >>>>
> >> >>>> I see two major reasons to propose merging the DataFrame API
> >> >>>> implementation without the custom strategy:
> >> >>>>
> >> >>>> 1. My PR is relatively huge already. From my experience of
> >> >>>> interaction with the Ignite community, the bigger a PR becomes, the
> >> >>>> more committer time is required to review it.
> >> >>>> So I propose to move in smaller, but complete, steps here.
> >> >>>>
> >> >>>> 2. It is not clear to me what exactly "custom strategy and
> >> >>>> optimization" includes.
> >> >>>> It seems that additional discussion is required.
> >> >>>> I think I can put my thoughts on paper and start a discussion right
> >> >>>> after the basic implementation is done.
> >> >>>>
> >> >>>>> Custom strategy implementation is actually very important for this
> >> >>>>> integration.
> >> >>>>
> >> >>>> Understood and fully agreed.
> >> >>>> I'm ready to continue working in that area.
> >> >>>>
> >> >>>> On 23.11.2017 02:15, Denis Magda wrote:
> >> >>>>
> >> >>>>> Val, Nikolay,
> >> >>>>>
> >> >>>>> Personally, I think that we should release the integration only
> >> >>>>> after the strategy is fully supported. Without the strategy we
> >> >>>>> don’t really leverage Ignite’s SQL engine, and we introduce
> >> >>>>> redundant data movement between Ignite and Spark nodes.
> >> >>>>>
> >> >>>>> How big is the effort to support the strategy in terms of the
> >> >>>>> amount of work left? 40%, 60%, 80%?
> >> >>>>>
> >> >>>>> —
> >> >>>>> Denis
> >> >>>>>
> >> >>>>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
> >> >>>>>> valentin.kulichenko@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> Nikolay,
> >> >>>>>>
> >> >>>>>> Custom strategy implementation is actually very important for this
> >> >>>>>> integration. Basically, it will allow creating a SQL query for
> >> >>>>>> Ignite and executing it directly on the cluster. Your current
> >> >>>>>> implementation only adds a new DataSource, which means that Spark
> >> >>>>>> will fetch data into its own memory first, and then do most of the
> >> >>>>>> work (like joins, for example) itself. Does it make sense to you?
> >> >>>>>> Can you please take a look at this and provide your thoughts on
> >> >>>>>> how much development is implied there?
> >> >>>>>>
> >> >>>>>> The current code looks good to me though, and I'm OK if the
> >> >>>>>> strategy is implemented as a next step in the scope of a separate
> >> >>>>>> ticket. I will do the final review early next week and will merge
> >> >>>>>> it if everything is OK.
> >> >>>>>>
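> >> >>>>>> To illustrate the point (table names here are assumed, for
> >> >>>>>> example only): a join of two Ignite-backed frames runs inside
> >> >>>>>> Spark today, while a pushed-down strategy could send a single
> >> >>>>>> query to Ignite instead, roughly:
> >> >>>>>>
> >> >>>>>> ```
> >> >>>>>> SELECT p.name, c.name
> >> >>>>>> FROM person p
> >> >>>>>> JOIN city c ON p.city_id = c.id
> >> >>>>>> ```
> >> >>>>>>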
> >> >>>>>> -Val
> >> >>>>>>
> >> >>>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
> >> >>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>> Hello.
> >> >>>>>>>
> >> >>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >> >>>>>>>> Catalog implementations and what is the difference?
> >> >>>>>>>
> >> >>>>>>> IgniteCatalog removed.
> >> >>>>>>>
> >> >>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
> >> >>>>>>>> to be set manually on SQLContext each time it's created... Is
> >> >>>>>>>> there any way to automate this and improve usability?
> >> >>>>>>>
> >> >>>>>>> IgniteStrategy and IgniteOptimization are removed, as they are
> >> >>>>>>> empty now.
> >> >>>>>>>
> >> >>>>>>>> Actually, I think it makes sense to create a builder similar to
> >> >>>>>>>> SparkSession.builder()...
> >> >>>>>>>
> >> >>>>>>> IgniteBuilder added.
> >> >>>>>>> The syntax looks like:
> >> >>>>>>>
> >> >>>>>>> ```
> >> >>>>>>> val igniteSession = IgniteSparkSession.builder()
> >> >>>>>>>   .appName("Spark Ignite catalog example")
> >> >>>>>>>   .master("local")
> >> >>>>>>>   .config("spark.executor.instances", "2")
> >> >>>>>>>   .igniteConfig(CONFIG)
> >> >>>>>>>   .getOrCreate()
> >> >>>>>>>
> >> >>>>>>> igniteSession.catalog.listTables().show()
> >> >>>>>>> ```
> >> >>>>>>>
> >> >>>>>>> Please see the updated PR: https://github.com/apache/ignite/pull/2742
> >> >>>>>>>
> >> >>>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com>:
> >> >>>>>>>
> >> >>>>>>>> Hello, Valentin.
> >> >>>>>>>>
> >> >>>>>>>> My answers are below.
> >> >>>>>>>> Dmitry, do we need to move the discussion to Jira?
> >> >>>>>>>>
> >> >>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >> >>>>>>>>> codebase?
> >> >>>>>>>>
> >> >>>>>>>> As I mentioned earlier, to implement and override the Spark
> >> >>>>>>>> Catalog one has to use the internal (private) Spark API.
> >> >>>>>>>> So I have to use the package `org.apache.spark.sql.***` to have
> >> >>>>>>>> access to private classes and variables.
> >> >>>>>>>>
> >> >>>>>>>> For example, the SharedState class that stores the link to the
> >> >>>>>>>> ExternalCatalog is declared as `private[sql] class SharedState`,
> >> >>>>>>>> i.e. package private.
> >> >>>>>>>>
> >> >>>>>>>>> Can these classes reside under org.apache.ignite.spark instead?
> >> >>>>>>>>
> >> >>>>>>>> No, as long as we want to have our own implementation of
> >> >>>>>>>> ExternalCatalog.
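> >> >>>>>>>>
> >> >>>>>>>> Roughly, the pattern is the following (a simplified sketch, not
> >> >>>>>>>> the exact PR code):
> >> >>>>>>>>
> >> >>>>>>>> ```
> >> >>>>>>>> // Must live inside org.apache.spark.sql.* — SharedState is
> >> >>>>>>>> // declared private[sql], so it is invisible from anywhere else.
> >> >>>>>>>> package org.apache.spark.sql.ignite
> >> >>>>>>>>
> >> >>>>>>>> import org.apache.spark.SparkContext
> >> >>>>>>>> import org.apache.spark.sql.internal.SharedState
> >> >>>>>>>>
> >> >>>>>>>> class IgniteSharedState(sc: SparkContext) extends SharedState(sc) {
> >> >>>>>>>>   // Here the link to the ExternalCatalog can be redirected to
> >> >>>>>>>>   // IgniteExternalCatalog instead of the built-in one.
> >> >>>>>>>> }
> >> >>>>>>>> ```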
> >> >>>>>>>>
> >> >>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
> >> >>>>>>>>> guess are some kind of config options. Can you describe the
> >> >>>>>>>>> purpose of each of them?
> >> >>>>>>>>
> >> >>>>>>>> I extended the comments for these options.
> >> >>>>>>>> Please see my commit [1] or the PR HEAD.
> >> >>>>>>>>
> >> >>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >> >>>>>>>>> Catalog implementations and what is the difference?
> >> >>>>>>>>
> >> >>>>>>>> Good catch, thank you!
> >> >>>>>>>> After additional research I found that only IgniteExternalCatalog
> >> >>>>>>>> is required.
> >> >>>>>>>> I will update the PR with the IgniteCatalog removal in a few days.
> >> >>>>>>>>
> >> >>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op.
> >> >>>>>>>>> What are our plans on implementing them? Also, what exactly is
> >> >>>>>>>>> planned in IgniteOptimization and what is its purpose?
> >> >>>>>>>>
> >> >>>>>>>> Actually, this is a very good question :)
> >> >>>>>>>> And I need advice from experienced community members here:
> >> >>>>>>>>
> >> >>>>>>>> The purpose of `IgniteOptimization` is to modify the query plan
> >> >>>>>>>> created by Spark.
> >> >>>>>>>> Currently, we have one optimization, described in IGNITE-3084 [2]
> >> >>>>>>>> by you, Valentin :) :
> >> >>>>>>>>
> >> >>>>>>>> “If there are non-Ignite relations in the plan, we should fall
> >> >>>>>>>> back to native Spark strategies“
> >> >>>>>>>>
> >> >>>>>>>> I think we can go a little further and reduce a join of two
> >> >>>>>>>> Ignite-backed Data Frames into a single Ignite SQL query.
> >> >>>>>>>> Currently, this feature is unimplemented.
> >> >>>>>>>>
> >> >>>>>>>> *Do we need it now? Or can we postpone it and concentrate on the
> >> >>>>>>>> basic Data Frame and Catalog implementation?*
> >> >>>>>>>>
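> >> >>>>>>>> As a sketch, the fall-back rule could look roughly like this
> >> >>>>>>>> (IgniteSQLRelation is an assumed name for the PR's relation
> >> >>>>>>>> class; this is illustrative, not final code):
> >> >>>>>>>>
> >> >>>>>>>> ```
> >> >>>>>>>> import org.apache.spark.sql.Strategy
> >> >>>>>>>> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> >> >>>>>>>> import org.apache.spark.sql.execution.SparkPlan
> >> >>>>>>>> import org.apache.spark.sql.execution.datasources.LogicalRelation
> >> >>>>>>>>
> >> >>>>>>>> object IgniteStrategy extends Strategy {
> >> >>>>>>>>   override def apply(plan: LogicalPlan): Seq[SparkPlan] = {
> >> >>>>>>>>     // IgniteSQLRelation stands for the PR's relation class.
> >> >>>>>>>>     val allIgnite = plan.collectLeaves().forall {
> >> >>>>>>>>       case LogicalRelation(_: IgniteSQLRelation, _, _) => true
> >> >>>>>>>>       case _ => false
> >> >>>>>>>>     }
> >> >>>>>>>>
> >> >>>>>>>>     if (allIgnite)
> >> >>>>>>>>       Nil // TODO: emit a single physical Ignite SQL scan here
> >> >>>>>>>>     else
> >> >>>>>>>>       Nil // non-Ignite relations: fall back to Spark strategies
> >> >>>>>>>>   }
> >> >>>>>>>> }
> >> >>>>>>>>
> >> >>>>>>>> // Registered via spark.experimental.extraStrategies = Seq(IgniteStrategy)
> >> >>>>>>>> ```
> >> >>>>>>>>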
> >> >>>>>>>> The purpose of a `Strategy`, as you correctly mentioned in [2],
> >> >>>>>>>> is to transform a LogicalPlan into physical operators.
> >> >>>>>>>> I don’t have ideas on how to use this opportunity, so I think we
> >> >>>>>>>> don’t need IgniteStrategy.
> >> >>>>>>>>
> >> >>>>>>>> Can you or anyone else suggest some optimization strategy to
> >> >>>>>>>> speed up SQL query execution?
> >> >>>>>>>>
> >> >>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
> >> >>>>>>>>> to be set manually on SQLContext each time it's created... Is
> >> >>>>>>>>> there any way to automate this and improve usability?
> >> >>>>>>>>
> >> >>>>>>>> These classes are added to `extraOptimizations` when one uses
> >> >>>>>>>> IgniteSparkSession.
> >> >>>>>>>> As far as I know, there is no way to automatically add these
> >> >>>>>>>> classes to a regular SparkSession.
> >> >>>>>>>>
> >> >>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
> >> >>>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which
> >> >>>>>>>>> is confusing.
> >> >>>>>>>>
> >> >>>>>>>> The DataFrame API is *public* Spark API, so anyone can provide an
> >> >>>>>>>> implementation and plug it into Spark. That’s why
> >> >>>>>>>> IgniteDataFrameExample doesn’t need any Ignite-specific session.
> >> >>>>>>>>
> >> >>>>>>>> The Catalog API is *internal* Spark API. There is no way to plug
> >> >>>>>>>> a custom catalog implementation into Spark [3]. So we have to use
> >> >>>>>>>> `IgniteSparkSession`, which extends the regular SparkSession and
> >> >>>>>>>> overrides the links to the `ExternalCatalog`.
> >> >>>>>>>>
> >> >>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext.
> >> >>>>>>>>> Is it really needed? It looks like we can directly provide the
> >> >>>>>>>>> configuration file; if IgniteSparkSession really requires
> >> >>>>>>>>> IgniteContext, it can create it by itself under the hood.
> >> >>>>>>>>
> >> >>>>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
> >> >>>>>>>> integration for now, so I tried to reuse it here. I like the idea
> >> >>>>>>>> of removing the explicit usage of IgniteContext.
> >> >>>>>>>> I will implement it in a few days.
> >> >>>>>>>>
> >> >>>>>>>>> Actually, I think it makes sense to create a builder similar to
> >> >>>>>>>>> SparkSession.builder()...
> >> >>>>>>>>
> >> >>>>>>>> Great idea! I will implement such a builder in a few days.
> >> >>>>>>>>
> >> >>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for
> >> >>>>>>>>> the case when we don't have SQL configured on the Ignite side?
> >> >>>>>>>>
> >> >>>>>>>> Yes, IgniteCacheRelation is the Data Frame implementation for a
> >> >>>>>>>> key-value cache.
> >> >>>>>>>>
> >> >>>>>>>>> I thought we decided not to support this, no? Or is this
> >> >>>>>>>>> something else?
> >> >>>>>>>>
> >> >>>>>>>> My understanding is the following:
> >> >>>>>>>>
> >> >>>>>>>> 1. We can’t support automatic resolution of key-value caches in
> >> >>>>>>>> the *ExternalCatalog*, because there is no way to reliably detect
> >> >>>>>>>> the key and value classes.
> >> >>>>>>>>
> >> >>>>>>>> 2. We can support key-value caches in the regular Data Frame
> >> >>>>>>>> implementation, because we can require the user to provide the
> >> >>>>>>>> key and value classes explicitly.
> >> >>>>>>>>
> >> >>>>>>>>> 8. Can you clarify the query syntax in
> >> >>>>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >> >>>>>>>>
> >> >>>>>>>> Key-value cache:
> >> >>>>>>>>
> >> >>>>>>>> key - java.lang.Long,
> >> >>>>>>>> value - case class Person(name: String, birthDate: java.util.Date)
> >> >>>>>>>>
> >> >>>>>>>> The schema of the data frame for the cache is:
> >> >>>>>>>>
> >> >>>>>>>> key - long
> >> >>>>>>>> value.name - string
> >> >>>>>>>> value.birthDate - date
> >> >>>>>>>>
> >> >>>>>>>> So we can select data from the cache:
> >> >>>>>>>>
> >> >>>>>>>> SELECT
> >> >>>>>>>>   key, `value.name`, `value.birthDate`
> >> >>>>>>>> FROM
> >> >>>>>>>>   testCache
> >> >>>>>>>> WHERE key >= 2 AND `value.name` LIKE '%0'
> >> >>>>>>>>
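> >> >>>>>>>> A sketch of how such a frame could be obtained (all option keys
> >> >>>>>>>> and the value class are assumptions, not the final API):
> >> >>>>>>>>
> >> >>>>>>>> ```
> >> >>>>>>>> val df = spark.read
> >> >>>>>>>>   .format("ignite")                           // assumed short name
> >> >>>>>>>>   .option("config", "ignite-config.xml")      // assumed option key
> >> >>>>>>>>   .option("cache", "testCache")               // assumed option key
> >> >>>>>>>>   .option("keyClass", "java.lang.Long")       // assumed option key
> >> >>>>>>>>   .option("valueClass", "org.example.Person") // assumed option key
> >> >>>>>>>>   .load()
> >> >>>>>>>>
> >> >>>>>>>> df.createOrReplaceTempView("testCache")
> >> >>>>>>>> spark.sql("SELECT key, `value.name` FROM testCache WHERE key >= 2").show()
> >> >>>>>>>> ```
> >> >>>>>>>>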
> >> >>>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
> >> >>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
> >> >>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On 18.10.2017 04:39, Dmitriy Setrakyan wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Val, thanks for the review. Can I ask you to add the same
> >> >>>>>>>>> comments to the ticket?
> >> >>>>>>>>>
> >> >>>>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
> >> >>>>>>>>> valentin.kulichenko@gmail.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Nikolay, Anton,
> >> >>>>>>>>>>
> >> >>>>>>>>>> I did a high-level review of the code. First of all, impressive
> >> >>>>>>>>>> results! However, I have some questions/comments.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >> >>>>>>>>>> codebase? Can these classes reside under org.apache.ignite.spark
> >> >>>>>>>>>> instead?
> >> >>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
> >> >>>>>>>>>> guess are some kind of config options. Can you describe the
> >> >>>>>>>>>> purpose of each of them?
> >> >>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >> >>>>>>>>>> Catalog implementations and what is the difference?
> >> >>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op.
> >> >>>>>>>>>> What are our plans on implementing them? Also, what exactly is
> >> >>>>>>>>>> planned in IgniteOptimization and what is its purpose?
> >> >>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
> >> >>>>>>>>>> to be set manually on SQLContext each time it's created. This
> >> >>>>>>>>>> seems to be very error prone. Is there any way to automate this
> >> >>>>>>>>>> and improve usability?
> >> >>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
> >> >>>>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which
> >> >>>>>>>>>> is confusing.
> >> >>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext.
> >> >>>>>>>>>> Is it really needed? It looks like we can directly provide the
> >> >>>>>>>>>> configuration file; if IgniteSparkSession really requires
> >> >>>>>>>>>> IgniteContext, it can create it by itself under the hood.
> >> >>>>>>>>>> Actually, I think it makes sense to create a builder similar to
> >> >>>>>>>>>> SparkSession.builder(); it would be good if our APIs here are
> >> >>>>>>>>>> consistent with Spark APIs.
> >> >>>>>>>>>> 8. Can you clarify the query syntax in
> >> >>>>>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >> >>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for
> >> >>>>>>>>>> the case when we don't have SQL configured on the Ignite side?
> >> >>>>>>>>>> I thought we decided not to support this, no? Or is this
> >> >>>>>>>>>> something else?
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks!
> >> >>>>>>>>>>
> >> >>>>>>>>>> -Val
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
> >> >>>>>>>>>> avinogradov@gridgain.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Sounds awesome.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I'll try to review the API & tests this week.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Val,
> >> >>>>>>>>>>> Your review is still required :)
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <
> >> >>>>>>>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>> Yes
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> On Oct 17, 2017, at 2:34 PM, "Anton Vinogradov" <
> >> >>>>>>>>>>>> avinogradov@gridgain.com> wrote:
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>> Nikolay,
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> So, it will be able to start regular Spark and Ignite
> >> >>>>>>>>>>>>> clusters and, using peer classloading via the Spark
> >> >>>>>>>>>>>>> context, perform any DataFrame request, correct?
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <
> >> >>>>>>>>>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Hello, Anton.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> The example you provided is a path to a file that is
> >> >>>>>>>>>>>>>> *local* to the master. These libraries are added to the
> >> >>>>>>>>>>>>>> classpath of each remote node running the submitted job.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Please see the documentation:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
> >> >>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
> >> >>>>>>>>>>>>>>
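> >> >>>>>>>>>>>>>> In short (a minimal sketch; the jar path is only an
> >> >>>>>>>>>>>>>> example): the path is resolved on the driver, and Spark
> >> >>>>>>>>>>>>>> ships the jar to every executor running the job, so the
> >> >>>>>>>>>>>>>> workers need no pre-installed Ignite files.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> ```
> >> >>>>>>>>>>>>>> // Local path on the driver; Spark distributes the jar
> >> >>>>>>>>>>>>>> // to all executors that run tasks of this application.
> >> >>>>>>>>>>>>>> spark.sparkContext.addJar("/opt/libs/ignite-core.jar")
> >> >>>>>>>>>>>>>> ```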
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <
> >> >>>>>>>>>>>>>> avinogradov@gridgain.com>:
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Nikolay,
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> With the Data Frame API implementation there are no
> >> >>>>>>>>>>>>>>>> requirements to have any Ignite files on Spark worker
> >> >>>>>>>>>>>>>>>> nodes.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> What do you mean? I see code like:
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
> >> >>>>>>>>>>>>>>> "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
> >> >>>>>>>>>>>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Hello, guys.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> I have created an example application to run Ignite
> >> >>>>>>>>>>>>>>>> Data Frames on a standalone Spark cluster.
> >> >>>>>>>>>>>>>>>> With the Data Frame API implementation there are no
> >> >>>>>>>>>>>>>>>> requirements to have any Ignite files on Spark worker
> >> >>>>>>>>>>>>>>>> nodes.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> I ran this application on a free dataset: ATP tennis
> >> >>>>>>>>>>>>>>>> match statistics.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
> >> >>>>>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Valentin, did you have a chance to look at my changes?
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> >> >>>>>>>>>>>>>>>> valentin.kulichenko@gmail.com>:
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Hi Nikolay,
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Sorry for the delay on this, got a little swamped
> >> >>>>>>>>>>>>>>>>> lately. I will do my best to review the code this week.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> -Val
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> >> >>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Hello, Valentin.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> Now I think I have done almost all required features.
> >> >>>>>>>>>>>>>>>>>> I want to run some performance tests to ensure my
> >> >>>>>>>>>>>>>>>>>> implementation works properly with a significant
> >> >>>>>>>>>>>>>>>>>> amount of data.
> >> >>>>>>>>>>>>>>>>>> And I definitely need some feedback on my changes.
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <
> >> >>>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com>:
> >> >>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Hello, guys.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
> >> >>>>>>>>>>>>>>>>>>>    * Can be run on JDK 7.
> >> >>>>>>>>>>>>>>>>>>>    * Still supported: 2.1.2 will be released soon.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> 2. The latest Spark version is 2.2.0.
> >> >>>>>>>>>>>>>>>>>>>    * Can be run only on JDK 8+.
> >> >>>>>>>>>>>>>>>>>>>    * Released Jul 11, 2017.
> >> >>>>>>>>>>>>>>>>>>>    * Already supported by huge vendors (Amazon, for example).
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal
> >> >>>>>>>>>>>>>>>>>>> Spark API, so it will take some effort to switch
> >> >>>>>>>>>>>>>>>>>>> between Spark 2.1 and 2.2.
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> >> >>>>>>>>>>>>>>>>>>> valentin.kulichenko@gmail.com>:
> >> >>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> I will review in the next few days.
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> -Val
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <
> >> >>>>>>>>>>>>>>>>>>>> dmagda@apache.org> wrote:
> >> >>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Hello Nikolay,
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is
> >> >>>>>>>>>>>>>>>>>>>>> coming to Ignite.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> Answering your questions.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance
> >> >>>>>>>>>>>>>>>>>>>>> measurements. As a Spark user, I will be curious
> >> >>>>>>>>>>>>>>>>>>>>> to know what’s the point of this integration.
> >> >>>>>>>>>>>>>>>>>>>>> Probably we need to compare the Spark + Ignite and
> >> >>>>>>>>>>>>>>>>>>>>> Spark + Hive or Spark + RDBMS cases.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> 2. If the Spark community is reluctant, let’s
> >> >>>>>>>>>>>>>>>>>>>>> include the module in the ignite-spark integration.
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> —
> >> >>>>>>>>>>>>>>>>>>>>> Denis
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> >> >>>>>>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com> wrote:
> >> >>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Hello, guys.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Currently, I’m working on integration between
> >> >>>>>>>>>>>>>>>>>>>>>> Spark and Ignite [1].
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> For now, I have implemented the following:
> >> >>>>>>>>>>>>>>>>>>>>>>    * Ignite DataSource implementation
> >> >>>>>>>>>>>>>>>>>>>>>>      (IgniteRelationProvider).
> >> >>>>>>>>>>>>>>>>>>>>>>    * DataFrame support for Ignite SQL tables.
> >> >>>>>>>>>>>>>>>>>>>>>>    * IgniteCatalog implementation for transparent
> >> >>>>>>>>>>>>>>>>>>>>>>      resolution of Ignite SQL tables.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> The implementation can be found in PR [2].
> >> >>>>>>>>>>>>>>>>>>>>>> It would be great if someone could provide
> >> >>>>>>>>>>>>>>>>>>>>>> feedback on the prototype.
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> I made some examples in the PR so you can see how
> >> >>>>>>>>>>>>>>>>>>>>>> the API is supposed to be used [3], [4].
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> I need some advice. Can you help me?
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> 1. How should this PR be tested?
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But
> >> >>>>>>>>>>>>>>>>>>>>>> what about scalability tests, etc.?
> >> >>>>>>>>>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
> >> >>>>>>>>>>>>>>>>>>>>>> What are your thoughts?
> >> >>>>>>>>>>>>>>>>>>>>>> Which scenarios should I consider in the first
> >> >>>>>>>>>>>>>>>>>>>>>> place?
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> 2. Should we provide the Spark Catalog
> >> >>>>>>>>>>>>>>>>>>>>>> implementation inside the Ignite codebase?
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> The current Spark Catalog implementation is based
> >> >>>>>>>>>>>>>>>>>>>>>> on *internal Spark API*.
> >> >>>>>>>>>>>>>>>>>>>>>> The Spark community seems uninterested in making
> >> >>>>>>>>>>>>>>>>>>>>>> the Catalog API public or in including an Ignite
> >> >>>>>>>>>>>>>>>>>>>>>> Catalog in the Spark codebase [5], [6].
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> *Should we include a Spark internal API
> >> >>>>>>>>>>>>>>>>>>>>>> implementation inside the Ignite codebase?*
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> Or should we consider including the Catalog
> >> >>>>>>>>>>>>>>>>>>>>>> implementation in some external module that is
> >> >>>>>>>>>>>>>>>>>>>>>> created and released outside Ignite? (We could
> >> >>>>>>>>>>>>>>>>>>>>>> still support and develop it inside the Ignite
> >> >>>>>>>>>>>>>>>>>>>>>> community.)
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
> >> >>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
> >> >>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/2742/files#diff-f4ff509cef3018e221394474775e0905
> >> >>>>>>>>>>>>>>>>>>>>>> [4] https://github.com/apache/ignite/pull/2742/files#diff-f2b670497d81e780dfd5098c5dd8a89c
> >> >>>>>>>>>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Core-Custom-Catalog-Integration-between-Apache-Ignite-and-Apache-Spark-td22452.html
> >> >>>>>>>>>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
> >> >>>>>>>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>>>>>>> --
> >> >>>>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
> >> >>>>>>>>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com



-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Hi Nikolay,

I reviewed the code and left several comments in the ticket [1]. Please
take a look.

[1] https://issues.apache.org/jira/browse/IGNITE-3084

-Val

On Mon, Dec 4, 2017 at 3:03 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Denis,
>
> Nikolay was doing final changes and TC stabilization. I'm planning to do
> final review this week, so hopefully we will merge the code soon.
>
> -Val
>
> On Mon, Dec 4, 2017 at 1:31 PM, Denis Magda <dm...@apache.org> wrote:
>
>> Nikolay, Val,
>>
>> Since we agreed to release the feature without the strategy support, can
>> the current integration meet the world in 2.4 release? Please chime in this
>> conversation:
>> http://apache-ignite-developers.2346864.n4.nabble.com/Time-
>> and-scope-for-Apache-Ignite-2-4-td24987.html
>>
>> —
>> Denis
>>
>> > On Nov 28, 2017, at 5:42 PM, Valentin Kulichenko <
>> valentin.kulichenko@gmail.com> wrote:
>> >
>> > Denis,
>> >
>> > Agree. I will do the final review in next few days and merge the code.
>> >
>> > -Val
>> >
>> > On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dm...@apache.org> wrote:
>> >
>> >> Guys,
>> >>
>> >> Looking into the parallel discussion about the strategy support I would
>> >> change my initial stance and support the idea of releasing the
>> integration
>> >> in its current state. Is the code ready to be merged into the master?
>> Let’s
>> >> concentrate on this first and handle the strategy support as a separate
>> >> JIRA task. Agree?
>> >>
>> >> —
>> >> Denis
>> >>
>> >>> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
>> >> valentin.kulichenko@gmail.com> wrote:
>> >>>
>> >>> Nikolay,
>> >>>
>> >>> Let's estimate the strategy implementation work, and then decide
>> weather
>> >> to
>> >>> merge the code in current state or not. If anything is unclear, please
>> >>> start a separate discussion.
>> >>>
>> >>> -Val
>> >>>
>> >>> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <
>> nizhikov.dev@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Hello, Val, Denis.
>> >>>>
>> >>>>> Personally, I think that we should release the integration only
>> after
>> >>>> the strategy is fully supported.
>> >>>>
>> >>>> I see two major reason to propose merge of DataFrame API
>> implementation
>> >>>> without custom strategy:
>> >>>>
>> >>>> 1. My PR is relatively huge, already. From my experience of
>> interaction
>> >>>> with Ignite community - the bigger PR becomes, the more time of
>> >> commiters
>> >>>> required to review PR.
>> >>>> So, I propose to move smaller, but complete steps here.
>> >>>>
>> >>>> 2. It is not clear for me what exactly includes "custom strategy and
>> >>>> optimization".
>> >>>> Seems, that additional discussion required.
>> >>>> I think, I can put my thoughts on the paper and start discussion
>> right
>> >>>> after basic implementation is done.
>> >>>>
>> >>>>> Custom strategy implementation is actually very important for this
>> >>>> integration.
>> >>>>
>> >>>> Understand and fully agreed.
>> >>>> I'm ready to continue work in that area.
>> >>>>
>> >>>> 23.11.2017 02:15, Denis Magda пишет:
>> >>>>
>> >>>> Val, Nikolay,
>> >>>>>
>> >>>>> Personally, I think that we should release the integration only
>> after
>> >> the
>> >>>>> strategy is fully supported. Without the strategy we don’t really
>> >> leverage
>> >>>>> from Ignite’s SQL engine and introduce redundant data movement
>> between
>> >>>>> Ignite and Spark nodes.
>> >>>>>
>> >>>>> How big is the effort to support the strategy in terms of the
>> amount of
>> >>>>> work left? 40%, 60%, 80%?
>> >>>>>
>> >>>>> —
>> >>>>> Denis
>> >>>>>
>> >>>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
>> >>>>>> valentin.kulichenko@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Nikolay,
>> >>>>>>
>> >>>>>> Custom strategy implementation is actually very important for this
>> >>>>>> integration. Basically, it will allow to create a SQL query for
>> Ignite
>> >>>>>> and
>> >>>>>> execute it directly on the cluster. Your current implementation
>> only
>> >>>>>> adds a
>> >>>>>> new DataSource which means that Spark will fetch data in its own
>> >> memory
>> >>>>>> first, and then do most of the work (like joins for example). Does
>> it
>> >>>>>> make
>> >>>>>> sense to you? Can you please take a look at this and provide your
>> >>>>>> thoughts
>> >>>>>> on how much development is implied there?
>> >>>>>>
>> >>>>>> Current code looks good to me though and I'm OK if the strategy is
>> >>>>>> implemented as a next step in a scope of separate ticket. I will do
>> >> final
>> >>>>>> review early next week and will merge it if everything is OK.
>> >>>>>>
>> >>>>>> -Val
>> >>>>>>
>> >>>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
>> >> nizhikov.dev@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>> Hello.
>> >>>>>>>
>> >>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>> >> Catalog
>> >>>>>>>>
>> >>>>>>> implementations and what is the difference?
>> >>>>>>>
>> >>>>>>> IgniteCatalog removed.
>> >>>>>>>
>> >>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
>> to be
>> >>>>>>>>
>> >>>>>>> set manually on SQLContext each time it's created....Is there any
>> >> way to
>> >>>>>>> automate this and improve usability?
>> >>>>>>>
>> >>>>>>> IgniteStrategy and IgniteOptimization are removed as it empty now.
>> >>>>>>>
>> >>>>>>> Actually, I think it makes sense to create a builder similar to
>> >>>>>>>>
>> >>>>>>> SparkSession.builder()...
>> >>>>>>>
>> >>>>>>> IgniteBuilder added.
>> >>>>>>> Syntax looks like:
>> >>>>>>>
>> >>>>>>> ```
>> >>>>>>> val igniteSession = IgniteSparkSession.builder()
>> >>>>>>>   .appName("Spark Ignite catalog example")
>> >>>>>>>   .master("local")
>> >>>>>>>   .config("spark.executor.instances", "2")
>> >>>>>>>   .igniteConfig(CONFIG)
>> >>>>>>>   .getOrCreate()
>> >>>>>>>
>> >>>>>>> igniteSession.catalog.listTables().show()
>> >>>>>>> ```
>> >>>>>>>
>> >>>>>>> Please, see updated PR - https://github.com/apache/igni
>> te/pull/2742
>> >>>>>>>
>> >>>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com
>> >:
>> >>>>>>>
>> >>>>>>> Hello, Valentin.
>> >>>>>>>>
>> >>>>>>>> My answers is below.
>> >>>>>>>> Dmitry, do we need to move discussion to Jira?
>> >>>>>>>>
>> >>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
>> >> codebase?
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> As I mentioned earlier, to implement and override Spark Catalog
>> one
>> >>>>>>>> have
>> >>>>>>>> to use internal(private) Spark API.
>> >>>>>>>> So I have to use package `org.spark.sql.***` to have access to
>> >> private
>> >>>>>>>> class and variables.
>> >>>>>>>>
>> >>>>>>>> For example, SharedState class that stores link to
>> ExternalCatalog
>> >>>>>>>> declared as `private[sql] class SharedState` - i.e. package
>> private.
>> >>>>>>>>
>> >>>>>>>> Can these classes reside under org.apache.ignite.spark instead?
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> No, as long as we want to have our own implementation of
>> >>>>>>>> ExternalCatalog.
>> >>>>>>>>
>> >>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
>> guess
>> >> are
>> >>>>>>>>>
>> >>>>>>>> some king of config options. Can you describe the purpose of
>> each of
>> >>>>>>>> them?
>> >>>>>>>>
>> >>>>>>>> I extend comments for this options.
>> >>>>>>>> Please, see my commit [1] or PR HEAD:
>> >>>>>>>>
>> >>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>> >> Catalog
>> >>>>>>>>>
>> >>>>>>>> implementations and what is the difference?
>> >>>>>>>>
>> >>>>>>>> Good catch, thank you!
>> >>>>>>>> After additional research I founded that only
>> IgniteExternalCatalog
>> >>>>>>>> required.
>> >>>>>>>> I will update PR with IgniteCatalog remove in a few days.
>> >>>>>>>>
>> >>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op.
>> What
>> >> are
>> >>>>>>>>>
>> >>>>>>>> our plans on implementing them? Also, what exactly is planned in
>> >>>>>>>> IgniteOptimization and what is its purpose?
>> >>>>>>>>
>> >>>>>>>> Actually, this is very good question :)
>> >>>>>>>> And I need advice from experienced community members here:
>> >>>>>>>>
>> >>>>>>>> `IgniteOptimization` purpose is to modify query plan created by
>> >> Spark.
>> >>>>>>>> Currently, we have one optimization described in IGNITE-3084 [2]
>> by
>> >>>>>>>> you,
>> >>>>>>>> Valentin :) :
>> >>>>>>>>
>> >>>>>>>> “If there are non-Ignite relations in the plan, we should fall
>> back
>> >> to
>> >>>>>>>> native Spark strategies“
>> >>>>>>>>
>> >>>>>>>> I think we can go little further and reduce join of two Ignite
>> >> backed
>> >>>>>>>> Data Frames into single Ignite SQL query. Currently, this
>> feature is
>> >>>>>>>> unimplemented.
>> >>>>>>>>
>> >>>>>>>> *Do we need it now? Or we can postpone it and concentrates on
>> basic
>> >>>>>>>> Data
>> >>>>>>>> Frame and Catalog implementation?*
>> >>>>>>>>
>> >>>>>>>> `Strategy` purpose, as you correctly mentioned in [2], is
>> transform
>> >>>>>>>> LogicalPlan into physical operators.
>> >>>>>>>> I don’t have ideas how to use this opportunity. So I think we
>> don’t
>> >>>>>>>> need
>> >>>>>>>> IgniteStrategy.
>> >>>>>>>>
>> >>>>>>>> Can you or anyone else suggest some optimization strategy to
>> speed
>> >> up
>> >>>>>>>> SQL
>> >>>>>>>> query execution?
>> >>>>>>>>
>> >>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
>> to
>> >> be
>> >>>>>>>>>
>> >>>>>>>> set manually on SQLContext each time it's created....Is there any
>> >> way
>> >>>>>>>> to
>> >>>>>>>> automate this and improve usability?
>> >>>>>>>>
>> >>>>>>>> These classes added to `extraOptimizations` when one using
>> >>>>>>>> IgniteSparkSession.
>> >>>>>>>> As far as I know, there is no way to automatically add these
>> >> classes to
>> >>>>>>>> regular SparkSession.
>> >>>>>>>>
>> >>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
>> >>>>>>>>>
>> >>>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
>> >>>>>>>> Confusing.
>> >>>>>>>>
>> >>>>>>>> DataFrame API is *public* Spark API. So anyone can provide
>> >>>>>>>> implementation
>> >>>>>>>> and plug it into Spark. That’s why IgniteDataFrameExample doesn’t
>> >> need
>> >>>>>>>> any
>> >>>>>>>> Ignite specific session.
>> >>>>>>>>
>> >>>>>>>> Catalog API is *internal* Spark API. There is no way to plug
>> custom
>> >>>>>>>> catalog implementation into Spark [3]. So we have to use
>> >>>>>>>> `IgniteSparkSession` that extends regular SparkSession and
>> overrides
>> >>>>>>>> links
>> >>>>>>>> to `ExternalCatalog`.
>> >>>>>>>>
>> >>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext.
>> Is it
>> >>>>>>>>>
>> >>>>>>>> really needed? It looks like we can directly provide the
>> >> configuration
>> >>>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
>> >>>>>>>> create it
>> >>>>>>>> by itself under the hood.
>> >>>>>>>>
>> >>>>>>>> Actually, IgniteContext is base class for Ignite <-> Spark
>> >> integration
>> >>>>>>>> for now. So I tried to reuse it here. I like the idea to remove
>> >>>>>>>> explicit
>> >>>>>>>> usage of IgniteContext.
>> >>>>>>>> Will implement it in a few days.
>> >>>>>>>>
>> >>>>>>>> Actually, I think it makes sense to create a builder similar to
>> >>>>>>>>>
>> >>>>>>>> SparkSession.builder()...
>> >>>>>>>>
>> >>>>>>>> Great idea! I will implement such builder in a few days.
>> >>>>>>>>
>> >>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
>> >> case
>> >>>>>>>>>
>> >>>>>>>> when we don't have SQL configured on Ignite side?
>> >>>>>>>>
>> >>>>>>>> Yes, IgniteCacheRelation is Data Frame implementation for a
>> >> key-value
>> >>>>>>>> cache.
>> >>>>>>>>
>> >>>>>>>> I thought we decided not to support this, no? Or this is
>> something
>> >>>>>>>>> else?
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> My understanding is following:
>> >>>>>>>>
>> >>>>>>>> 1. We can’t support automatic resolving key-value caches in
>> >>>>>>>> *ExternalCatalog*. Because there is no way to reliably detect key
>> >> and
>> >>>>>>>> value
>> >>>>>>>> classes.
>> >>>>>>>>
>> >>>>>>>> 2. We can support key-value caches in regular Data Frame
>> >>>>>>>> implementation.
>> >>>>>>>> Because we can require user to provide key and value classes
>> >>>>>>>> explicitly.
>> >>>>>>>>
>> >>>>>>>> 8. Can you clarify the query syntax in
>> >> IgniteDataFrameExample#nativeS
>> >>>>>>>>>
>> >>>>>>>> parkSqlFromCacheExample2?
>> >>>>>>>>
>> >>>>>>>> Key-value cache:
>> >>>>>>>>
>> >>>>>>>> key - java.lang.Long,
>> >>>>>>>> value - case class Person(name: String, birthDate:
>> java.util.Date)
>> >>>>>>>>
>> >>>>>>>> Schema of data frame for cache is:
>> >>>>>>>>
>> >>>>>>>> key - long
>> >>>>>>>> value.name - string
>> >>>>>>>> value.birthDate - date
>> >>>>>>>>
>> >>>>>>>> So we can select data from data from cache:
>> >>>>>>>>
>> >>>>>>>> SELECT
>> >>>>>>>> key, `value.name`,  `value.birthDate`
>> >>>>>>>> FROM
>> >>>>>>>> testCache
>> >>>>>>>> WHERE key >= 2 AND `value.name` like '%0'
>> >>>>>>>>
>> >>>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6fe
>> >>>>>>>> bf417bc59b0519156fd4d09114c8da7
>> >>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCom
>> >>>>>>>> mentId=15794210&page=com.atlassian.jira.plugin.system.issuet
>> >>>>>>>> abpanels:comment-tabpanel#comment-15794210
>> >>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCom
>> >>>>>>>> mentId=15543733&page=com.atlassian.jira.plugin.system.issuet
>> >>>>>>>> abpanels:comment-tabpanel#comment-15543733
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> 18.10.2017 04:39, Dmitriy Setrakyan пишет:
>> >>>>>>>>
>> >>>>>>>> Val, thanks for the review. Can I ask you to add the same
>> comments
>> >> to
>> >>>>>>>> the
>> >>>>>>>>
>> >>>>>>>>> ticket?
>> >>>>>>>>>
>> >>>>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
>> >>>>>>>>> valentin.kulichenko@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Nikolay, Anton,
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> I did a high level review of the code. First of all, impressive
>> >>>>>>>>>> results!
>> >>>>>>>>>> However, I have some questions/comments.
>> >>>>>>>>>>
>> >>>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
>> >>>>>>>>>> codebase?
>> >>>>>>>>>> Can
>> >>>>>>>>>> these classes reside under org.apache.ignite.spark instead?
>> >>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
>> >> guess
>> >>>>>>>>>> are
>> >>>>>>>>>> some king of config options. Can you describe the purpose of
>> each
>> >> of
>> >>>>>>>>>> them?
>> >>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>> >>>>>>>>>> Catalog
>> >>>>>>>>>> implementations and what is the difference?
>> >>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op.
>> What
>> >>>>>>>>>> are
>> >>>>>>>>>> our
>> >>>>>>>>>> plans on implementing them? Also, what exactly is planned in
>> >>>>>>>>>> IgniteOptimization and what is its purpose?
>> >>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
>> >> to be
>> >>>>>>>>>> set
>> >>>>>>>>>> manually on SQLContext each time it's created. This seems to be
>> >> very
>> >>>>>>>>>> error
>> >>>>>>>>>> prone. Is there any way to automate this and improve usability?
>> >>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
>> >>>>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample,
>> which
>> >> is
>> >>>>>>>>>> confusing.
>> >>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext.
>> Is
>> >> it
>> >>>>>>>>>> really
>> >>>>>>>>>> needed? It looks like we can directly provide the configuration
>> >>>>>>>>>> file; if
>> >>>>>>>>>> IgniteSparkSession really requires IgniteContext, it can create
>> >> it by
>> >>>>>>>>>> itself under the hood. Actually, I think it makes sense to
>> create
>> >> a
>> >>>>>>>>>> builder
>> >>>>>>>>>> similar to SparkSession.builder(), it would be good if our APIs
>> >> here
>> >>>>>>>>>> are
>> >>>>>>>>>> consistent with Spark APIs.
>> >>>>>>>>>> 8. Can you clarify the query syntax
>> >>>>>>>>>> inIgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>> >>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for
>> the
>> >> case
>> >>>>>>>>>> when
>> >>>>>>>>>> we don't have SQL configured on Ignite side? I thought we
>> decided
>> >>>>>>>>>> not to
>> >>>>>>>>>> support this, no? Or this is something else?
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks!
>> >>>>>>>>>>
>> >>>>>>>>>> -Val
>> >>>>>>>>>>

Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Denis,

Nikolay was doing final changes and TC stabilization. I'm planning to do
final review this week, so hopefully we will merge the code soon.

-Val

On Mon, Dec 4, 2017 at 1:31 PM, Denis Magda <dm...@apache.org> wrote:

> Nikolay, Val,
>
> Since we agreed to release the feature without the strategy support, can
> the current integration meet the world in 2.4 release? Please chime in this
> conversation:
> http://apache-ignite-developers.2346864.n4.nabble.
> com/Time-and-scope-for-Apache-Ignite-2-4-td24987.html
>
> —
> Denis
>
> > On Nov 28, 2017, at 5:42 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
> >
> > Denis,
> >
> > Agree. I will do the final review in next few days and merge the code.
> >
> > -Val
> >
> > On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dm...@apache.org> wrote:
> >
> >> Guys,
> >>
> >> Looking into the parallel discussion about the strategy support I would
> >> change my initial stance and support the idea of releasing the
> integration
> >> in its current state. Is the code ready to be merged into the master?
> Let’s
> >> concentrate on this first and handle the strategy support as a separate
> >> JIRA task. Agree?
> >>
> >> —
> >> Denis
> >>
> >>> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
> >> valentin.kulichenko@gmail.com> wrote:
> >>>
> >>> Nikolay,
> >>>
> >>> Let's estimate the strategy implementation work, and then decide
> whether
> >> to
> >>> merge the code in current state or not. If anything is unclear, please
> >>> start a separate discussion.
> >>>
> >>> -Val
> >>>
> >>> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <
> nizhikov.dev@gmail.com>
> >>> wrote:
> >>>
> >>>> Hello, Val, Denis.
> >>>>
> >>>>> Personally, I think that we should release the integration only after
> >>>> the strategy is fully supported.
> >>>>
> >>>> I see two major reasons to propose the merge of the DataFrame API
> implementation
> >>>> without custom strategy:
> >>>>
> >>>> 1. My PR is relatively huge, already. From my experience of
> interaction
> >>>> with Ignite community - the bigger PR becomes, the more time of
> >> committers
> >>>> required to review PR.
> >>>> So, I propose to move smaller, but complete steps here.
> >>>>
> >>>> 2. It is not clear to me what exactly "custom strategy and
> >>>> optimization" includes.
> >>>> It seems that additional discussion is required.
> >>>> I think I can put my thoughts on paper and start a discussion right
> >>>> after the basic implementation is done.
> >>>>
> >>>>> Custom strategy implementation is actually very important for this
> >>>> integration.
> >>>>
> >>>> Understood and fully agreed.
> >>>> I'm ready to continue working in that area.
> >>>>
> >>>> 23.11.2017 02:15, Denis Magda wrote:
> >>>>
> >>>> Val, Nikolay,
> >>>>>
> >>>>> Personally, I think that we should release the integration only after
> >> the
> >>>>> strategy is fully supported. Without the strategy we don’t really
> >> leverage
> >>>>> from Ignite’s SQL engine and introduce redundant data movement
> between
> >>>>> Ignite and Spark nodes.
> >>>>>
> >>>>> How big is the effort to support the strategy in terms of the amount
> of
> >>>>> work left? 40%, 60%, 80%?
> >>>>>
> >>>>> —
> >>>>> Denis
> >>>>>
> >>>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
> >>>>>> valentin.kulichenko@gmail.com> wrote:
> >>>>>>
> >>>>>> Nikolay,
> >>>>>>
> >>>>>> Custom strategy implementation is actually very important for this
> >>>>>> integration. Basically, it will allow to create a SQL query for
> Ignite
> >>>>>> and
> >>>>>> execute it directly on the cluster. Your current implementation only
> >>>>>> adds a
> >>>>>> new DataSource which means that Spark will fetch data in its own
> >> memory
> >>>>>> first, and then do most of the work (like joins for example). Does
> it
> >>>>>> make
> >>>>>> sense to you? Can you please take a look at this and provide your
> >>>>>> thoughts
> >>>>>> on how much development is implied there?
> >>>>>>
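> >>>>>> To make the data-movement point concrete, here is a sketch; the table
> >>>>>> names and option keys are made up for illustration:
> >>>>>>
> >>>>>> ```
> >>>>>> // Today both relations are fetched into Spark's memory, and Spark
> >>>>>> // then performs the join itself:
> >>>>>> val persons = spark.read.format("ignite").option("table", "person").load()
> >>>>>> val cities  = spark.read.format("ignite").option("table", "city").load()
> >>>>>> persons.join(cities, persons("city_id") === cities("id")).show()
> >>>>>> // A custom strategy could instead push the whole join down to Ignite SQL.
> >>>>>> ```
> >>>>>>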
> >>>>>> Current code looks good to me though and I'm OK if the strategy is
> >>>>>> implemented as a next step in a scope of separate ticket. I will do
> >> final
> >>>>>> review early next week and will merge it if everything is OK.
> >>>>>>
> >>>>>> -Val
> >>>>>>
> >>>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
> >> nizhikov.dev@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hello.
> >>>>>>>
> >>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >> Catalog
> >>>>>>>>
> >>>>>>> implementations and what is the difference?
> >>>>>>>
> >>>>>>> IgniteCatalog removed.
> >>>>>>>
> >>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to
> be
> >>>>>>>>
> >>>>>>> set manually on SQLContext each time it's created....Is there any
> >> way to
> >>>>>>> automate this and improve usability?
> >>>>>>>
> >>>>>>> IgniteStrategy and IgniteOptimization are removed, as they are empty now.
> >>>>>>>
> >>>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>>>>
> >>>>>>> SparkSession.builder()...
> >>>>>>>
> >>>>>>> IgniteBuilder added.
> >>>>>>> Syntax looks like:
> >>>>>>>
> >>>>>>> ```
> >>>>>>> val igniteSession = IgniteSparkSession.builder()
> >>>>>>>   .appName("Spark Ignite catalog example")
> >>>>>>>   .master("local")
> >>>>>>>   .config("spark.executor.instances", "2")
> >>>>>>>   .igniteConfig(CONFIG)
> >>>>>>>   .getOrCreate()
> >>>>>>>
> >>>>>>> igniteSession.catalog.listTables().show()
> >>>>>>> ```
> >>>>>>>
> >>>>>>> Please, see updated PR - https://github.com/apache/ignite/pull/2742
> >>>>>>>
> >>>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com>:
> >>>>>>>
> >>>>>>> Hello, Valentin.
> >>>>>>>>
> >>>>>>>> My answers are below.
> >>>>>>>> Dmitry, do we need to move discussion to Jira?
> >>>>>>>>
> >>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >> codebase?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> As I mentioned earlier, to implement and override Spark Catalog
> one
> >>>>>>>> have
> >>>>>>>> to use internal(private) Spark API.
> >>>>>>>> So I have to use package `org.spark.sql.***` to have access to
> >> private
> >>>>>>>> class and variables.
> >>>>>>>>
> >>>>>>>> For example, SharedState class that stores link to ExternalCatalog
> >>>>>>>> declared as `private[sql] class SharedState` - i.e. package
> private.
> >>>>>>>>
> >>>>>>>> Can these classes reside under org.apache.ignite.spark instead?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> No, as long as we want to have our own implementation of
> >>>>>>>> ExternalCatalog.
> >>>>>>>>
> >>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
> guess
> >> are
> >>>>>>>>>
> >>>>>>>> some king of config options. Can you describe the purpose of each
> of
> >>>>>>>> them?
> >>>>>>>>
> >>>>>>>> I extend comments for this options.
> >>>>>>>> Please, see my commit [1] or PR HEAD:
> >>>>>>>>
> >>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >> Catalog
> >>>>>>>>>
> >>>>>>>> implementations and what is the difference?
> >>>>>>>>
> >>>>>>>> Good catch, thank you!
> >>>>>>>> After additional research I founded that only
> IgniteExternalCatalog
> >>>>>>>> required.
> >>>>>>>> I will update PR with IgniteCatalog remove in a few days.
> >>>>>>>>
> >>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
> >> are
> >>>>>>>>>
> >>>>>>>> our plans on implementing them? Also, what exactly is planned in
> >>>>>>>> IgniteOptimization and what is its purpose?
> >>>>>>>>
> >>>>>>>> Actually, this is very good question :)
> >>>>>>>> And I need advice from experienced community members here:
> >>>>>>>>
> >>>>>>>> `IgniteOptimization` purpose is to modify query plan created by
> >> Spark.
> >>>>>>>> Currently, we have one optimization described in IGNITE-3084 [2]
> by
> >>>>>>>> you,
> >>>>>>>> Valentin :) :
> >>>>>>>>
> >>>>>>>> “If there are non-Ignite relations in the plan, we should fall
> back
> >> to
> >>>>>>>> native Spark strategies“
> >>>>>>>>
> >>>>>>>> I think we can go little further and reduce join of two Ignite
> >> backed
> >>>>>>>> Data Frames into single Ignite SQL query. Currently, this feature
> is
> >>>>>>>> unimplemented.
> >>>>>>>>
> >>>>>>>> *Do we need it now? Or we can postpone it and concentrates on
> basic
> >>>>>>>> Data
> >>>>>>>> Frame and Catalog implementation?*
> >>>>>>>>
> >>>>>>>> `Strategy` purpose, as you correctly mentioned in [2], is
> transform
> >>>>>>>> LogicalPlan into physical operators.
> >>>>>>>> I don’t have ideas how to use this opportunity. So I think we
> don’t
> >>>>>>>> need
> >>>>>>>> IgniteStrategy.
> >>>>>>>>
> >>>>>>>> Can you or anyone else suggest some optimization strategy to speed
> >> up
> >>>>>>>> SQL
> >>>>>>>> query execution?
> >>>>>>>>
> >>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to
> >> be
> >>>>>>>>>
> >>>>>>>> set manually on SQLContext each time it's created....Is there any
> >> way
> >>>>>>>> to
> >>>>>>>> automate this and improve usability?
> >>>>>>>>
> >>>>>>>> These classes added to `extraOptimizations` when one using
> >>>>>>>> IgniteSparkSession.
> >>>>>>>> As far as I know, there is no way to automatically add these
> >> classes to
> >>>>>>>> regular SparkSession.
> >>>>>>>>
> >>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
> >>>>>>>>>
> >>>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
> >>>>>>>> Confusing.
> >>>>>>>>
> >>>>>>>> DataFrame API is *public* Spark API. So anyone can provide
> >>>>>>>> implementation
> >>>>>>>> and plug it into Spark. That’s why IgniteDataFrameExample doesn’t
> >> need
> >>>>>>>> any
> >>>>>>>> Ignite specific session.
> >>>>>>>>
> >>>>>>>> Catalog API is *internal* Spark API. There is no way to plug
> custom
> >>>>>>>> catalog implementation into Spark [3]. So we have to use
> >>>>>>>> `IgniteSparkSession` that extends regular SparkSession and
> overrides
> >>>>>>>> links
> >>>>>>>> to `ExternalCatalog`.
> >>>>>>>>
> >>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is
> it
> >>>>>>>>>
> >>>>>>>> really needed? It looks like we can directly provide the
> >> configuration
> >>>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
> >>>>>>>> create it
> >>>>>>>> by itself under the hood.
> >>>>>>>>
> >>>>>>>> Actually, IgniteContext is base class for Ignite <-> Spark
> >> integration
> >>>>>>>> for now. So I tried to reuse it here. I like the idea to remove
> >>>>>>>> explicit
> >>>>>>>> usage of IgniteContext.
> >>>>>>>> Will implement it in a few days.
> >>>>>>>>
> >>>>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>>>>>
> >>>>>>>> SparkSession.builder()...
> >>>>>>>>
> >>>>>>>> Great idea! I will implement such builder in a few days.
> >>>>>>>>
> >>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
> >> case
> >>>>>>>>>
> >>>>>>>> when we don't have SQL configured on Ignite side?
> >>>>>>>>
> >>>>>>>> Yes, IgniteCacheRelation is Data Frame implementation for a
> >> key-value
> >>>>>>>> cache.
> >>>>>>>>
> >>>>>>>> I thought we decided not to support this, no? Or this is something
> >>>>>>>>> else?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> My understanding is following:
> >>>>>>>>
> >>>>>>>> 1. We can’t support automatic resolving key-value caches in
> >>>>>>>> *ExternalCatalog*. Because there is no way to reliably detect key
> >> and
> >>>>>>>> value
> >>>>>>>> classes.
> >>>>>>>>
> >>>>>>>> 2. We can support key-value caches in regular Data Frame
> >>>>>>>> implementation.
> >>>>>>>> Because we can require user to provide key and value classes
> >>>>>>>> explicitly.
> >>>>>>>>
> >>>>>>>> 8. Can you clarify the query syntax in
> >> IgniteDataFrameExample#nativeS
> >>>>>>>>>
> >>>>>>>> parkSqlFromCacheExample2?
> >>>>>>>>
> >>>>>>>> Key-value cache:
> >>>>>>>>
> >>>>>>>> key - java.lang.Long,
> >>>>>>>> value - case class Person(name: String, birthDate: java.util.Date)
> >>>>>>>>
> >>>>>>>> Schema of data frame for cache is:
> >>>>>>>>
> >>>>>>>> key - long
> >>>>>>>> value.name - string
> >>>>>>>> value.birthDate - date
> >>>>>>>>
> >>>>>>>> So we can select data from data from cache:
> >>>>>>>>
> >>>>>>>> SELECT
> >>>>>>>> key, `value.name`,  `value.birthDate`
> >>>>>>>> FROM
> >>>>>>>> testCache
> >>>>>>>> WHERE key >= 2 AND `value.name` like '%0'
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6fe
> >>>>>>>> bf417bc59b0519156fd4d09114c8da7
> >>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCom
> >>>>>>>> mentId=15794210&page=com.atlassian.jira.plugin.system.issuet
> >>>>>>>> abpanels:comment-tabpanel#comment-15794210
> >>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCom
> >>>>>>>> mentId=15543733&page=com.atlassian.jira.plugin.system.issuet
> >>>>>>>> abpanels:comment-tabpanel#comment-15543733
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 18.10.2017 04:39, Dmitriy Setrakyan wrote:
> >>>>>>>>
> >>>>>>>>> Val, thanks for the review. Can I ask you to add the same comments
> >>>>>>>>> to the ticket?
> >>>>>>>>> ticket?
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
> >>>>>>>>> valentin.kulichenko@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Nikolay, Anton,
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I did a high level review of the code. First of all, impressive
> >>>>>>>>>> results! However, I have some questions/comments.
> >>>>>>>>>>
> >>>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
> >>>>>>>>>> codebase? Can these classes reside under org.apache.ignite.spark
> >>>>>>>>>> instead?
> >>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess
> >>>>>>>>>> are some kind of config options. Can you describe the purpose of
> >>>>>>>>>> each of them?
> >>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
> >>>>>>>>>> Catalog implementations and what is the difference?
> >>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
> >>>>>>>>>> are our plans on implementing them? Also, what exactly is planned
> >>>>>>>>>> in IgniteOptimization and what is its purpose?
> >>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to
> >>>>>>>>>> be set manually on SQLContext each time it's created. This seems to
> >>>>>>>>>> be very error prone. Is there any way to automate this and improve
> >>>>>>>>>> usability?
> >>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
> >>>>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
> >>>>>>>>>> confusing.
> >>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is
> >>>>>>>>>> it really needed? It looks like we can directly provide the
> >>>>>>>>>> configuration file; if IgniteSparkSession really requires
> >>>>>>>>>> IgniteContext, it can create it by itself under the hood. Actually,
> >>>>>>>>>> I think it makes sense to create a builder similar to
> >>>>>>>>>> SparkSession.builder(), it would be good if our APIs here are
> >>>>>>>>>> consistent with Spark APIs.
> >>>>>>>>>> 8. Can you clarify the query syntax
> >>>>>>>>>> in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
> >>>>>>>>>> case when we don't have SQL configured on Ignite side? I thought we
> >>>>>>>>>> decided not to support this, no? Or this is something else?
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>> -Val
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <avinogradov@gridgain.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sounds awesome.
> >>>>>>>>>>>
> >>>>>>>>>>> I'll try to review API & tests this week.
> >>>>>>>>>>>
> >>>>>>>>>>> Val,
> >>>>>>>>>>> Your review still required :)
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Yes
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Oct 17, 2017, 2:34 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Nikolay,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So, it will be able to start regular spark and ignite clusters
> >>>>>>>>>>>>> and, using peer classloading via spark-context, perform any
> >>>>>>>>>>>>> DataFrame request, correct?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello, Anton.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> An example you provide is a path to a master *local* file.
> >>>>>>>>>>>>>> These libraries are added to the classpath for each remote node
> >>>>>>>>>>>>>> running the submitted job.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please, see documentation:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
> >>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <avinogradov@gridgain.com>:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Nikolay,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> With Data Frame API implementation there are no requirements
> >>>>>>>>>>>>>>>> to have any Ignite files on spark worker nodes.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What do you mean? I see code like:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
> >>>>>>>>>>>>>>>   "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hello, guys.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have created an example application to run Ignite Data
> >>>>>>>>>>>>>>>> Frame on a standalone Spark cluster.
> >>>>>>>>>>>>>>>> With Data Frame API implementation there are no requirements
> >>>>>>>>>>>>>>>> to have any Ignite files on spark worker nodes.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I ran this application on the free dataset: ATP tennis match
> >>>>>>>>>>>>>>>> statistics.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
> >>>>>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Valentin, do you have a chance to look at my changes?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com>:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi Nikolay,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Sorry for delay on this, got a little swamped lately. I will
> >>>>>>>>>>>>>>>>> do my best to review the code this week.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -Val
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hello, Valentin.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Now I think I have done almost all required features.
> >>>>>>>>>>>>>>>>>> I want to make some performance tests to ensure my
> >>>>>>>>>>>>>>>>>> implementation works properly with a significant amount of
> >>>>>>>>>>>>>>>>>> data.
> >>>>>>>>>>>>>>>>>> And I definitely need some feedback for my changes.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com>:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hello, guys.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
> >>>>>>>>>>>>>>>>>>>    * Can be run on JDK 7.
> >>>>>>>>>>>>>>>>>>>    * Still supported: 2.1.2 will be released soon.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2. Latest Spark version is 2.2.0.
> >>>>>>>>>>>>>>>>>>>    * Can be run only on JDK 8+
> >>>>>>>>>>>>>>>>>>>    * Released Jul 11, 2017.
> >>>>>>>>>>>>>>>>>>>    * Already supported by huge vendors (Amazon for example).
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark
> >>>>>>>>>>>>>>>>>>> API. So it will take some effort to switch between Spark
> >>>>>>>>>>>>>>>>>>> 2.1 and 2.2
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com>:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I will review in the next few days.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> -Val
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dmagda@apache.org> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hello Nikolay,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is coming to
> >>>>>>>>>>>>>>>>>>>>> Ignite.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Answering on your questions.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance
> >>>>>>>>>>>>>>>>>>>>> measurements. As a Spark user, I will be curious to know
> >>>>>>>>>>>>>>>>>>>>> what’s the point of this integration. Probably we need
> >>>>>>>>>>>>>>>>>>>>> to compare Spark + Ignite and Spark + Hive or
> >>>>>>>>>>>>>>>>>>>>> Spark + RDBMS cases.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 2. If Spark community is reluctant let’s include the
> >>>>>>>>>>>>>>>>>>>>> module in ignite-spark integration.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> —
> >>>>>>>>>>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Nikolay Izhikov
> >>>>>>> NIzhikov.dev@gmail.com
>
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Denis Magda <dm...@apache.org>.
Nikolay, Val,

Since we agreed to release the feature without the strategy support, can the current integration meet the world in 2.4 release? Please chime in this conversation:
http://apache-ignite-developers.2346864.n4.nabble.com/Time-and-scope-for-Apache-Ignite-2-4-td24987.html

—
Denis

> On Nov 28, 2017, at 5:42 PM, Valentin Kulichenko <va...@gmail.com> wrote:
> 
> Denis,
> 
> Agree. I will do the final review in next few days and merge the code.
> 
> -Val
> 
> On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dm...@apache.org> wrote:
> 
>> Guys,
>> 
>> Looking into the parallel discussion about the strategy support I would
>> change my initial stance and support the idea of releasing the integration
>> in its current state. Is the code ready to be merged into the master? Let’s
>> concentrate on this first and handle the strategy support as a separate
>> JIRA task. Agree?
>> 
>> —
>> Denis
>> 
>>> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
>> valentin.kulichenko@gmail.com> wrote:
>>> 
>>> Nikolay,
>>> 
>>> Let's estimate the strategy implementation work, and then decide whether
>> to
>>> merge the code in current state or not. If anything is unclear, please
>>> start a separate discussion.
>>> 
>>> -Val
>>> 
>>> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <ni...@gmail.com>
>>> wrote:
>>> 
>>>> Hello, Val, Denis.
>>>> 
>>>>> Personally, I think that we should release the integration only after
>>>> the strategy is fully supported.
>>>> 
>>>> I see two major reasons to propose the merge of the DataFrame API implementation
>>>> without custom strategy:
>>>> 
>>>> 1. My PR is relatively huge, already. From my experience of interaction
>>>> with Ignite community - the bigger PR becomes, the more time of
>> committers
>>>> required to review PR.
>>>> So, I propose to move smaller, but complete steps here.
>>>> 
>>>> 2. It is not clear to me what exactly "custom strategy and
>>>> optimization" includes.
>>>> It seems that additional discussion is required.
>>>> I think I can put my thoughts on paper and start a discussion right
>>>> after the basic implementation is done.
>>>> 
>>>>> Custom strategy implementation is actually very important for this
>>>> integration.
>>>> 
>>>> Understood and fully agreed.
>>>> I'm ready to continue working in that area.
>>>> 
>>>> 23.11.2017 02:15, Denis Magda wrote:
>>>> 
>>>> Val, Nikolay,
>>>>> 
>>>>> Personally, I think that we should release the integration only after
>> the
>>>>> strategy is fully supported. Without the strategy we don’t really
>> leverage
>>>>> from Ignite’s SQL engine and introduce redundant data movement between
>>>>> Ignite and Spark nodes.
>>>>> 
>>>>> How big is the effort to support the strategy in terms of the amount of
>>>>> work left? 40%, 60%, 80%?
>>>>> 
>>>>> —
>>>>> Denis
>>>>> 
>>>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
>>>>>> valentin.kulichenko@gmail.com> wrote:
>>>>>> 
>>>>>> Nikolay,
>>>>>> 
>>>>>> Custom strategy implementation is actually very important for this
>>>>>> integration. Basically, it will allow to create a SQL query for Ignite
>>>>>> and
>>>>>> execute it directly on the cluster. Your current implementation only
>>>>>> adds a
>>>>>> new DataSource which means that Spark will fetch data in its own
>> memory
>>>>>> first, and then do most of the work (like joins for example). Does it
>>>>>> make
>>>>>> sense to you? Can you please take a look at this and provide your
>>>>>> thoughts
>>>>>> on how much development is implied there?
>>>>>> 
>>>>>> Current code looks good to me though and I'm OK if the strategy is
>>>>>> implemented as a next step in a scope of separate ticket. I will do
>> final
>>>>>> review early next week and will merge it if everything is OK.
>>>>>> 
>>>>>> -Val
>>>>>> 
>>>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
>> nizhikov.dev@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hello.
>>>>>>> 
>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>> Catalog
>>>>>>>> 
>>>>>>> implementations and what is the difference?
>>>>>>> 
>>>>>>> IgniteCatalog removed.
>>>>>>> 
>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>>>> 
>>>>>>> set manually on SQLContext each time it's created....Is there any
>> way to
>>>>>>> automate this and improve usability?
>>>>>>> 
>>>>>>> IgniteStrategy and IgniteOptimization are removed, as they are empty now.
>>>>>>> 
>>>>>>> Actually, I think it makes sense to create a builder similar to
>>>>>>>> 
>>>>>>> SparkSession.builder()...
>>>>>>> 
>>>>>>> IgniteBuilder added.
>>>>>>> Syntax looks like:
>>>>>>> 
>>>>>>> ```
>>>>>>> val igniteSession = IgniteSparkSession.builder()
>>>>>>>   .appName("Spark Ignite catalog example")
>>>>>>>   .master("local")
>>>>>>>   .config("spark.executor.instances", "2")
>>>>>>>   .igniteConfig(CONFIG)
>>>>>>>   .getOrCreate()
>>>>>>> 
>>>>>>> igniteSession.catalog.listTables().show()
>>>>>>> ```
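>>>>>>>
>>>>>>> With the catalog in place, Ignite tables can then be queried through
>>>>>>> the session directly (assuming a `person` table exists on the Ignite
>>>>>>> side; the table name here is made up):
>>>>>>>
>>>>>>> ```
>>>>>>> igniteSession.sql("SELECT id, name FROM person WHERE id > 10").show()
>>>>>>> ```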
>>>>>>> 
>>>>>>> Please, see updated PR - https://github.com/apache/ignite/pull/2742
>>>>>>> 
>>>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
>>>>>>> 
>>>>>>>> Hello, Valentin.
>>>>>>>>
>>>>>>>> My answers are below.
>>>>>>>> Dmitry, do we need to move discussion to Jira?
>>>>>>>>
>>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
>>>>>>>>> codebase?
>>>>>>>>
>>>>>>>> As I mentioned earlier, to implement and override the Spark Catalog one
>>>>>>>> has to use internal (private) Spark API.
>>>>>>>> So I have to use the package `org.spark.sql.***` to have access to
>>>>>>>> private classes and variables.
>>>>>>>> 
>>>>>>>> For example, the SharedState class that stores a link to ExternalCatalog
>>>>>>>> is declared as `private[sql] class SharedState` - i.e. package private.
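>>>>>>>>
>>>>>>>> To illustrate the placement trick, here is a minimal sketch (not the
>>>>>>>> PR code; the constructor arguments are illustrative):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> // Must live under org.apache.spark.sql to see private[sql] members.
>>>>>>>> package org.apache.spark.sql.ignite
>>>>>>>>
>>>>>>>> import org.apache.spark.SparkContext
>>>>>>>> import org.apache.spark.sql.catalyst.catalog.ExternalCatalog
>>>>>>>> import org.apache.spark.sql.internal.SharedState
>>>>>>>>
>>>>>>>> // Subclassing compiles only because this file sits inside the
>>>>>>>> // org.apache.spark.sql package: SharedState itself is private[sql].
>>>>>>>> class IgniteSharedState(sc: SparkContext) extends SharedState(sc) {
>>>>>>>>   override lazy val externalCatalog: ExternalCatalog =
>>>>>>>>     new IgniteExternalCatalog(sc) // our catalog implementation
>>>>>>>> }
>>>>>>>> ```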
>>>>>>>> 
>>>>>>>> Can these classes reside under org.apache.ignite.spark instead?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> No, as long as we want to have our own implementation of
>>>>>>>> ExternalCatalog.
>>>>>>>> 
>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess
>>>>>>>>> are some kind of config options. Can you describe the purpose of each
>>>>>>>>> of them?
>>>>>>>>
>>>>>>>> I extended the comments for these options.
>>>>>>>> Please, see my commit [1] or PR HEAD.
>>>>>>>>
>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>>>>>>>> implementations and what is the difference?
>>>>>>>>
>>>>>>>> Good catch, thank you!
>>>>>>>> After additional research I found that only IgniteExternalCatalog is
>>>>>>>> required.
>>>>>>>> I will update the PR with IgniteCatalog removed in a few days.
>>>>>>>> 
>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
>>>>>>>>> are our plans on implementing them? Also, what exactly is planned in
>>>>>>>>> IgniteOptimization and what is its purpose?
>>>>>>>>
>>>>>>>> Actually, this is a very good question :)
>>>>>>>> And I need advice from experienced community members here:
>>>>>>>>
>>>>>>>> `IgniteOptimization`'s purpose is to modify the query plan created by
>>>>>>>> Spark.
>>>>>>>> Currently, we have one optimization, described in IGNITE-3084 [2] by
>>>>>>>> you, Valentin :) :
>>>>>>>>
>>>>>>>> “If there are non-Ignite relations in the plan, we should fall back to
>>>>>>>> native Spark strategies“
>>>>>>>>
>>>>>>>> I think we can go a little further and reduce a join of two
>>>>>>>> Ignite-backed Data Frames into a single Ignite SQL query. Currently,
>>>>>>>> this feature is unimplemented.
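>>>>>>>>
>>>>>>>> For example (schematic only; the table and column names are made up),
>>>>>>>> a Spark-side join of two Ignite relations, roughly
>>>>>>>> `scan(person) JOIN scan(city)`, could be rewritten into one query that
>>>>>>>> Ignite executes itself:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> SELECT p.name, c.name
>>>>>>>> FROM person p JOIN city c ON p.city_id = c.id
>>>>>>>> ```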
>>>>>>>> 
>>>>>>>> *Do we need it now? Or can we postpone it and concentrate on the basic
>>>>>>>> Data Frame and Catalog implementation?*
>>>>>>>>
>>>>>>>> `Strategy`'s purpose, as you correctly mentioned in [2], is to transform
>>>>>>>> a LogicalPlan into physical operators.
>>>>>>>> I don't have ideas on how to use this opportunity. So I think we don't
>>>>>>>> need IgniteStrategy.
>>>>>>>>
>>>>>>>> Can you or anyone else suggest some optimization strategy to speed up
>>>>>>>> SQL query execution?
>>>>>>>> 
>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>>>>> set manually on SQLContext each time it's created... Is there any way
>>>>>>>>> to automate this and improve usability?
>>>>>>>>
>>>>>>>> These classes are added to `extraOptimizations` when one uses
>>>>>>>> IgniteSparkSession.
>>>>>>>> As far as I know, there is no way to automatically add these classes to
>>>>>>>> a regular SparkSession.
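>>>>>>>>
>>>>>>>> The closest manual equivalent on a plain session would be something
>>>>>>>> like this sketch (IgniteOptimization stands for the optimizer rule
>>>>>>>> from the PR):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> val spark = SparkSession.builder().getOrCreate()
>>>>>>>> // ExperimentalMethods lets applications inject extra optimizer rules:
>>>>>>>> spark.experimental.extraOptimizations ++= Seq(IgniteOptimization)
>>>>>>>> ```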
>>>>>>>> 
>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
>>>>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
>>>>>>>>> confusing.
>>>>>>>>
>>>>>>>> DataFrame API is a *public* Spark API. So anyone can provide an
>>>>>>>> implementation and plug it into Spark. That's why
>>>>>>>> IgniteDataFrameExample doesn't need any Ignite-specific session.
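>>>>>>>>
>>>>>>>> For instance, a plain session is enough on the read path (a sketch;
>>>>>>>> the format name and option keys follow the PR and may still change):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>
>>>>>>>> val persons = spark.read
>>>>>>>>   .format("ignite")           // resolved via IgniteRelationProvider
>>>>>>>>   .option("config", CONFIG)   // Ignite configuration file
>>>>>>>>   .option("table", "person")  // Ignite SQL table (made-up name)
>>>>>>>>   .load()
>>>>>>>> ```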
>>>>>>>> 
>>>>>>>> Catalog API is an *internal* Spark API. There is no way to plug a
>>>>>>>> custom catalog implementation into Spark [3]. So we have to use
>>>>>>>> `IgniteSparkSession`, which extends the regular SparkSession and
>>>>>>>> overrides the links to `ExternalCatalog`.
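>>>>>>>>
>>>>>>>> Schematically, the session wiring looks like this (again a sketch,
>>>>>>>> reusing the IgniteSharedState sketched above; the real constructor
>>>>>>>> differs):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> package org.apache.spark.sql.ignite
>>>>>>>>
>>>>>>>> import org.apache.spark.SparkContext
>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>> import org.apache.spark.sql.internal.SharedState
>>>>>>>>
>>>>>>>> class IgniteSparkSession(sc: SparkContext) extends SparkSession(sc) {
>>>>>>>>   // Override the package-private sharedState so that its
>>>>>>>>   // externalCatalog resolves Ignite SQL tables transparently.
>>>>>>>>   override lazy val sharedState: SharedState = new IgniteSharedState(sc)
>>>>>>>> }
>>>>>>>> ```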
>>>>>>>> 
>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
>>>>>>>>> really needed? It looks like we can directly provide the configuration
>>>>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
>>>>>>>>> create it by itself under the hood.
>>>>>>>>
>>>>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
>>>>>>>> integration for now. So I tried to reuse it here. I like the idea of
>>>>>>>> removing the explicit usage of IgniteContext.
>>>>>>>> Will implement it in a few days.
>>>>>>>> 
>>>>>>>> Actually, I think it makes sense to create a builder similar to
>>>>>>>>> 
>>>>>>>> SparkSession.builder()...
>>>>>>>> 
>>>>>>>> Great idea! I will implement such builder in a few days.
>>>>>>>> 
>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
>> case
>>>>>>>>> 
>>>>>>>> when we don't have SQL configured on Ignite side?
>>>>>>>> 
>>>>>>>> Yes, IgniteCacheRelation is Data Frame implementation for a
>> key-value
>>>>>>>> cache.
>>>>>>>> 
>>>>>>>> I thought we decided not to support this, no? Or this is something
>>>>>>>>> else?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> My understanding is following:
>>>>>>>> 
>>>>>>>> 1. We can’t support automatic resolving key-value caches in
>>>>>>>> *ExternalCatalog*. Because there is no way to reliably detect key
>> and
>>>>>>>> value
>>>>>>>> classes.
>>>>>>>> 
>>>>>>>> 2. We can support key-value caches in regular Data Frame
>>>>>>>> implementation.
>>>>>>>> Because we can require user to provide key and value classes
>>>>>>>> explicitly.
>>>>>>>> 
>>>>>>>> 8. Can you clarify the query syntax in
>> IgniteDataFrameExample#nativeS
>>>>>>>>> 
>>>>>>>> parkSqlFromCacheExample2?
>>>>>>>> 
>>>>>>>> Key-value cache:
>>>>>>>> 
>>>>>>>> key - java.lang.Long,
>>>>>>>> value - case class Person(name: String, birthDate: java.util.Date)
>>>>>>>> 
>>>>>>>> Schema of data frame for cache is:
>>>>>>>> 
>>>>>>>> key - long
>>>>>>>> value.name - string
>>>>>>>> value.birthDate - date
>>>>>>>> 
>>>>>>>> So we can select data from data from cache:
>>>>>>>> 
>>>>>>>> SELECT
>>>>>>>> key, `value.name`,  `value.birthDate`
>>>>>>>> FROM
>>>>>>>> testCache
>>>>>>>> WHERE key >= 2 AND `value.name` like '%0'
>>>>>>>> 
>>>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6fe
>>>>>>>> bf417bc59b0519156fd4d09114c8da7
>>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCom
>>>>>>>> mentId=15794210&page=com.atlassian.jira.plugin.system.issuet
>>>>>>>> abpanels:comment-tabpanel#comment-15794210
>>>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCom
>>>>>>>> mentId=15543733&page=com.atlassian.jira.plugin.system.issuet
>>>>>>>> abpanels:comment-tabpanel#comment-15543733
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 18.10.2017 04:39, Dmitriy Setrakyan пишет:
>>>>>>>> 
>>>>>>>> Val, thanks for the review. Can I ask you to add the same comments
>> to
>>>>>>>> the
>>>>>>>> 
>>>>>>>>> ticket?
>>>>>>>>> 
>>>>>>>>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
>>>>>>>>> valentin.kulichenko@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Nikolay, Anton,
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I did a high level review of the code. First of all, impressive
>>>>>>>>>> results!
>>>>>>>>>> However, I have some questions/comments.
>>>>>>>>>> 
>>>>>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our
>>>>>>>>>> codebase?
>>>>>>>>>> Can
>>>>>>>>>> these classes reside under org.apache.ignite.spark instead?
>>>>>>>>>> 2. IgniteRelationProvider contains multiple constants which I
>> guess
>>>>>>>>>> are
>>>>>>>>>> some king of config options. Can you describe the purpose of each
>> of
>>>>>>>>>> them?
>>>>>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two
>>>>>>>>>> Catalog
>>>>>>>>>> implementations and what is the difference?
>>>>>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What
>>>>>>>>>> are
>>>>>>>>>> our
>>>>>>>>>> plans on implementing them? Also, what exactly is planned in
>>>>>>>>>> IgniteOptimization and what is its purpose?
>>>>>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have
>> to be
>>>>>>>>>> set
>>>>>>>>>> manually on SQLContext each time it's created. This seems to be
>> very
>>>>>>>>>> error
>>>>>>>>>> prone. Is there any way to automate this and improve usability?
>>>>>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used
>>>>>>>>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which
>> is
>>>>>>>>>> confusing.
>>>>>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is
>> it
>>>>>>>>>> really
>>>>>>>>>> needed? It looks like we can directly provide the configuration
>>>>>>>>>> file; if
>>>>>>>>>> IgniteSparkSession really requires IgniteContext, it can create
>> it by
>>>>>>>>>> itself under the hood. Actually, I think it makes sense to create
>> a
>>>>>>>>>> builder
>>>>>>>>>> similar to SparkSession.builder(), it would be good if our APIs
>> here
>>>>>>>>>> are
>>>>>>>>>> consistent with Spark APIs.
>>>>>>>>>> 8. Can you clarify the query syntax
>>>>>>>>>> inIgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the
>> case
>>>>>>>>>> when
>>>>>>>>>> we don't have SQL configured on Ignite side? I thought we decided
>>>>>>>>>> not to
>>>>>>>>>> support this, no? Or this is something else?
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> -Val
>>>>>>>>>> 
>>>>>>>>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
>>>>>>>>>> avinogradov@gridgain.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Sounds awesome.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I'll try to review API & tests this week.
>>>>>>>>>>> 
>>>>>>>>>>> Val,
>>>>>>>>>>> Your review still required :)
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <
>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Yes
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 17 окт. 2017 г. 2:34 PM пользователь "Anton Vinogradov" <
>>>>>>>>>>>> avinogradov@gridgain.com> написал:
>>>>>>>>>>>> 
>>>>>>>>>>>> Nikolay,
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So, it will be able to start regular spark and ignite clusters
>>>>>>>>>>>>> and,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> using
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> peer classloading via spark-context, perform any DataFrame
>> request,
>>>>>>>>>>>> 
>>>>>>>>>>>>> correct?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <
>>>>>>>>>>>>> 
>>>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello, Anton.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> An example you provide is a path to a master *local* file.
>>>>>>>>>>>>>> These libraries are added to the classpath for each remote
>> node
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> running
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> submitted job.
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Please, see documentation:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/
>>>>>>>>>>>>>> spark/SparkContext.html#addJar(java.lang.String)
>>>>>>>>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/
>>>>>>>>>>>>>> spark/SparkContext.html#addFile(java.lang.String)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> avinogradov@gridgain.com
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> :
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Nikolay,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With Data Frame API implementation there are no requirements
>> to
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> any
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ignite files on spark worker nodes.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What do you mean? I see code like:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
>>>>>>>>>>>>>>> "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
>>>>>>>>>>>>>>> core-2.3.0-SNAPSHOT.jar")
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have created example application to run Ignite Data Frame
>> on
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> standalone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Spark cluster.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> With Data Frame API implementation there are no
>> requirements to
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> any
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ignite files on spark worker nodes.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I ran this application on the free dataset: ATP tennis match
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> statistics.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> data - https://github.com/nizhikov/atp_matches
>>>>>>>>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Valentin, do you have a chance to look at my changes?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
>>>>>>>>>>>>>>>> valentin.kulichenko@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Nikolay,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Sorry for delay on this, got a little swamped lately. I
>> will
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>> my
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> best
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> review the code this week.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hello, Valentin.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Did you have a chance to look at my changes?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Now I think I have done almost all required features.
>>>>>>>>>>>>>>>>>> I want to make some performance test to ensure my
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>> work
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> properly with a significant amount of data.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> And I definitely need some feedback for my changes.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> :
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Which version of Spark do we want to use?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>    * Can be run on JDK 7.
>>>>>>>>>>>>>>>>>>>    * Still supported: 2.1.2 will be released soon.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2. Latest Spark version is 2.2.0.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>    * Can be run only on JDK 8+
>>>>>>>>>>>>>>>>>>>    * Released Jul 11, 2017.
>>>>>>>>>>>>>>>>>>>    * Already supported by huge vendors(Amazon for
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> example).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> API.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>> So It will take some effort to switch between Spark 2.1 and
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2.2
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
>>>>>>>>>>>>>>>>>>> valentin.kulichenko@gmail.com>:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I will review in the next few days.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> dmagda@apache.org
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> This is good news. Finally this capability is coming to
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Ignite.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Answering on your questions.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 1. Yardstick should be enough for performance
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> measurements.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> As a
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Spark
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> user, I will be curious to know what’s the point of this
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> integration.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Probably we need to compare Spark + Ignite and Spark +
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> or
>>>>>>>>>>>> 
>>>>>>>>>>>> Spark +
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> RDBMS cases.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2. If Spark community is reluctant let’s include the
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> module
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> in
>>>>>>>>>>>> 
>>>>>>>>>>>> ignite-spark integration.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> —
>>>>>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Currently, I’m working on integration between Spark
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> Ignite
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> [1].
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> For now, I implement following:
>>>>>>>>>>>>>>>>>>>>>>   * Ignite DataSource implementation(
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> IgniteRelationProvider)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>   * DataFrame support for Ignite SQL table.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   * IgniteCatalog implementation for a transparent
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> resolving
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> of
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ignites
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> SQL tables.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Implementation of it can be found in PR [2]
>>>>>>>>>>>>>>>>>>>>>> It would be great if someone provides feedback for a
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> prototype.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I made some examples in PR so you can see how API
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> suppose
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> to
>>>>>>>>>>>> 
>>>>>>>>>>>> be
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> used [3].
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [4].
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I need some advice. Can you help me?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1. How should this PR be tested?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> scalability
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> tests, etc.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
>>>>>>>>>>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>>>>>>>>>>> Which scenarios should I consider in the first place?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2. Should we provide Spark Catalog implementation
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> inside
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> Ignite
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>> codebase?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> A current implementation of Spark Catalog based on
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> *internal
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> Spark
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> API*.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Spark community seems not interested in making Catalog
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> API
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> public
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> or
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> including Ignite Catalog in Spark code base [5], [6].
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> *Should we include Spark internal API implementation
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> inside
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> Ignite
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> base?*
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Or should we consider to include Catalog
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>> in
>>>>>>>>>>> 
>>>>>>>>>>> some
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> external
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> module?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> That will be created and released outside Ignite?(we
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> can
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> support
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> develop it inside Ignite community).
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/apache/
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ignite/pull/2742/files#diff-
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> f4ff509cef3018e221394474775e0905
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [4] https://github.com/apache/
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ignite/pull/2742/files#diff-
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>> f2b670497d81e780dfd5098c5dd8a89c
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.
>>>>>>>>>>>>>>>>>>>>>> nabble.com/Spark-Core-Custom-
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Catalog-Integration-between-
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>> Apache-Ignite-and-Apache-Spark-td22452.html
>>>>>>>>>>>> 
>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Nikolay Izhikov
>>>>>>> NIzhikov.dev@gmail.com
>>>>>>> 
>>>>>>> 
>>>>> 
>> 
>> 


Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Denis,

Agree. I will do the final review in the next few days and merge the code.

-Val

On Tue, Nov 28, 2017 at 5:28 PM, Denis Magda <dm...@apache.org> wrote:

> Guys,
>
> Looking into the parallel discussion about the strategy support, I would
> change my initial stance and support the idea of releasing the integration
> in its current state. Is the code ready to be merged into the master? Let’s
> concentrate on this first and handle the strategy support as a separate
> JIRA task. Agree?
>
> —
> Denis
>
> > On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
> >
> > Nikolay,
> >
> > Let's estimate the strategy implementation work, and then decide whether
> > to merge the code in its current state or not. If anything is unclear,
> > please start a separate discussion.
> >
> > -Val
> >
> > On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <ni...@gmail.com>
> > wrote:
> >
> >> Hello, Val, Denis.
> >>
> >>> Personally, I think that we should release the integration only after
> >>> the strategy is fully supported.
> >>
> >> I see two major reasons to propose merging the DataFrame API
> >> implementation without the custom strategy:
> >>
> >> 1. My PR is relatively huge already. From my experience of interaction
> >> with the Ignite community - the bigger a PR becomes, the more committer
> >> time is required to review it.
> >> So, I propose to move in smaller, but complete, steps here.
> >>
> >> 2. It is not clear to me what exactly "custom strategy and optimization"
> >> includes.
> >> It seems that additional discussion is required.
> >> I think I can put my thoughts on paper and start a discussion right
> >> after the basic implementation is done.
> >>
> >>> Custom strategy implementation is actually very important for this
> >>> integration.
> >>
> >> Understood and fully agreed.
> >> I'm ready to continue working in that area.
> >>
> >> 23.11.2017 02:15, Denis Magda пишет:
> >>
> >> Val, Nikolay,
> >>>
> >>> Personally, I think that we should release the integration only after the
> >>> strategy is fully supported. Without the strategy we don’t really leverage
> >>> Ignite’s SQL engine, and we introduce redundant data movement between
> >>> Ignite and Spark nodes.
> >>>
> >>> How big is the effort to support the strategy in terms of the amount of
> >>> work left? 40%, 60%, 80%?
> >>>
> >>> —
> >>> Denis
> >>>
> >>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
> >>>> valentin.kulichenko@gmail.com> wrote:
> >>>>
> >>>> Nikolay,
> >>>>
> >>>> Custom strategy implementation is actually very important for this
> >>>> integration. Basically, it will allow creating a SQL query for Ignite and
> >>>> executing it directly on the cluster. Your current implementation only
> >>>> adds a new DataSource, which means that Spark will fetch the data into its
> >>>> own memory first and then do most of the work (like joins, for example).
> >>>> Does it make sense to you? Can you please take a look at this and provide
> >>>> your thoughts on how much development is implied there?
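> >>>>
> >>>> To illustrate (the table, column, and option names below are placeholders,
> >>>> not the final API): a join of two Ignite-backed frames like this one is
> >>>> currently planned and executed by Spark's own join operators rather than
> >>>> as a single Ignite SQL JOIN:
> >>>>
> >>>> ```
> >>>> // Both frames are backed by the Ignite data source.
> >>>> val persons = spark.read.format("ignite").option("table", "person").load()
> >>>> val cities  = spark.read.format("ignite").option("table", "city").load()
> >>>>
> >>>> // Spark fetches both sides into its own memory and joins them itself.
> >>>> persons.join(cities, persons("cityId") === cities("id")).show()
> >>>> ```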
> >>>>
> >>>> Current code looks good to me though, and I'm OK if the strategy is
> >>>> implemented as a next step in the scope of a separate ticket. I will do
> >>>> the final review early next week and will merge it if everything is OK.
> >>>>
> >>>> -Val
> >>>>
> >>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <
> nizhikov.dev@gmail.com>
> >>>> wrote:
> >>>>
> >>>> Hello.
> >>>>>
> >>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
> >>>>>> implementations and what is the difference?
> >>>>>
> >>>>> IgniteCatalog is removed.
> >>>>>
> >>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
> >>>>>> set manually on SQLContext each time it's created... Is there any way
> >>>>>> to automate this and improve usability?
> >>>>>
> >>>>> IgniteStrategy and IgniteOptimization are removed as they are empty now.
> >>>>>
> >>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>> SparkSession.builder()...
> >>>>>
> >>>>> IgniteBuilder is added.
> >>>>> The syntax looks like:
> >>>>>
> >>>>> ```
> >>>>> val igniteSession = IgniteSparkSession.builder()
> >>>>>    .appName("Spark Ignite catalog example")
> >>>>>    .master("local")
> >>>>>    .config("spark.executor.instances", "2")
> >>>>>    .igniteConfig(CONFIG)
> >>>>>    .getOrCreate()
> >>>>>
> >>>>> igniteSession.catalog.listTables().show()
> >>>>> ```
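> >>>>>
> >>>>> With the catalog in place, plain Spark SQL should also work through the
> >>>>> same session, e.g. (the table and column names here are illustrative):
> >>>>>
> >>>>> ```
> >>>>> igniteSession.sql("SELECT id, name FROM person WHERE id > 0").show()
> >>>>> ```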
> >>>>>
> >>>>> Please, see updated PR - https://github.com/apache/ignite/pull/2742
> >>>>>
> >>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
> >>>>>
> >>>>>> Hello, Valentin.
> >>>>>>
> >>>>>> My answers are below.
> >>>>>> Dmitry, do we need to move the discussion to Jira?
> >>>>>>
> >>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase?
> >>>>>>
> >>>>>> As I mentioned earlier, to implement and override the Spark Catalog one
> >>>>>> has to use the internal (private) Spark API.
> >>>>>> So I have to use the package `org.spark.sql.***` to have access to
> >>>>>> private classes and variables.
> >>>>>>
> >>>>>> For example, the SharedState class, which stores the link to
> >>>>>> ExternalCatalog, is declared as `private[sql] class SharedState` -
> >>>>>> i.e. package private.
> >>>>>>
> >>>>>>> Can these classes reside under org.apache.ignite.spark instead?
> >>>>>>
> >>>>>> No, as long as we want to have our own implementation of
> >>>>>> ExternalCatalog.
> >>>>>>
> >>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess are
> >>>>>>> some kind of config options. Can you describe the purpose of each of
> >>>>>>> them?
> >>>>>>
> >>>>>> I have extended the comments for these options.
> >>>>>> Please, see my commit [1] or the PR HEAD.
> >>>>>>
> >>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
> >>>>>>> implementations and what is the difference?
> >>>>>>
> >>>>>> Good catch, thank you!
> >>>>>> After additional research I found that only IgniteExternalCatalog is
> >>>>>> required.
> >>>>>> I will update the PR to remove IgniteCatalog in a few days.
> >>>>>>
> >>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are
> >>>>>>> our plans on implementing them? Also, what exactly is planned in
> >>>>>>> IgniteOptimization and what is its purpose?
> >>>>>>
> >>>>>> Actually, this is a very good question :)
> >>>>>> And I need advice from experienced community members here.
> >>>>>>
> >>>>>> The purpose of `IgniteOptimization` is to modify the query plan created
> >>>>>> by Spark. Currently, we have one optimization, described in IGNITE-3084
> >>>>>> [2] by you, Valentin :) :
> >>>>>>
> >>>>>> “If there are non-Ignite relations in the plan, we should fall back to
> >>>>>> native Spark strategies”
> >>>>>>
> >>>>>> I think we can go a little further and reduce a join of two
> >>>>>> Ignite-backed Data Frames into a single Ignite SQL query. Currently,
> >>>>>> this feature is unimplemented.
> >>>>>>
> >>>>>> *Do we need it now? Or can we postpone it and concentrate on the basic
> >>>>>> Data Frame and Catalog implementation?*
> >>>>>>
> >>>>>> The purpose of `Strategy`, as you correctly mentioned in [2], is to
> >>>>>> transform a LogicalPlan into physical operators.
> >>>>>> I don’t have ideas on how to use this opportunity, so I think we don’t
> >>>>>> need IgniteStrategy.
> >>>>>>
> >>>>>> Can you or anyone else suggest some optimization strategy to speed up
> >>>>>> SQL query execution?
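> >>>>>>
> >>>>>> To make the idea concrete, here is a minimal sketch of such a rule - a
> >>>>>> Catalyst optimization that detects a join where both sides are
> >>>>>> Ignite-backed relations. The Ignite-detection check and the push-down
> >>>>>> step are placeholders, not the actual implementation:
> >>>>>>
> >>>>>> ```
> >>>>>> import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
> >>>>>> import org.apache.spark.sql.catalyst.rules.Rule
> >>>>>> import org.apache.spark.sql.execution.datasources.LogicalRelation
> >>>>>>
> >>>>>> object IgniteJoinReduction extends Rule[LogicalPlan] {
> >>>>>>   // Placeholder check: is this leaf produced by the Ignite data source?
> >>>>>>   private def igniteBacked(plan: LogicalPlan): Boolean = plan match {
> >>>>>>     case lr: LogicalRelation =>
> >>>>>>       lr.relation.getClass.getName.startsWith("org.apache.ignite.spark")
> >>>>>>     case _ => false
> >>>>>>   }
> >>>>>>
> >>>>>>   override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
> >>>>>>     case join @ Join(left, right, _, _)
> >>>>>>         if igniteBacked(left) && igniteBacked(right) =>
> >>>>>>       // A real implementation would replace the join with a single
> >>>>>>       // relation that runs one SQL JOIN on the Ignite cluster. Plans
> >>>>>>       // with non-Ignite relations fall through to native Spark.
> >>>>>>       join
> >>>>>>   }
> >>>>>> }
> >>>>>> ```
> >>>>>>
> >>>>>> Such a rule can be registered through
> >>>>>> `spark.experimental.extraOptimizations ++= Seq(IgniteJoinReduction)`.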
> >>>>>>
> >>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
> >>>>>>> set manually on SQLContext each time it's created... Is there any way
> >>>>>>> to automate this and improve usability?
> >>>>>>
> >>>>>> These classes are added to `extraOptimizations` when one uses
> >>>>>> IgniteSparkSession.
> >>>>>> As far as I know, there is no way to automatically add these classes to
> >>>>>> a regular SparkSession.
> >>>>>>
> >>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
> >>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
> >>>>>>> confusing.
> >>>>>>
> >>>>>> The DataFrame API is a *public* Spark API, so anyone can provide an
> >>>>>> implementation and plug it into Spark. That’s why IgniteDataFrameExample
> >>>>>> doesn’t need any Ignite-specific session.
> >>>>>>
> >>>>>> The Catalog API is an *internal* Spark API. There is no way to plug a
> >>>>>> custom catalog implementation into Spark [3]. So we have to use
> >>>>>> `IgniteSparkSession`, which extends the regular SparkSession and
> >>>>>> overrides the links to `ExternalCatalog`.
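> >>>>>>
> >>>>>> For example, reading an Ignite SQL table through the public Data Source
> >>>>>> API needs nothing but a plain SparkSession. A sketch (the format name
> >>>>>> and option keys below are illustrative, not the final API):
> >>>>>>
> >>>>>> ```
> >>>>>> import org.apache.spark.sql.SparkSession
> >>>>>>
> >>>>>> val spark = SparkSession.builder()
> >>>>>>   .appName("Ignite Data Frame example")
> >>>>>>   .master("local")
> >>>>>>   .getOrCreate()
> >>>>>>
> >>>>>> // Plain SparkSession: the Ignite data source plugs in through the
> >>>>>> // public DataSource API; no Ignite-specific session is required.
> >>>>>> val persons = spark.read
> >>>>>>   .format("ignite")                              // illustrative name
> >>>>>>   .option("config", "config/example-ignite.xml") // illustrative keys
> >>>>>>   .option("table", "person")
> >>>>>>   .load()
> >>>>>>
> >>>>>> persons.printSchema()
> >>>>>> ```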
> >>>>>>
> >>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
> >>>>>>> really needed? It looks like we can directly provide the configuration
> >>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
> >>>>>>> create it by itself under the hood.
> >>>>>>
> >>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
> >>>>>> integration for now, so I tried to reuse it here. I like the idea of
> >>>>>> removing the explicit usage of IgniteContext.
> >>>>>> I will implement it in a few days.
> >>>>>>
> >>>>>>> Actually, I think it makes sense to create a builder similar to
> >>>>>>> SparkSession.builder()...
> >>>>>>
> >>>>>> Great idea! I will implement such a builder in a few days.
> >>>>>>
> >>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case
> >>>>>>> when we don't have SQL configured on the Ignite side?
> >>>>>>
> >>>>>> Yes, IgniteCacheRelation is the Data Frame implementation for a
> >>>>>> key-value cache.
> >>>>>>
> >>>>>>> I thought we decided not to support this, no? Or is this something
> >>>>>>> else?
> >>>>>>
> >>>>>> My understanding is as follows:
> >>>>>>
> >>>>>> 1. We can’t support automatic resolution of key-value caches in
> >>>>>> *ExternalCatalog*, because there is no way to reliably detect the key
> >>>>>> and value classes.
> >>>>>>
> >>>>>> 2. We can support key-value caches in the regular Data Frame
> >>>>>> implementation, because we can require the user to provide the key and
> >>>>>> value classes explicitly.
> >>>>>>
> >>>>>>> 8. Can you clarify the query syntax in
> >>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> >>>>>>
> >>>>>> Key-value cache:
> >>>>>>
> >>>>>> key - java.lang.Long,
> >>>>>> value - case class Person(name: String, birthDate: java.util.Date)
> >>>>>>
> >>>>>> The schema of the data frame for this cache is:
> >>>>>>
> >>>>>> key - long
> >>>>>> value.name - string
> >>>>>> value.birthDate - date
> >>>>>>
> >>>>>> So we can select data from the cache:
> >>>>>>
> >>>>>> SELECT
> >>>>>>   key, `value.name`, `value.birthDate`
> >>>>>> FROM
> >>>>>>   testCache
> >>>>>> WHERE key >= 2 AND `value.name` like '%0'
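> >>>>>>
> >>>>>> A sketch of how such a cache could be read, assuming the user supplies
> >>>>>> the key and value classes explicitly (the option keys and the Person
> >>>>>> class location are purely illustrative):
> >>>>>>
> >>>>>> ```
> >>>>>> import org.apache.spark.sql.SparkSession
> >>>>>>
> >>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
> >>>>>>
> >>>>>> // Fields of the value class are flattened into `value.*` columns.
> >>>>>> val df = spark.read
> >>>>>>   .format("ignite")
> >>>>>>   .option("config", "config/example-ignite.xml")
> >>>>>>   .option("cache", "testCache")
> >>>>>>   .option("keyClass", "java.lang.Long")
> >>>>>>   .option("valueClass", "org.example.Person")
> >>>>>>   .load()
> >>>>>>
> >>>>>> df.createOrReplaceTempView("testCache")
> >>>>>>
> >>>>>> spark.sql(
> >>>>>>   "SELECT key, `value.name`, `value.birthDate` " +
> >>>>>>   "FROM testCache " +
> >>>>>>   "WHERE key >= 2 AND `value.name` LIKE '%0'").show()
> >>>>>> ```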
> >>>>>>
> >>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
> >>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
> >>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733

Re: Integration of Spark and Ignite. Prototype.

Posted by Denis Magda <dm...@apache.org>.
Guys,

Looking into the parallel discussion about the strategy support, I would change my initial stance and support the idea of releasing the integration in its current state. Is the code ready to be merged into the master? Let’s concentrate on this first and handle the strategy support as a separate JIRA task. Agree?

—
Denis

> On Nov 27, 2017, at 3:47 PM, Valentin Kulichenko <va...@gmail.com> wrote:
> 
> Nikolay,
> 
> Let's estimate the strategy implementation work, and then decide weather to
> merge the code in current state or not. If anything is unclear, please
> start a separate discussion.
> 
> -Val
> 
> On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <ni...@gmail.com>
> wrote:
> 
>> Hello, Val, Denis.
>> 
>>> Personally, I think that we should release the integration only after
>> the strategy is fully supported.
>> 
>> I see two major reason to propose merge of DataFrame API implementation
>> without custom strategy:
>> 
>> 1. My PR is relatively huge, already. From my experience of interaction
>> with Ignite community - the bigger PR becomes, the more time of commiters
>> required to review PR.
>> So, I propose to move smaller, but complete steps here.
>> 
>> 2. It is not clear for me what exactly includes "custom strategy and
>> optimization".
>> Seems, that additional discussion required.
>> I think, I can put my thoughts on the paper and start discussion right
>> after basic implementation is done.
>> 
>>> Custom strategy implementation is actually very important for this
>> integration.
>> 
>> Understand and fully agreed.
>> I'm ready to continue work in that area.
>> 
>> 23.11.2017 02:15, Denis Magda пишет:
>> 
>> Val, Nikolay,
>>> 
>>> Personally, I think that we should release the integration only after the
>>> strategy is fully supported. Without the strategy we don’t really leverage
>>> from Ignite’s SQL engine and introduce redundant data movement between
>>> Ignite and Spark nodes.
>>> 
>>> How big is the effort to support the strategy in terms of the amount of
>>> work left? 40%, 60%, 80%?
>>> 
>>> —
>>> Denis
>>> 
>>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
>>>> valentin.kulichenko@gmail.com> wrote:
>>>> 
>>>> Nikolay,
>>>> 
>>>> Custom strategy implementation is actually very important for this
>>>> integration. Basically, it will allow to create a SQL query for Ignite
>>>> and
>>>> execute it directly on the cluster. Your current implementation only
>>>> adds a
>>>> new DataSource which means that Spark will fetch data in its own memory
>>>> first, and then do most of the work (like joins for example). Does it
>>>> make
>>>> sense to you? Can you please take a look at this and provide your
>>>> thoughts
>>>> on how much development is implied there?
>>>> 
>>>> Current code looks good to me though and I'm OK if the strategy is
>>>> implemented as a next step in a scope of separate ticket. I will do final
>>>> review early next week and will merge it if everything is OK.
>>>> 
>>>> -Val
>>>> 
>>>> On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <ni...@gmail.com>
>>>> wrote:
>>>> 
>>>> Hello.
>>>>> 
>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>>>>> 
>>>>> implementations and what is the difference?
>>>>> 
>>>>> IgniteCatalog removed.
>>>>> 
>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>> 
>>>>> set manually on SQLContext each time it's created....Is there any way to
>>>>> automate this and improve usability?
>>>>> 
>>>>> IgniteStrategy and IgniteOptimization are removed as it empty now.
>>>>> 
>>>>> Actually, I think it makes sense to create a builder similar to
>>>>>> 
>>>>> SparkSession.builder()...
>>>>> 
>>>>> IgniteBuilder added.
>>>>> Syntax looks like:
>>>>> 
>>>>> ```
>>>>> val igniteSession = IgniteSparkSession.builder()
>>>>>    .appName("Spark Ignite catalog example")
>>>>>    .master("local")
>>>>>    .config("spark.executor.instances", "2")
>>>>>    .igniteConfig(CONFIG)
>>>>>    .getOrCreate()
>>>>> 
>>>>> igniteSession.catalog.listTables().show()
>>>>> ```
>>>>> 
>>>>> Please, see updated PR - https://github.com/apache/ignite/pull/2742
>>>>> 
>>>>> 2017-10-18 20:02 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
>>>>> 
>>>>> Hello, Valentin.
>>>>>> 
>>>>>> My answers is below.
>>>>>> Dmitry, do we need to move discussion to Jira?
>>>>>> 
>>>>>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase?
>>>>>>> 
>>>>>> 
>>>>>> As I mentioned earlier, to implement and override Spark Catalog one
>>>>>> have
>>>>>> to use internal(private) Spark API.
>>>>>> So I have to use package `org.spark.sql.***` to have access to private
>>>>>> class and variables.
>>>>>> 
>>>>>> For example, SharedState class that stores link to ExternalCatalog
>>>>>> declared as `private[sql] class SharedState` - i.e. package private.
>>>>>> 
>>>>>> Can these classes reside under org.apache.ignite.spark instead?
>>>>>>> 
>>>>>> 
>>>>>> No, as long as we want to have our own implementation of
>>>>>> ExternalCatalog.
>>>>>> 
>>>>>> 2. IgniteRelationProvider contains multiple constants which I guess are
>>>>>>> 
>>>>>> some king of config options. Can you describe the purpose of each of
>>>>>> them?
>>>>>> 
>>>>>> I extend comments for this options.
>>>>>> Please, see my commit [1] or PR HEAD:
>>>>>> 
>>>>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>>>>>> 
>>>>>> implementations and what is the difference?
>>>>>> 
>>>>>> Good catch, thank you!
>>>>>> After additional research I founded that only IgniteExternalCatalog
>>>>>> required.
>>>>>> I will update PR with IgniteCatalog remove in a few days.
>>>>>> 
>>>>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are
>>>>>>> 
>>>>>> our plans on implementing them? Also, what exactly is planned in
>>>>>> IgniteOptimization and what is its purpose?
>>>>>> 
>>>>>> Actually, this is very good question :)
>>>>>> And I need advice from experienced community members here:
>>>>>> 
>>>>>> `IgniteOptimization` purpose is to modify query plan created by Spark.
>>>>>> Currently, we have one optimization described in IGNITE-3084 [2] by
>>>>>> you,
>>>>>> Valentin :) :
>>>>>> 
>>>>>> “If there are non-Ignite relations in the plan, we should fall back to
>>>>>> native Spark strategies“
>>>>>> 
>>>>>> I think we can go little further and reduce join of two Ignite backed
>>>>>> Data Frames into single Ignite SQL query. Currently, this feature is
>>>>>> unimplemented.
>>>>>> 
>>>>>> *Do we need it now? Or we can postpone it and concentrates on basic
>>>>>> Data
>>>>>> Frame and Catalog implementation?*
>>>>>> 
>>>>>> The purpose of `Strategy`, as you correctly mentioned in [2], is to
>>>>>> transform a LogicalPlan into physical operators.
>>>>>> I don’t have ideas on how to use this opportunity. So I think we don’t
>>>>>> need IgniteStrategy.
>>>>>> 
>>>>>> Can you or anyone else suggest some optimization strategy to speed up
>>>>>> SQL query execution?
>>>>>> 
>>>>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be
>>>>>>> 
>>>>>> set manually on SQLContext each time it's created....Is there any way
>>>>>> to
>>>>>> automate this and improve usability?
>>>>>> 
>>>>>> These classes are added to `extraOptimizations` when one uses
>>>>>> IgniteSparkSession.
>>>>>> As far as I know, there is no way to automatically add these classes to
>>>>>> a regular SparkSession.
>>>>>> 
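>>>>>> For reference, on a plain SparkSession the registration has to be done by
>>>>>> hand through the public experimental hook (IgniteSparkSession would do the
>>>>>> equivalent internally):
>>>>>> 
>>>>>> ```
>>>>>> spark.experimental.extraOptimizations ++= Seq(IgniteOptimization)
>>>>>> ```
>>>>>> 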
>>>>>> 6. What is the purpose of IgniteSparkSession? I see it's used in
>>>>>>> 
>>>>>> IgniteCatalogExample but not in IgniteDataFrameExample, which is
>>>>>> confusing.
>>>>>> 
>>>>>> The DataFrame API is a *public* Spark API. So anyone can provide an
>>>>>> implementation and plug it into Spark. That’s why IgniteDataFrameExample
>>>>>> doesn’t need any Ignite-specific session.
>>>>>> 
>>>>>> The Catalog API is an *internal* Spark API. There is no way to plug a
>>>>>> custom catalog implementation into Spark [3]. So we have to use
>>>>>> `IgniteSparkSession`, which extends the regular SparkSession and overrides
>>>>>> the links to `ExternalCatalog`.
>>>>>> 
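>>>>>> A rough sketch of the idea (constructor details and the Ignite-backed
>>>>>> SharedState are simplified here; the real class in the PR carries more
>>>>>> state):
>>>>>> 
>>>>>> ```
>>>>>> package org.apache.spark.sql.ignite
>>>>>> 
>>>>>> import org.apache.spark.SparkContext
>>>>>> import org.apache.spark.sql.SparkSession
>>>>>> import org.apache.spark.sql.internal.SharedState
>>>>>> 
>>>>>> class IgniteSparkSession(sc: SparkContext) extends SparkSession(sc) {
>>>>>>   // Overriding is possible only from inside the sql package, because
>>>>>>   // sharedState is private[sql]. The PR substitutes a SharedState whose
>>>>>>   // ExternalCatalog is Ignite-backed.
>>>>>>   override lazy val sharedState: SharedState = new SharedState(sc)
>>>>>> }
>>>>>> ```
>>>>>> 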
>>>>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
>>>>>>> 
>>>>>> really needed? It looks like we can directly provide the configuration
>>>>>> file; if IgniteSparkSession really requires IgniteContext, it can
>>>>>> create it
>>>>>> by itself under the hood.
>>>>>> 
>>>>>> Actually, IgniteContext is the base class for the Ignite <-> Spark
>>>>>> integration for now, so I tried to reuse it here. I like the idea to
>>>>>> remove the explicit usage of IgniteContext.
>>>>>> I will implement it in a few days.
>>>>>> 
>>>>>> Actually, I think it makes sense to create a builder similar to
>>>>>>> 
>>>>>> SparkSession.builder()...
>>>>>> 
>>>>>> Great idea! I will implement such a builder in a few days.
>>>>>> 
>>>>>> 9. Do I understand correctly that IgniteCacheRelation is for the case
>>>>>>> 
>>>>>> when we don't have SQL configured on Ignite side?
>>>>>> 
>>>>>> Yes, IgniteCacheRelation is the Data Frame implementation for a key-value
>>>>>> cache.
>>>>>> 
>>>>>> I thought we decided not to support this, no? Or this is something
>>>>>>> else?
>>>>>>> 
>>>>>> 
>>>>>> My understanding is the following:
>>>>>> 
>>>>>> 1. We can’t support automatic resolving of key-value caches in
>>>>>> *ExternalCatalog*, because there is no way to reliably detect the key and
>>>>>> value classes.
>>>>>> 
>>>>>> 2. We can support key-value caches in the regular Data Frame
>>>>>> implementation, because we can require the user to provide the key and
>>>>>> value classes explicitly.
>>>>>> 
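>>>>>> For example, a read could look like this (the option names are
>>>>>> illustrative placeholders, not the PR's actual constants):
>>>>>> 
>>>>>> ```
>>>>>> val df = spark.read
>>>>>>   .format("ignite")
>>>>>>   .option("config", "config/ignite.xml")
>>>>>>   .option("cache", "testCache")               // key-value cache name
>>>>>>   .option("keyClass", "java.lang.Long")       // provided explicitly
>>>>>>   .option("valueClass", "org.example.Person") // provided explicitly
>>>>>>   .load()
>>>>>> ```
>>>>>> 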
>>>>>> 8. Can you clarify the query syntax in
>>>>>> IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>>>>> 
>>>>>> Key-value cache:
>>>>>> 
>>>>>> key - java.lang.Long,
>>>>>> value - case class Person(name: String, birthDate: java.util.Date)
>>>>>> 
>>>>>> The schema of the data frame for this cache is:
>>>>>> 
>>>>>> key - long
>>>>>> value.name - string
>>>>>> value.birthDate - date
>>>>>> 
>>>>>> So we can select data from the cache:
>>>>>> 
>>>>>> SELECT
>>>>>>  key, `value.name`,  `value.birthDate`
>>>>>> FROM
>>>>>>  testCache
>>>>>> WHERE key >= 2 AND `value.name` like '%0'
>>>>>> 
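>>>>>> Run through the public API, the query above looks like this (a sketch;
>>>>>> `df` is the data frame built over testCache as described):
>>>>>> 
>>>>>> ```
>>>>>> df.createOrReplaceTempView("testCache")
>>>>>> spark.sql(
>>>>>>   """SELECT key, `value.name`, `value.birthDate`
>>>>>>     |FROM testCache
>>>>>>     |WHERE key >= 2 AND `value.name` like '%0'""".stripMargin
>>>>>> ).show()
>>>>>> ```
>>>>>> 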
>>>>>> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6fe
>>>>>> bf417bc59b0519156fd4d09114c8da7
>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCom
>>>>>> mentId=15794210&page=com.atlassian.jira.plugin.system.issuet
>>>>>> abpanels:comment-tabpanel#comment-15794210
>>>>>> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCom
>>>>>> mentId=15543733&page=com.atlassian.jira.plugin.system.issuet
>>>>>> abpanels:comment-tabpanel#comment-15543733
>>>>> 
>>>>> --
>>>>> Nikolay Izhikov
>>>>> NIzhikov.dev@gmail.com
>>>>> 
>>>>> 
>>> 


Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay,

Let's estimate the strategy implementation work, and then decide whether to
merge the code in its current state or not. If anything is unclear, please
start a separate discussion.

-Val

On Fri, Nov 24, 2017 at 5:42 AM, Николай Ижиков <ni...@gmail.com>
wrote:

> Hello, Val, Denis.
>
> > Personally, I think that we should release the integration only after
> > the strategy is fully supported.
>
> I see two major reasons to propose merging the DataFrame API implementation
> without the custom strategy:
>
> 1. My PR is relatively huge already. From my experience of interaction
> with the Ignite community, the bigger a PR becomes, the more committer time
> is required to review it.
> So, I propose to move in smaller, but complete, steps here.
>
> 2. It is not clear to me what exactly "custom strategy and
> optimization" includes.
> It seems that additional discussion is required.
> I think I can put my thoughts on paper and start a discussion right
> after the basic implementation is done.
>
> > Custom strategy implementation is actually very important for this
> > integration.
>
> Understood and fully agreed.
> I'm ready to continue working in that area.
>
> On 23.11.2017 02:15, Denis Magda wrote:
>
> Val, Nikolay,
>>
>> Personally, I think that we should release the integration only after the
>> strategy is fully supported. Without the strategy we don’t really leverage
>> Ignite’s SQL engine, and we introduce redundant data movement between
>> Ignite and Spark nodes.
>>
>> How big is the effort to support the strategy in terms of the amount of
>> work left? 40%, 60%, 80%?
>>
>> —
>> Denis
>>
>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <
>>> valentin.kulichenko@gmail.com> wrote:
>>>
>>> Nikolay,
>>>
>>> Custom strategy implementation is actually very important for this
>>> integration. Basically, it will allow us to create a SQL query for Ignite
>>> and execute it directly on the cluster. Your current implementation only
>>> adds a new DataSource, which means that Spark will fetch data into its own
>>> memory first, and then do most of the work (like joins, for example). Does
>>> it make sense to you? Can you please take a look at this and provide your
>>> thoughts on how much development is implied there?
>>>
>>> The current code looks good to me though, and I'm OK if the strategy is
>>> implemented as a next step in the scope of a separate ticket. I will do final
>>> review early next week and will merge it if everything is OK.
>>>
>>> -Val
>>>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Val, Denis.

 > Personally, I think that we should release the integration only after the strategy is fully supported.

I see two major reasons to propose merging the DataFrame API implementation without the custom strategy:

1. My PR is relatively huge already. From my experience of interaction with the Ignite community, the bigger a PR becomes, the more committer time is required to review it.
So, I propose to move in smaller, but complete, steps here.

2. It is not clear to me what exactly "custom strategy and optimization" includes.
It seems that additional discussion is required.
I think I can put my thoughts on paper and start a discussion right after the basic implementation is done.

 > Custom strategy implementation is actually very important for this integration.

Understood and fully agreed.
I'm ready to continue working in that area.

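To seed that discussion, here is the rough shape such a strategy could take (a hypothetical skeleton, not the PR's code; the helpers named in the comments do not exist yet):

```
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object IgniteStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // If every relation in the plan is Ignite-backed, the whole plan could be
    // compiled into one Ignite SQL query and executed on the cluster:
    // case p if isFullyIgniteBacked(p) => igniteSqlExec(p) :: Nil
    case _ => Nil // non-Ignite relations: fall back to native Spark strategies
  }
}
```

It would be registered via spark.experimental.extraStrategies, the same kind of public hook as the extra optimizations.
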
On 23.11.2017 02:15, Denis Magda wrote:
> Val, Nikolay,
> 
> Personally, I think that we should release the integration only after the strategy is fully supported. Without the strategy we don’t really leverage Ignite’s SQL engine, and we introduce redundant data movement between Ignite and Spark nodes.
> 
> How big is the effort to support the strategy in terms of the amount of work left? 40%, 60%, 80%?
> 
> —
> Denis
> 
>> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <va...@gmail.com> wrote:
>>
>> Nikolay,
>>
>> Custom strategy implementation is actually very important for this
>> integration. Basically, it will allow us to create a SQL query for Ignite and
>> execute it directly on the cluster. Your current implementation only adds a
>> new DataSource, which means that Spark will fetch data into its own memory
>> first, and then do most of the work (like joins, for example). Does it make
>> sense to you? Can you please take a look at this and provide your thoughts
>> on how much development is implied there?
>>
>> The current code looks good to me though, and I'm OK if the strategy is
>> implemented as a next step in the scope of a separate ticket. I will do final
>> review early next week and will merge it if everything is OK.
>>
>> -Val
>>

Re: Integration of Spark and Ignite. Prototype.

Posted by Denis Magda <dm...@apache.org>.
Val, Nikolay,

Personally, I think that we should release the integration only after the strategy is fully supported. Without the strategy we don’t really leverage Ignite’s SQL engine, and we introduce redundant data movement between Ignite and Spark nodes.

How big is the effort to support the strategy in terms of the amount of work left? 40%, 60%, 80%?

—
Denis

> On Nov 22, 2017, at 2:57 PM, Valentin Kulichenko <va...@gmail.com> wrote:
> 
> Nikolay,
> 
> A custom strategy implementation is actually very important for this
> integration. Basically, it will allow us to create a SQL query for Ignite and
> execute it directly on the cluster. Your current implementation only adds a
> new DataSource, which means that Spark will fetch the data into its own memory
> first and then do most of the work (joins, for example). Does it make
> sense to you? Can you please take a look at this and provide your thoughts
> on how much development is implied there?
> 
> The current code looks good to me, though, and I'm OK if the strategy is
> implemented as a next step in the scope of a separate ticket. I will do the
> final review early next week and will merge it if everything is OK.
> 
> -Val
> 


Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay,

A custom strategy implementation is actually very important for this
integration. Basically, it will allow us to create a SQL query for Ignite and
execute it directly on the cluster. Your current implementation only adds a
new DataSource, which means that Spark will fetch the data into its own memory
first and then do most of the work (joins, for example). Does it make
sense to you? Can you please take a look at this and provide your thoughts
on how much development is implied there?

The current code looks good to me, though, and I'm OK if the strategy is
implemented as a next step in the scope of a separate ticket. I will do the
final review early next week and will merge it if everything is OK.

-Val
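
To illustrate the kind of pushdown being discussed, here is a minimal sketch of how such a strategy could plug into Spark 2.x through the public `extraStrategies` hook. The helpers `isIgniteBacked`, `toIgniteSql`, and `igniteSqlExec` are hypothetical placeholders, not the PR's actual API.

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object IgniteSqlStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] =
    if (isIgniteBacked(plan))
      // Emit one physical node that runs a single SQL query on the Ignite
      // cluster instead of fetching rows into Spark's memory first.
      Seq(igniteSqlExec(toIgniteSql(plan)))
    else
      Nil // non-Ignite relations in the plan: fall back to native Spark strategies

  // Hypothetical helpers - sketched, not implemented.
  private def isIgniteBacked(plan: LogicalPlan): Boolean = ???
  private def toIgniteSql(plan: LogicalPlan): String = ???
  private def igniteSqlExec(sql: String): SparkPlan = ???
}

// Registration on an existing session:
//   spark.experimental.extraStrategies = Seq(IgniteSqlStrategy)
```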

On Thu, Oct 19, 2017 at 7:29 AM, Николай Ижиков <ni...@gmail.com>
wrote:

> Hello.
>
> > 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
> implementations and what is the difference?
>
> IgniteCatalog removed.
>
> > 5. I don't like that IgniteStrategy and IgniteOptimization have to be
> set manually on SQLContext each time it's created....Is there any way to
> automate this and improve usability?
>
> IgniteStrategy and IgniteOptimization are removed, as they are empty now.
>
> > Actually, I think it makes sense to create a builder similar to
> SparkSession.builder()...
>
> IgniteBuilder has been added.
> The syntax looks like:
>
> ```
> val igniteSession = IgniteSparkSession.builder()
>     .appName("Spark Ignite catalog example")
>     .master("local")
>     .config("spark.executor.instances", "2")
>     .igniteConfig(CONFIG)
>     .getOrCreate()
>
> igniteSession.catalog.listTables().show()
> ```
>
> Please see the updated PR - https://github.com/apache/ignite/pull/2742
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello.

> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
implementations and what is the difference?

IgniteCatalog removed.

> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set
manually on SQLContext each time it's created....Is there any way to
automate this and improve usability?

IgniteStrategy and IgniteOptimization are removed, as they are empty now.

> Actually, I think it makes sense to create a builder similar to
SparkSession.builder()...

IgniteBuilder has been added.
The syntax looks like:

```
val igniteSession = IgniteSparkSession.builder()
    .appName("Spark Ignite catalog example")
    .master("local")
    .config("spark.executor.instances", "2")
    .igniteConfig(CONFIG)
    .getOrCreate()

igniteSession.catalog.listTables().show()
```

Please see the updated PR - https://github.com/apache/ignite/pull/2742
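
With the catalog in place, Ignite SQL tables should be queryable from plain Spark SQL through the session built above; a short hedged sketch (the `person` table is a hypothetical Ignite SQL table):

```scala
// Tables are resolved transparently through IgniteExternalCatalog.
igniteSession.sql("SELECT id, name FROM person WHERE id > 10").show()
```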

2017-10-18 20:02 GMT+03:00 Николай Ижиков <ni...@gmail.com>:

> Hello, Valentin.
>
> My answers are below.
> Dmitry, do we need to move discussion to Jira?
>
> > 1. Why do we have org.apache.spark.sql.ignite package in our codebase?
>
> As I mentioned earlier, to implement and override Spark Catalog one have
> to use internal(private) Spark API.
> So I have to use package `org.spark.sql.***` to have access to private
> class and variables.
>
> For example, SharedState class that stores link to ExternalCatalog
> declared as `private[sql] class SharedState` - i.e. package private.
>
> > Can these classes reside under org.apache.ignite.spark instead?
>
> No, as long as we want to have our own implementation of ExternalCatalog.
>
> > 2. IgniteRelationProvider contains multiple constants which I guess are
> some king of config options. Can you describe the purpose of each of them?
>
> I extend comments for this options.
> Please, see my commit [1] or PR HEAD:
>
> > 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
> implementations and what is the difference?
>
> Good catch, thank you!
> After additional research I founded that only IgniteExternalCatalog
> required.
> I will update PR with IgniteCatalog remove in a few days.
>
> > 4. IgniteStrategy and IgniteOptimization are currently no-op. What are
> our plans on implementing them? Also, what exactly is planned in
> IgniteOptimization and what is its purpose?
>
> Actually, this is very good question :)
> And I need advice from experienced community members here:
>
> `IgniteOptimization` purpose is to modify query plan created by Spark.
> Currently, we have one optimization described in IGNITE-3084 [2] by you,
> Valentin :) :
>
> “If there are non-Ignite relations in the plan, we should fall back to
> native Spark strategies“
>
> I think we can go little further and reduce join of two Ignite backed Data
> Frames into single Ignite SQL query. Currently, this feature is
> unimplemented.
>
> *Do we need it now? Or we can postpone it and concentrates on basic Data
> Frame and Catalog implementation?*
>
> `Strategy` purpose, as you correctly mentioned in [2], is transform
> LogicalPlan into physical operators.
> I don’t have ideas how to use this opportunity. So I think we don’t need
> IgniteStrategy.
>
> Can you or anyone else suggest some optimization strategy to speed up SQL
> query execution?
>
> > 5. I don't like that IgniteStrategy and IgniteOptimization have to be
> set manually on SQLContext each time it's created....Is there any way to
> automate this and improve usability?
>
> These classes added to `extraOptimizations` when one using
> IgniteSparkSession.
> As far as I know, there is no way to automatically add these classes to
> regular SparkSession.
>
> > 6. What is the purpose of IgniteSparkSession? I see it's used in
> IgniteCatalogExample but not in IgniteDataFrameExample, which is Confusing.
>
> DataFrame API is *public* Spark API. So anyone can provide implementation
> and plug it into Spark. That’s why IgniteDataFrameExample doesn’t need any
> Ignite specific session.
>
> Catalog API is *internal* Spark API. There is no way to plug custom
> catalog implementation into Spark [3]. So we have to use
> `IgniteSparkSession` that extends regular SparkSession and overrides links
> to `ExternalCatalog`.
>
> > 7. To create IgniteSparkSession we first create IgniteContext. Is it
> really needed? It looks like we can directly provide the configuration
> file; if IgniteSparkSession really requires IgniteContext, it can create it
> by itself under the hood.
>
> Actually, IgniteContext is base class for Ignite <-> Spark integration for
> now. So I tried to reuse it here. I like the idea to remove explicit usage
> of IgniteContext.
> Will implement it in a few days.
>
> > Actually, I think it makes sense to create a builder similar to
> SparkSession.builder()...
>
> Great idea! I will implement such builder in a few days.
>
> > 9. Do I understand correctly that IgniteCacheRelation is for the case
> when we don't have SQL configured on Ignite side?
>
> Yes, IgniteCacheRelation is Data Frame implementation for a key-value
> cache.
>
> > I thought we decided not to support this, no? Or this is something else?
>
> My understanding is following:
>
> 1. We can’t support automatic resolving key-value caches in
> *ExternalCatalog*. Because there is no way to reliably detect key and value
> classes.
>
> 2. We can support key-value caches in regular Data Frame implementation.
> Because we can require user to provide key and value classes explicitly.
>
> > 8. Can you clarify the query syntax in IgniteDataFrameExample#nativeS
> parkSqlFromCacheExample2?
>
> Key-value cache:
>
> key - java.lang.Long,
> value - case class Person(name: String, birthDate: java.util.Date)
>
> Schema of data frame for cache is:
>
> key - long
> value.name - string
> value.birthDate - date
>
> So we can select data from data from cache:
>
> SELECT
>   key, `value.name`,  `value.birthDate`
> FROM
>   testCache
> WHERE key >= 2 AND `value.name` like '%0'
>
> [1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6fe
> bf417bc59b0519156fd4d09114c8da7
> [2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCom
> mentId=15794210&page=com.atlassian.jira.plugin.system.issuet
> abpanels:comment-tabpanel#comment-15794210
> [3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCom
> mentId=15543733&page=com.atlassian.jira.plugin.system.issuet
> abpanels:comment-tabpanel#comment-15543733
>
>
> 18.10.2017 04:39, Dmitriy Setrakyan пишет:
>
> Val, thanks for the review. Can I ask you to add the same comments to the
>> ticket?
>>
>> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
>> valentin.kulichenko@gmail.com> wrote:
>>
>> Nikolay, Anton,
>>>
>>> I did a high level review of the code. First of all, impressive results!
>>> However, I have some questions/comments.
>>>
>>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase?
>>> Can
>>> these classes reside under org.apache.ignite.spark instead?
>>> 2. IgniteRelationProvider contains multiple constants which I guess are
>>> some king of config options. Can you describe the purpose of each of
>>> them?
>>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>>> implementations and what is the difference?
>>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are
>>> our
>>> plans on implementing them? Also, what exactly is planned in
>>> IgniteOptimization and what is its purpose?
>>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set
>>> manually on SQLContext each time it's created. This seems to be very
>>> error
>>> prone. Is there any way to automate this and improve usability?
>>> 6. What is the purpose of IgniteSparkSession? I see it's used
>>> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
>>> confusing.
>>> 7. To create IgniteSparkSession we first create IgniteContext. Is it
>>> really
>>> needed? It looks like we can directly provide the configuration file; if
>>> IgniteSparkSession really requires IgniteContext, it can create it by
>>> itself under the hood. Actually, I think it makes sense to create a
>>> builder
>>> similar to SparkSession.builder(), it would be good if our APIs here are
>>> consistent with Spark APIs.
>>> 8. Can you clarify the query syntax
>>> inIgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>>> 9. Do I understand correctly that IgniteCacheRelation is for the case
>>> when
>>> we don't have SQL configured on Ignite side? I thought we decided not to
>>> support this, no? Or this is something else?
>>>
>>> Thanks!
>>>
>>> -Val
>>>
>>> On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <
>>> avinogradov@gridgain.com>
>>> wrote:
>>>
>>> Sounds awesome.
>>>>
>>>> I'll try to review API & tests this week.
>>>>
>>>> Val,
>>>> Your review still required :)
>>>>
>>>> On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <nizhikov.dev@gmail.com
>>>> >
>>>> wrote:
>>>>
>>>> Yes
>>>>>
>>>>> 17 окт. 2017 г. 2:34 PM пользователь "Anton Vinogradov" <
>>>>> avinogradov@gridgain.com> написал:
>>>>>
>>>>> Nikolay,
>>>>>>
>>>>>> So, it will be able to start regular spark and ignite clusters and,
>>>>>>
>>>>> using
>>>>
>>>>> peer classloading via spark-context, perform any DataFrame request,
>>>>>> correct?
>>>>>>
>>>>>> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <
>>>>>>
>>>>> nizhikov.dev@gmail.com>
>>>>
>>>>> wrote:
>>>>>>
>>>>>> Hello, Anton.
>>>>>>>
>>>>>>> An example you provide is a path to a master *local* file.
>>>>>>> These libraries are added to the classpath for each remote node
>>>>>>>
>>>>>> running
>>>>
>>>>> submitted job.
>>>>>>>
>>>>>>> Please, see documentation:
>>>>>>>
>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/
>>>>>>> spark/SparkContext.html#addJar(java.lang.String)
>>>>>>> http://spark.apache.org/docs/latest/api/java/org/apache/
>>>>>>> spark/SparkContext.html#addFile(java.lang.String)
>>>>>>>
>>>>>>>
>>>>>>> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <
>>>>>>>
>>>>>> avinogradov@gridgain.com
>>>>
>>>>> :
>>>>>>
>>>>>>>
>>>>>>> Nikolay,
>>>>>>>>
>>>>>>>> With Data Frame API implementation there are no requirements to
>>>>>>>>>
>>>>>>>> have
>>>>>
>>>>>> any
>>>>>>>
>>>>>>>> Ignite files on spark worker nodes.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What do you mean? I see code like:
>>>>>>>>
>>>>>>>> spark.sparkContext.addJar(MAVEN_HOME +
>>>>>>>> "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
>>>>>>>> core-2.3.0-SNAPSHOT.jar")
>>>>>>>>
>>>>>>>> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
>>>>>>>>
>>>>>>> nizhikov.dev@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello, guys.
>>>>>>>>>
>>>>>>>>> I have created example application to run Ignite Data Frame on
>>>>>>>>>
>>>>>>>> standalone
>>>>>>>
>>>>>>>> Spark cluster.
>>>>>>>>> With Data Frame API implementation there are no requirements to
>>>>>>>>>
>>>>>>>> have
>>>>>
>>>>>> any
>>>>>>>
>>>>>>>> Ignite files on spark worker nodes.
>>>>>>>>>
>>>>>>>>> I ran this application on the free dataset: ATP tennis match
>>>>>>>>>
>>>>>>>> statistics.
>>>>>>>
>>>>>>>>
>>>>>>>>> data - https://github.com/nizhikov/atp_matches
>>>>>>>>> app - https://github.com/nizhikov/ignite-spark-df-example
>>>>>>>>>
>>>>>>>>> Valentin, do you have a chance to look at my changes?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
>>>>>>>>> valentin.kulichenko@gmail.com
>>>>>>>>>
>>>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Nikolay,
>>>>>>>>>>
>>>>>>>>>> Sorry for delay on this, got a little swamped lately. I will
>>>>>>>>>>
>>>>>>>>> do
>>>
>>>> my
>>>>>
>>>>>> best
>>>>>>>
>>>>>>>> to
>>>>>>>>>
>>>>>>>>>> review the code this week.
>>>>>>>>>>
>>>>>>>>>> -Val
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
>>>>>>>>>>
>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hello, Valentin.
>>>>>>>>>>>
>>>>>>>>>>> Did you have a chance to look at my changes?
>>>>>>>>>>>
>>>>>>>>>>> Now I think I have done almost all required features.
>>>>>>>>>>> I want to make some performance test to ensure my
>>>>>>>>>>>
>>>>>>>>>> implementation
>>>>
>>>>> work
>>>>>>>
>>>>>>>> properly with a significant amount of data.
>>>>>>>>>>> And I definitely need some feedback for my changes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <
>>>>>>>>>>>
>>>>>>>>>> nizhikov.dev@gmail.com
>>>>>
>>>>>> :
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>
>>>>>>>>>>>> Which version of Spark do we want to use?
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>>>>>>>>>>
>>>>>>>>>>>>      * Can be run on JDK 7.
>>>>>>>>>>>>      * Still supported: 2.1.2 will be released soon.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Latest Spark version is 2.2.0.
>>>>>>>>>>>>
>>>>>>>>>>>>      * Can be run only on JDK 8+
>>>>>>>>>>>>      * Released Jul 11, 2017.
>>>>>>>>>>>>      * Already supported by huge vendors(Amazon for
>>>>>>>>>>>>
>>>>>>>>>>> example).
>>>
>>>>
>>>>>>>>>>>> Note that in IGNITE-3084 I implement some internal Spark
>>>>>>>>>>>>
>>>>>>>>>>> API.
>>>
>>>> So It will take some effort to switch between Spark 2.1 and
>>>>>>>>>>>>
>>>>>>>>>>> 2.2
>>>>
>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
>>>>>>>>>>>> valentin.kulichenko@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>> I will review in the next few days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <
>>>>>>>>>>>>>
>>>>>>>>>>>> dmagda@apache.org
>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is good news. Finally this capability is coming to
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Ignite.
>>>>>>
>>>>>>>
>>>>>>>>>>>>>> Val, Vladimir, could you do a preliminary review?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Answering on your questions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Yardstick should be enough for performance
>>>>>>>>>>>>>>
>>>>>>>>>>>>> measurements.
>>>>
>>>>> As a
>>>>>>
>>>>>>> Spark
>>>>>>>>>
>>>>>>>>>> user, I will be curious to know what’s the point of this
>>>>>>>>>>>>>>
>>>>>>>>>>>>> integration.
>>>>>>>>>
>>>>>>>>>> Probably we need to compare Spark + Ignite and Spark +
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Hive
>>>>
>>>>> or
>>>>>
>>>>>> Spark +
>>>>>>>>>
>>>>>>>>>> RDBMS cases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. If Spark community is reluctant let’s include the
>>>>>>>>>>>>>>
>>>>>>>>>>>>> module
>>>>
>>>>> in
>>>>>
>>>>>> ignite-spark integration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> —
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> nizhikov.dev@gmail.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello, guys.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently, I’m working on integration between Spark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>
>>>> Ignite
>>>>>>
>>>>>>> [1].
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>>> For now, I implement following:
>>>>>>>>>>>>>>>     * Ignite DataSource implementation(
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IgniteRelationProvider)
>>>>>>
>>>>>>>     * DataFrame support for Ignite SQL table.
>>>>>>>>>>>>>>>     * IgniteCatalog implementation for a transparent
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> resolving
>>>>>>
>>>>>>> of
>>>>>>>
>>>>>>>> ignites
>>>>>>>>>>>>>
>>>>>>>>>>>>>> SQL tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation of it can be found in PR [2]
>>>>>>>>>>>>>>> It would be great if someone provides feedback for a
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> prototype.
>>>>>>>
>>>>>>>>
>>>>>>>>>>>>>>> I made some examples in PR so you can see how API
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> suppose
>>>>
>>>>> to
>>>>>
>>>>>> be
>>>>>>>
>>>>>>>> used [3].
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [4].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I need some advice. Can you help me?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. How should this PR be tested?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Of course, I need to provide some unit tests. But what
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> about
>>>>>
>>>>>> scalability
>>>>>>>>>>>>>
>>>>>>>>>>>>>> tests, etc.
>>>>>>>>>>>>>>> Maybe we need some Yardstick benchmark or similar?
>>>>>>>>>>>>>>> What are your thoughts?
>>>>>>>>>>>>>>> Which scenarios should I consider in the first place?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Should we provide Spark Catalog implementation
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> inside
>>>
>>>> Ignite
>>>>>>>
>>>>>>>> codebase?
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A current implementation of Spark Catalog based on
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *internal
>>>>>
>>>>>> Spark
>>>>>>>>
>>>>>>>>> API*.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Spark community seems not interested in making Catalog
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> API
>>>>
>>>>> public
>>>>>>>>
>>>>>>>>> or
>>>>>>>>>
>>>>>>>>>> including Ignite Catalog in Spark code base [5], [6].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Should we include Spark internal API implementation
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> inside
>>>>>
>>>>>> Ignite
>>>>>>>>
>>>>>>>>> code
>>>>>>>>>>>>>
>>>>>>>>>>>>>> base?*
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or should we consider to include Catalog
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> implementation
>>>
>>>> in
>>>>
>>>>> some
>>>>>>>
>>>>>>>> external
>>>>>>>>>>>>>
>>>>>>>>>>>>>> module?
>>>>>>>>>>>>>>> That will be created and released outside Ignite?(we
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> still
>>>>
>>>>> can
>>>>>>
>>>>>>> support
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> develop it inside Ignite community).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>>>>>>>>>>>>> [2] https://github.com/apache/ignite/pull/2742
>>>>>>>>>>>>>>> [3] https://github.com/apache/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ignite/pull/2742/files#diff-
>>>>>
>>>>>> f4ff509cef3018e221394474775e0905
>>>>>>>>>>>>>>> [4] https://github.com/apache/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ignite/pull/2742/files#diff-
>>>>>
>>>>>> f2b670497d81e780dfd5098c5dd8a89c
>>>>>>>>>>>>>>> [5] http://apache-spark-developers-list.1001551.n3.
>>>>>>>>>>>>>>> nabble.com/Spark-Core-Custom-
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Catalog-Integration-between-
>>>>
>>>>> Apache-Ignite-and-Apache-Spark-td22452.html
>>>>>>>>>>>>>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nikolay Izhikov
>>>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Nikolay Izhikov
>>>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nikolay Izhikov
>>>>>>> NIzhikov.dev@gmail.com
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Valentin.

My answers are below.
Dmitry, do we need to move the discussion to Jira?

 > 1. Why do we have org.apache.spark.sql.ignite package in our codebase?

As I mentioned earlier, to implement and override the Spark Catalog one has to use the internal (private) Spark API.
So I have to use the `org.apache.spark.sql.***` package to get access to private classes and variables.

For example, the SharedState class that stores the link to ExternalCatalog is declared as `private[sql] class SharedState`, i.e. package-private.

 > Can these classes reside under org.apache.ignite.spark instead?

No, as long as we want to have our own implementation of ExternalCatalog.
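
To make the package constraint concrete, here is a minimal sketch, assuming Spark 2.1 where `SparkSession.sharedState` is `private[sql]` (the object name is illustrative):

```
package org.apache.spark.sql.ignite

import org.apache.spark.sql.SparkSession

// Scala's private[sql] grants access from anywhere under the
// org.apache.spark.sql package, including this subpackage, so the
// package-private SharedState and its ExternalCatalog are reachable.
object SharedStateAccess {
  def externalCatalogOf(spark: SparkSession) =
    spark.sharedState.externalCatalog // would not compile outside org.apache.spark.sql
}
```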

 > 2. IgniteRelationProvider contains multiple constants which I guess are some kind of config options. Can you describe the purpose of each of them?

I have extended the comments for these options.
Please see my commit [1] or the PR HEAD.

 > 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog implementations and what is the difference?

Good catch, thank you!
After additional research I found that only IgniteExternalCatalog is required.
I will update the PR with the IgniteCatalog removal in a few days.

 > 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our plans on implementing them? Also, what exactly is planned in IgniteOptimization and what is its purpose?

Actually, this is a very good question :)
And I need advice from experienced community members here:

The purpose of `IgniteOptimization` is to modify the query plan created by Spark.
Currently, we have one optimization, described in IGNITE-3084 [2] by you, Valentin :) :

“If there are non-Ignite relations in the plan, we should fall back to native Spark strategies“

I think we can go a little further and reduce a join of two Ignite-backed Data Frames into a single Ignite SQL query. Currently, this feature is unimplemented.

*Do we need it now? Or can we postpone it and concentrate on the basic Data Frame and Catalog implementation?*
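
To make the idea concrete, a purely illustrative sketch of the plan such an optimization would collapse (table names are hypothetical):

```
// Two Ignite-backed Data Frames, joined on the Spark side today.
val persons = igniteSession.table("person")
val cities  = igniteSession.table("city")

// The proposed optimization would rewrite this logical plan into a single
// Ignite SQL statement, e.g.
//   SELECT p.name, c.name FROM person p JOIN city c ON p.cityId = c.id
// and let Ignite execute the join instead of Spark.
val joined = persons.join(cities, persons("cityId") === cities("id"))
```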

The purpose of a `Strategy`, as you correctly mentioned in [2], is to transform a LogicalPlan into physical operators.
I don't have ideas on how to use this opportunity, so I think we don't need IgniteStrategy.

Can you or anyone else suggest some optimization strategy to speed up SQL query execution?
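
For reference, the contract under discussion as a no-op skeleton on top of Spark's public developer API (`org.apache.spark.sql.Strategy`): a strategy either turns a logical plan node into physical operators or declines with `Nil`:

```
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Declining every node means Spark falls back to its native strategies.
object NoOpIgniteStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}
```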

 > 5. I don't like that IgniteStrategy and IgniteOptimization have to be set manually on SQLContext each time it's created....Is there any way to automate this and improve usability?

These classes are added to `extraOptimizations` when one uses IgniteSparkSession.
As far as I know, there is no way to automatically add these classes to a regular SparkSession.
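
For comparison, a sketch of the manual route on a plain SparkSession through Spark's experimental hooks - exactly the per-session step the review flags as error-prone (`IgniteOptimization` and `IgniteStrategy` stand in for the rule and strategy objects from the PR):

```
// Has to be repeated for every SparkSession the user creates.
spark.experimental.extraOptimizations ++= Seq(IgniteOptimization)
spark.experimental.extraStrategies ++= Seq(IgniteStrategy)
```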

 > 6. What is the purpose of IgniteSparkSession? I see it's used in IgniteCatalogExample but not in IgniteDataFrameExample, which is confusing.

DataFrame API is a *public* Spark API, so anyone can provide an implementation and plug it into Spark. That's why IgniteDataFrameExample doesn't need any Ignite-specific session.
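
For illustration, this is roughly how a stock SparkSession can reach Ignite through the public Data Source API (the format name and option keys follow the PR's examples and may change):

```
val df = spark.read
  .format("ignite")                       // resolved to IgniteRelationProvider
  .option("config", "ignite-config.xml")  // Ignite configuration file (assumed key)
  .option("table", "person")              // Ignite SQL table to expose (assumed key)
  .load()
```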

Catalog API is an *internal* Spark API. There is no way to plug a custom catalog implementation into Spark [3]. So we have to use `IgniteSparkSession`, which extends the regular SparkSession and overrides the link to `ExternalCatalog`.

 > 7. To create IgniteSparkSession we first create IgniteContext. Is it really needed? It looks like we can directly provide the configuration file; if IgniteSparkSession really requires IgniteContext, it can create it by itself under the hood.

Actually, IgniteContext is the base class for the Ignite <-> Spark integration for now, so I tried to reuse it here. I like the idea of removing the explicit usage of IgniteContext.
I will implement it in a few days.

 > Actually, I think it makes sense to create a builder similar to SparkSession.builder()...

Great idea! I will implement such a builder in a few days.

 > 9. Do I understand correctly that IgniteCacheRelation is for the case when we don't have SQL configured on Ignite side?

Yes, IgniteCacheRelation is the Data Frame implementation for a key-value cache.

 > I thought we decided not to support this, no? Or this is something else?

My understanding is the following:

1. We can't support automatic resolving of key-value caches in *ExternalCatalog*, because there is no way to reliably detect key and value classes.

2. We can support key-value caches in the regular Data Frame implementation, because we can require the user to provide key and value classes explicitly (see the sketch below).
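
A sketch of what point 2 could look like from the user side (the option keys and the Person class are illustrative, not the final API):

```
val kvDf = spark.read
  .format("ignite")
  .option("cache", "testCache")                // key-value cache name
  .option("keyClass", "java.lang.Long")        // explicit key class
  .option("valueClass", "org.example.Person")  // explicit value class
  .load()

kvDf.printSchema() // key: long, value.name: string, value.birthDate: date
```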

 > 8. Can you clarify the query syntax in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?

Key-value cache:

key - java.lang.Long,
value - case class Person(name: String, birthDate: java.util.Date)

The schema of the data frame for this cache is:

key - long
value.name - string
value.birthDate - date

So we can select data from the cache:

SELECT
   key, `value.name`,  `value.birthDate`
FROM
   testCache
WHERE key >= 2 AND `value.name` like '%0'
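
For completeness, the same query run through the Data Frame API, continuing the hypothetical `kvDf` sketch above (note the backtick-quoted nested columns):

```
kvDf.createOrReplaceTempView("testCache")
spark.sql(
  "SELECT key, `value.name`, `value.birthDate` " +
  "FROM testCache " +
  "WHERE key >= 2 AND `value.name` LIKE '%0'").show()
```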

[1] https://github.com/apache/ignite/pull/2742/commits/faf3ed6febf417bc59b0519156fd4d09114c8da7
[2] https://issues.apache.org/jira/browse/IGNITE-3084?focusedCommentId=15794210&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15794210
[3] https://issues.apache.org/jira/browse/SPARK-17767?focusedCommentId=15543733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15543733


18.10.2017 04:39, Dmitriy Setrakyan wrote:
> Val, thanks for the review. Can I ask you to add the same comments to the
> ticket?
> 
> On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
> valentin.kulichenko@gmail.com> wrote:
> 
>> Nikolay, Anton,
>>
>> I did a high level review of the code. First of all, impressive results!
>> However, I have some questions/comments.
>>
>> 1. Why do we have org.apache.spark.sql.ignite package in our codebase? Can
>> these classes reside under org.apache.ignite.spark instead?
>> 2. IgniteRelationProvider contains multiple constants which I guess are
>> some king of config options. Can you describe the purpose of each of them?
>> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
>> implementations and what is the difference?
>> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our
>> plans on implementing them? Also, what exactly is planned in
>> IgniteOptimization and what is its purpose?
>> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set
>> manually on SQLContext each time it's created. This seems to be very error
>> prone. Is there any way to automate this and improve usability?
>> 6. What is the purpose of IgniteSparkSession? I see it's used
>> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
>> confusing.
>> 7. To create IgniteSparkSession we first create IgniteContext. Is it really
>> needed? It looks like we can directly provide the configuration file; if
>> IgniteSparkSession really requires IgniteContext, it can create it by
>> itself under the hood. Actually, I think it makes sense to create a builder
>> similar to SparkSession.builder(), it would be good if our APIs here are
>> consistent with Spark APIs.
>> 8. Can you clarify the query syntax
>> in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
>> 9. Do I understand correctly that IgniteCacheRelation is for the case when
>> we don't have SQL configured on Ignite side? I thought we decided not to
>> support this, no? Or this is something else?
>>
>> Thanks!
>>
>> -Val

Re: Integration of Spark and Ignite. Prototype.

Posted by Dmitriy Setrakyan <ds...@apache.org>.
Val, thanks for the review. Can I ask you to add the same comments to the
ticket?

On Tue, Oct 17, 2017 at 3:20 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> Nikolay, Anton,
>
> I did a high level review of the code. First of all, impressive results!
> However, I have some questions/comments.
>
> 1. Why do we have org.apache.spark.sql.ignite package in our codebase? Can
> these classes reside under org.apache.ignite.spark instead?
> 2. IgniteRelationProvider contains multiple constants which I guess are
> some kind of config options. Can you describe the purpose of each of them?
> 3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
> implementations and what is the difference?
> 4. IgniteStrategy and IgniteOptimization are currently no-op. What are our
> plans on implementing them? Also, what exactly is planned in
> IgniteOptimization and what is its purpose?
> 5. I don't like that IgniteStrategy and IgniteOptimization have to be set
> manually on SQLContext each time it's created. This seems to be very error
> prone. Is there any way to automate this and improve usability?
> 6. What is the purpose of IgniteSparkSession? I see it's used
> in IgniteCatalogExample but not in IgniteDataFrameExample, which is
> confusing.
> 7. To create IgniteSparkSession we first create IgniteContext. Is it really
> needed? It looks like we can directly provide the configuration file; if
> IgniteSparkSession really requires IgniteContext, it can create it by
> itself under the hood. Actually, I think it makes sense to create a builder
> similar to SparkSession.builder(), it would be good if our APIs here are
> consistent with Spark APIs.
> 8. Can you clarify the query syntax
> in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
> 9. Do I understand correctly that IgniteCacheRelation is for the case when
> we don't have SQL configured on Ignite side? I thought we decided not to
> support this, no? Or this is something else?
>
> Thanks!
>
> -Val

Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay, Anton,

I did a high level review of the code. First of all, impressive results!
However, I have some questions/comments.

1. Why do we have org.apache.spark.sql.ignite package in our codebase? Can
these classes reside under org.apache.ignite.spark instead?
2. IgniteRelationProvider contains multiple constants which I guess are
some kind of config options. Can you describe the purpose of each of them?
3. IgniteCatalog vs. IgniteExternalCatalog. Why do we have two Catalog
implementations and what is the difference?
4. IgniteStrategy and IgniteOptimization are currently no-op. What are our
plans on implementing them? Also, what exactly is planned in
IgniteOptimization and what is its purpose?
5. I don't like that IgniteStrategy and IgniteOptimization have to be set
manually on SQLContext each time it's created. This seems to be very error
prone. Is there any way to automate this and improve usability?
6. What is the purpose of IgniteSparkSession? I see it's used
in IgniteCatalogExample but not in IgniteDataFrameExample, which is
confusing.
7. To create IgniteSparkSession we first create IgniteContext. Is it really
needed? It looks like we can directly provide the configuration file; if
IgniteSparkSession really requires IgniteContext, it can create it by
itself under the hood. Actually, I think it makes sense to create a builder
similar to SparkSession.builder(), it would be good if our APIs here are
consistent with Spark APIs.
8. Can you clarify the query syntax
in IgniteDataFrameExample#nativeSparkSqlFromCacheExample2?
9. Do I understand correctly that IgniteCacheRelation is for the case when
we don't have SQL configured on Ignite side? I thought we decided not to
support this, no? Or this is something else?

Thanks!

-Val

On Tue, Oct 17, 2017 at 4:40 AM, Anton Vinogradov <av...@gridgain.com>
wrote:

> Sounds awesome.
>
> I'll try to review API & tests this week.
>
> Val,
> Your review still required :)

Re: Integration of Spark and Ignite. Prototype.

Posted by Anton Vinogradov <av...@gridgain.com>.
Sounds awesome.

I'll try to review the API & tests this week.

Val,
Your review is still required :)

On Tue, Oct 17, 2017 at 2:36 PM, Николай Ижиков <ni...@gmail.com>
wrote:

> Yes
>
> On Oct 17, 2017, 2:34 PM, "Anton Vinogradov" <
> avinogradov@gridgain.com> wrote:
>
> > Nikolay,
> >
> > So, it will be able to start regular spark and ignite clusters and, using
> > peer classloading via spark-context, perform any DataFrame request,
> > correct?
> >
> > On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <ni...@gmail.com>
> > wrote:
> >
> > > Hello, Anton.
> > >
> > > An example you provide is a path to a master *local* file.
> > > These libraries are added to the classpath for each remote node running
> > > submitted job.
> > >
> > > Please, see documentation:
> > >
> > > http://spark.apache.org/docs/latest/api/java/org/apache/
> > > spark/SparkContext.html#addJar(java.lang.String)
> > > http://spark.apache.org/docs/latest/api/java/org/apache/
> > > spark/SparkContext.html#addFile(java.lang.String)
> > >
> > >
> > > 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <avinogradov@gridgain.com
> >:
> > >
> > > > Nikolay,
> > > >
> > > > > With Data Frame API implementation there are no requirements to
> have
> > > any
> > > > > Ignite files on spark worker nodes.
> > > >
> > > > What do you mean? I see code like:
> > > >
> > > > spark.sparkContext.addJar(MAVEN_HOME +
> > > > "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
> > > > core-2.3.0-SNAPSHOT.jar")
> > > >
> > > > On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
> > nizhikov.dev@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello, guys.
> > > > >
> > > > > I have created example application to run Ignite Data Frame on
> > > standalone
> > > > > Spark cluster.
> > > > > With Data Frame API implementation there are no requirements to
> have
> > > any
> > > > > Ignite files on spark worker nodes.
> > > > >
> > > > > I ran this application on the free dataset: ATP tennis match
> > > statistics.
> > > > >
> > > > > data - https://github.com/nizhikov/atp_matches
> > > > > app - https://github.com/nizhikov/ignite-spark-df-example
> > > > >
> > > > > Valentin, do you have a chance to look at my changes?
> > > > >
> > > > >
> > > > > 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> > > > > valentin.kulichenko@gmail.com
> > > > > >:
> > > > >
> > > > > > Hi Nikolay,
> > > > > >
> > > > > > Sorry for delay on this, got a little swamped lately. I will do
> my
> > > best
> > > > > to
> > > > > > review the code this week.
> > > > > >
> > > > > > -Val
> > > > > >
> > > > > > On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> > > > nizhikov.dev@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Hello, Valentin.
> > > > > >>
> > > > > >> Did you have a chance to look at my changes?
> > > > > >>
> > > > > >> Now I think I have done almost all required features.
> > > > > >> I want to make some performance test to ensure my implementation
> > > work
> > > > > >> properly with a significant amount of data.
> > > > > >> And I definitely need some feedback for my changes.
> > > > > >>
> > > > > >>
> > > > > >> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <
> nizhikov.dev@gmail.com
> > >:
> > > > > >>
> > > > > >>> Hello, guys.
> > > > > >>>
> > > > > >>> Which version of Spark do we want to use?
> > > > > >>>
> > > > > >>> 1. Currently, Ignite depends on Spark 2.1.0.
> > > > > >>>
> > > > > >>>     * Can be run on JDK 7.
> > > > > >>>     * Still supported: 2.1.2 will be released soon.
> > > > > >>>
> > > > > >>> 2. Latest Spark version is 2.2.0.
> > > > > >>>
> > > > > >>>     * Can be run only on JDK 8+
> > > > > >>>     * Released Jul 11, 2017.
> > > > > >>>     * Already supported by huge vendors(Amazon for example).
> > > > > >>>
> > > > > >>> Note that in IGNITE-3084 I implement some internal Spark API.
> > > > > >>> So It will take some effort to switch between Spark 2.1 and 2.2
> > > > > >>>
> > > > > >>>
> > > > > >>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> > > > > >>> valentin.kulichenko@gmail.com>:
> > > > > >>>
> > > > > >>>> I will review in the next few days.
> > > > > >>>>
> > > > > >>>> -Val
> > > > > >>>>
> > > > > >>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <
> dmagda@apache.org
> > >
> > > > > wrote:
> > > > > >>>>
> > > > > >>>> > Hello Nikolay,
> > > > > >>>> >
> > > > > >>>> > This is good news. Finally this capability is coming to
> > Ignite.
> > > > > >>>> >
> > > > > >>>> > Val, Vladimir, could you do a preliminary review?
> > > > > >>>> >
> > > > > >>>> > Answering on your questions.
> > > > > >>>> >
> > > > > >>>> > 1. Yardstick should be enough for performance measurements.
> > As a
> > > > > Spark
> > > > > >>>> > user, I will be curious to know what’s the point of this
> > > > > integration.
> > > > > >>>> > Probably we need to compare Spark + Ignite and Spark + Hive
> or
> > > > > Spark +
> > > > > >>>> > RDBMS cases.
> > > > > >>>> >
> > > > > >>>> > 2. If Spark community is reluctant let’s include the module
> in
> > > > > >>>> > ignite-spark integration.
> > > > > >>>> >
> > > > > >>>> > —
> > > > > >>>> > Denis
> > > > > >>>> >
> > > > > >>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> > > > > >>>> nizhikov.dev@gmail.com>
> > > > > >>>> > wrote:
> > > > > >>>> > >
> > > > > >>>> > > Hello, guys.
> > > > > >>>> > >
> > > > > >>>> > > Currently, I’m working on integration between Spark and
> > Ignite
> > > > > [1].
> > > > > >>>> > >
> > > > > >>>> > > For now, I implement following:
> > > > > >>>> > >    * Ignite DataSource implementation(
> > IgniteRelationProvider)
> > > > > >>>> > >    * DataFrame support for Ignite SQL table.
> > > > > >>>> > >    * IgniteCatalog implementation for a transparent
> > resolving
> > > of
> > > > > >>>> ignites
> > > > > >>>> > > SQL tables.
> > > > > >>>> > >
> > > > > >>>> > > Implementation of it can be found in PR [2]
> > > > > >>>> > > It would be great if someone provides feedback for a
> > > prototype.
> > > > > >>>> > >
> > > > > >>>> > > I made some examples in PR so you can see how API suppose
> to
> > > be
> > > > > >>>> used [3].
> > > > > >>>> > > [4].
> > > > > >>>> > >
> > > > > >>>> > > I need some advice. Can you help me?
> > > > > >>>> > >
> > > > > >>>> > > 1. How should this PR be tested?
> > > > > >>>> > >
> > > > > >>>> > > Of course, I need to provide some unit tests. But what
> about
> > > > > >>>> scalability
> > > > > >>>> > > tests, etc.
> > > > > >>>> > > Maybe we need some Yardstick benchmark or similar?
> > > > > >>>> > > What are your thoughts?
> > > > > >>>> > > Which scenarios should I consider in the first place?
> > > > > >>>> > >
> > > > > >>>> > > 2. Should we provide Spark Catalog implementation inside
> > > Ignite
> > > > > >>>> codebase?
> > > > > >>>> > >
> > > > > >>>> > > A current implementation of Spark Catalog based on
> *internal
> > > > Spark
> > > > > >>>> API*.
> > > > > >>>> > > Spark community seems not interested in making Catalog API
> > > > public
> > > > > or
> > > > > >>>> > > including Ignite Catalog in Spark code base [5], [6].
> > > > > >>>> > >
> > > > > >>>> > > *Should we include Spark internal API implementation
> inside
> > > > Ignite
> > > > > >>>> code
> > > > > >>>> > > base?*
> > > > > >>>> > >
> > > > > >>>> > > Or should we consider to include Catalog implementation in
> > > some
> > > > > >>>> external
> > > > > >>>> > > module?
> > > > > >>>> > > That will be created and released outside Ignite?(we still
> > can
> > > > > >>>> support
> > > > > >>>> > and
> > > > > >>>> > > develop it inside Ignite community).
> > > > > >>>> > >
> > > > > >>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > > > > >>>> > > [2] https://github.com/apache/ignite/pull/2742
> > > > > >>>> > > [3] https://github.com/apache/
> ignite/pull/2742/files#diff-
> > > > > >>>> > > f4ff509cef3018e221394474775e0905
> > > > > >>>> > > [4] https://github.com/apache/
> ignite/pull/2742/files#diff-
> > > > > >>>> > > f2b670497d81e780dfd5098c5dd8a89c
> > > > > >>>> > > [5] http://apache-spark-developers-list.1001551.n3.
> > > > > >>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> > > > > >>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
> > > > > >>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> > > > > >>>> > >
> > > > > >>>> > > --
> > > > > >>>> > > Nikolay Izhikov
> > > > > >>>> > > NIzhikov.dev@gmail.com
> > > > > >>>> >
> > > > > >>>> >
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Nikolay Izhikov
> > > > > >>> NIzhikov.dev@gmail.com
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Nikolay Izhikov
> > > > > >> NIzhikov.dev@gmail.com
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Nikolay Izhikov
> > > > > NIzhikov.dev@gmail.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Nikolay Izhikov
> > > NIzhikov.dev@gmail.com
> > >
> >
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Yes

On Oct 17, 2017, 2:34 PM, "Anton Vinogradov" <
avinogradov@gridgain.com> wrote:

> Nikolay,
>
> So, it will be able to start regular spark and ignite clusters and, using
> peer classloading via spark-context, perform any DataFrame request,
> correct?
>
> On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <ni...@gmail.com>
> wrote:
>
> > Hello, Anton.
> >
> > An example you provide is a path to a master *local* file.
> > These libraries are added to the classpath for each remote node running
> > submitted job.
> >
> > Please, see documentation:
> >
> > http://spark.apache.org/docs/latest/api/java/org/apache/
> > spark/SparkContext.html#addJar(java.lang.String)
> > http://spark.apache.org/docs/latest/api/java/org/apache/
> > spark/SparkContext.html#addFile(java.lang.String)
> >
> >
> > 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <av...@gridgain.com>:
> >
> > > Nikolay,
> > >
> > > > With Data Frame API implementation there are no requirements to have
> > any
> > > > Ignite files on spark worker nodes.
> > >
> > > What do you mean? I see code like:
> > >
> > > spark.sparkContext.addJar(MAVEN_HOME +
> > > "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
> > > core-2.3.0-SNAPSHOT.jar")
> > >
> > > On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <
> nizhikov.dev@gmail.com>
> > > wrote:
> > >
> > > > Hello, guys.
> > > >
> > > > I have created example application to run Ignite Data Frame on
> > standalone
> > > > Spark cluster.
> > > > With Data Frame API implementation there are no requirements to have
> > any
> > > > Ignite files on spark worker nodes.
> > > >
> > > > I ran this application on the free dataset: ATP tennis match
> > statistics.
> > > >
> > > > data - https://github.com/nizhikov/atp_matches
> > > > app - https://github.com/nizhikov/ignite-spark-df-example
> > > >
> > > > Valentin, do you have a chance to look at my changes?
> > > >
> > > >
> > > > 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> > > > valentin.kulichenko@gmail.com
> > > > >:
> > > >
> > > > > Hi Nikolay,
> > > > >
> > > > > Sorry for delay on this, got a little swamped lately. I will do my
> > best
> > > > to
> > > > > review the code this week.
> > > > >
> > > > > -Val
> > > > >
> > > > > On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> > > nizhikov.dev@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hello, Valentin.
> > > > >>
> > > > >> Did you have a chance to look at my changes?
> > > > >>
> > > > >> Now I think I have done almost all required features.
> > > > >> I want to make some performance test to ensure my implementation
> > work
> > > > >> properly with a significant amount of data.
> > > > >> And I definitely need some feedback for my changes.
> > > > >>
> > > > >>
> > > > >> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <nizhikov.dev@gmail.com
> >:
> > > > >>
> > > > >>> Hello, guys.
> > > > >>>
> > > > >>> Which version of Spark do we want to use?
> > > > >>>
> > > > >>> 1. Currently, Ignite depends on Spark 2.1.0.
> > > > >>>
> > > > >>>     * Can be run on JDK 7.
> > > > >>>     * Still supported: 2.1.2 will be released soon.
> > > > >>>
> > > > >>> 2. Latest Spark version is 2.2.0.
> > > > >>>
> > > > >>>     * Can be run only on JDK 8+
> > > > >>>     * Released Jul 11, 2017.
> > > > >>>     * Already supported by huge vendors(Amazon for example).
> > > > >>>
> > > > >>> Note that in IGNITE-3084 I implement some internal Spark API.
> > > > >>> So It will take some effort to switch between Spark 2.1 and 2.2
> > > > >>>
> > > > >>>
> > > > >>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> > > > >>> valentin.kulichenko@gmail.com>:
> > > > >>>
> > > > >>>> I will review in the next few days.
> > > > >>>>
> > > > >>>> -Val
> > > > >>>>
> > > > >>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dmagda@apache.org
> >
> > > > wrote:
> > > > >>>>
> > > > >>>> > Hello Nikolay,
> > > > >>>> >
> > > > >>>> > This is good news. Finally this capability is coming to
> Ignite.
> > > > >>>> >
> > > > >>>> > Val, Vladimir, could you do a preliminary review?
> > > > >>>> >
> > > > >>>> > Answering on your questions.
> > > > >>>> >
> > > > >>>> > 1. Yardstick should be enough for performance measurements.
> As a
> > > > Spark
> > > > >>>> > user, I will be curious to know what’s the point of this
> > > > integration.
> > > > >>>> > Probably we need to compare Spark + Ignite and Spark + Hive or
> > > > Spark +
> > > > >>>> > RDBMS cases.
> > > > >>>> >
> > > > >>>> > 2. If Spark community is reluctant let’s include the module in
> > > > >>>> > ignite-spark integration.
> > > > >>>> >
> > > > >>>> > —
> > > > >>>> > Denis
> > > > >>>> >
> > > > >>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> > > > >>>> nizhikov.dev@gmail.com>
> > > > >>>> > wrote:
> > > > >>>> > >
> > > > >>>> > > Hello, guys.
> > > > >>>> > >
> > > > >>>> > > Currently, I’m working on integration between Spark and
> Ignite
> > > > [1].
> > > > >>>> > >
> > > > >>>> > > For now, I implement following:
> > > > >>>> > >    * Ignite DataSource implementation(
> IgniteRelationProvider)
> > > > >>>> > >    * DataFrame support for Ignite SQL table.
> > > > >>>> > >    * IgniteCatalog implementation for a transparent
> resolving
> > of
> > > > >>>> ignites
> > > > >>>> > > SQL tables.
> > > > >>>> > >
> > > > >>>> > > Implementation of it can be found in PR [2]
> > > > >>>> > > It would be great if someone provides feedback for a
> > prototype.
> > > > >>>> > >
> > > > >>>> > > I made some examples in PR so you can see how API suppose to
> > be
> > > > >>>> used [3].
> > > > >>>> > > [4].
> > > > >>>> > >
> > > > >>>> > > I need some advice. Can you help me?
> > > > >>>> > >
> > > > >>>> > > 1. How should this PR be tested?
> > > > >>>> > >
> > > > >>>> > > Of course, I need to provide some unit tests. But what about
> > > > >>>> scalability
> > > > >>>> > > tests, etc.
> > > > >>>> > > Maybe we need some Yardstick benchmark or similar?
> > > > >>>> > > What are your thoughts?
> > > > >>>> > > Which scenarios should I consider in the first place?
> > > > >>>> > >
> > > > >>>> > > 2. Should we provide Spark Catalog implementation inside
> > Ignite
> > > > >>>> codebase?
> > > > >>>> > >
> > > > >>>> > > A current implementation of Spark Catalog based on *internal
> > > Spark
> > > > >>>> API*.
> > > > >>>> > > Spark community seems not interested in making Catalog API
> > > public
> > > > or
> > > > >>>> > > including Ignite Catalog in Spark code base [5], [6].
> > > > >>>> > >
> > > > >>>> > > *Should we include Spark internal API implementation inside
> > > Ignite
> > > > >>>> code
> > > > >>>> > > base?*
> > > > >>>> > >
> > > > >>>> > > Or should we consider to include Catalog implementation in
> > some
> > > > >>>> external
> > > > >>>> > > module?
> > > > >>>> > > That will be created and released outside Ignite?(we still
> can
> > > > >>>> support
> > > > >>>> > and
> > > > >>>> > > develop it inside Ignite community).
> > > > >>>> > >
> > > > >>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > > > >>>> > > [2] https://github.com/apache/ignite/pull/2742
> > > > >>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
> > > > >>>> > > f4ff509cef3018e221394474775e0905
> > > > >>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
> > > > >>>> > > f2b670497d81e780dfd5098c5dd8a89c
> > > > >>>> > > [5] http://apache-spark-developers-list.1001551.n3.
> > > > >>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> > > > >>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
> > > > >>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> > > > >>>> > >
> > > > >>>> > > --
> > > > >>>> > > Nikolay Izhikov
> > > > >>>> > > NIzhikov.dev@gmail.com
> > > > >>>> >
> > > > >>>> >
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> Nikolay Izhikov
> > > > >>> NIzhikov.dev@gmail.com
> > > > >>>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Nikolay Izhikov
> > > > >> NIzhikov.dev@gmail.com
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Nikolay Izhikov
> > > > NIzhikov.dev@gmail.com
> > > >
> > >
> >
> >
> >
> > --
> > Nikolay Izhikov
> > NIzhikov.dev@gmail.com
> >
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Anton Vinogradov <av...@gridgain.com>.
Nikolay,

So, it will be possible to start regular Spark and Ignite clusters and,
using peer classloading via the Spark context, perform any DataFrame
request, correct?

On Tue, Oct 17, 2017 at 2:25 PM, Николай Ижиков <ni...@gmail.com>
wrote:

> Hello, Anton.
>
> An example you provide is a path to a master *local* file.
> These libraries are added to the classpath for each remote node running
> submitted job.
>
> Please, see documentation:
>
> http://spark.apache.org/docs/latest/api/java/org/apache/
> spark/SparkContext.html#addJar(java.lang.String)
> http://spark.apache.org/docs/latest/api/java/org/apache/
> spark/SparkContext.html#addFile(java.lang.String)
>
>
> 2017-10-17 13:10 GMT+03:00 Anton Vinogradov <av...@gridgain.com>:
>
> > Nikolay,
> >
> > > With Data Frame API implementation there are no requirements to have
> any
> > > Ignite files on spark worker nodes.
> >
> > What do you mean? I see code like:
> >
> > spark.sparkContext.addJar(MAVEN_HOME +
> > "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
> > core-2.3.0-SNAPSHOT.jar")
> >
> > On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <ni...@gmail.com>
> > wrote:
> >
> > > Hello, guys.
> > >
> > > I have created example application to run Ignite Data Frame on
> standalone
> > > Spark cluster.
> > > With Data Frame API implementation there are no requirements to have
> any
> > > Ignite files on spark worker nodes.
> > >
> > > I ran this application on the free dataset: ATP tennis match
> statistics.
> > >
> > > data - https://github.com/nizhikov/atp_matches
> > > app - https://github.com/nizhikov/ignite-spark-df-example
> > >
> > > Valentin, do you have a chance to look at my changes?
> > >
> > >
> > > 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> > > valentin.kulichenko@gmail.com
> > > >:
> > >
> > > > Hi Nikolay,
> > > >
> > > > Sorry for delay on this, got a little swamped lately. I will do my
> best
> > > to
> > > > review the code this week.
> > > >
> > > > -Val
> > > >
> > > > On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> > nizhikov.dev@gmail.com>
> > > > wrote:
> > > >
> > > >> Hello, Valentin.
> > > >>
> > > >> Did you have a chance to look at my changes?
> > > >>
> > > >> Now I think I have done almost all required features.
> > > >> I want to make some performance test to ensure my implementation
> work
> > > >> properly with a significant amount of data.
> > > >> And I definitely need some feedback for my changes.
> > > >>
> > > >>
> > > >> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
> > > >>
> > > >>> Hello, guys.
> > > >>>
> > > >>> Which version of Spark do we want to use?
> > > >>>
> > > >>> 1. Currently, Ignite depends on Spark 2.1.0.
> > > >>>
> > > >>>     * Can be run on JDK 7.
> > > >>>     * Still supported: 2.1.2 will be released soon.
> > > >>>
> > > >>> 2. Latest Spark version is 2.2.0.
> > > >>>
> > > >>>     * Can be run only on JDK 8+
> > > >>>     * Released Jul 11, 2017.
> > > >>>     * Already supported by huge vendors(Amazon for example).
> > > >>>
> > > >>> Note that in IGNITE-3084 I implement some internal Spark API.
> > > >>> So It will take some effort to switch between Spark 2.1 and 2.2
> > > >>>
> > > >>>
> > > >>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> > > >>> valentin.kulichenko@gmail.com>:
> > > >>>
> > > >>>> I will review in the next few days.
> > > >>>>
> > > >>>> -Val
> > > >>>>
> > > >>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org>
> > > wrote:
> > > >>>>
> > > >>>> > Hello Nikolay,
> > > >>>> >
> > > >>>> > This is good news. Finally this capability is coming to Ignite.
> > > >>>> >
> > > >>>> > Val, Vladimir, could you do a preliminary review?
> > > >>>> >
> > > >>>> > Answering on your questions.
> > > >>>> >
> > > >>>> > 1. Yardstick should be enough for performance measurements. As a
> > > Spark
> > > >>>> > user, I will be curious to know what’s the point of this
> > > integration.
> > > >>>> > Probably we need to compare Spark + Ignite and Spark + Hive or
> > > Spark +
> > > >>>> > RDBMS cases.
> > > >>>> >
> > > >>>> > 2. If Spark community is reluctant let’s include the module in
> > > >>>> > ignite-spark integration.
> > > >>>> >
> > > >>>> > —
> > > >>>> > Denis
> > > >>>> >
> > > >>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> > > >>>> nizhikov.dev@gmail.com>
> > > >>>> > wrote:
> > > >>>> > >
> > > >>>> > > Hello, guys.
> > > >>>> > >
> > > >>>> > > Currently, I’m working on integration between Spark and Ignite
> > > [1].
> > > >>>> > >
> > > >>>> > > For now, I implement following:
> > > >>>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
> > > >>>> > >    * DataFrame support for Ignite SQL table.
> > > >>>> > >    * IgniteCatalog implementation for a transparent resolving
> of
> > > >>>> ignites
> > > >>>> > > SQL tables.
> > > >>>> > >
> > > >>>> > > Implementation of it can be found in PR [2]
> > > >>>> > > It would be great if someone provides feedback for a
> prototype.
> > > >>>> > >
> > > >>>> > > I made some examples in PR so you can see how API suppose to
> be
> > > >>>> used [3].
> > > >>>> > > [4].
> > > >>>> > >
> > > >>>> > > I need some advice. Can you help me?
> > > >>>> > >
> > > >>>> > > 1. How should this PR be tested?
> > > >>>> > >
> > > >>>> > > Of course, I need to provide some unit tests. But what about
> > > >>>> scalability
> > > >>>> > > tests, etc.
> > > >>>> > > Maybe we need some Yardstick benchmark or similar?
> > > >>>> > > What are your thoughts?
> > > >>>> > > Which scenarios should I consider in the first place?
> > > >>>> > >
> > > >>>> > > 2. Should we provide Spark Catalog implementation inside
> Ignite
> > > >>>> codebase?
> > > >>>> > >
> > > >>>> > > A current implementation of Spark Catalog based on *internal
> > Spark
> > > >>>> API*.
> > > >>>> > > Spark community seems not interested in making Catalog API
> > public
> > > or
> > > >>>> > > including Ignite Catalog in Spark code base [5], [6].
> > > >>>> > >
> > > >>>> > > *Should we include Spark internal API implementation inside
> > Ignite
> > > >>>> code
> > > >>>> > > base?*
> > > >>>> > >
> > > >>>> > > Or should we consider to include Catalog implementation in
> some
> > > >>>> external
> > > >>>> > > module?
> > > >>>> > > That will be created and released outside Ignite?(we still can
> > > >>>> support
> > > >>>> > and
> > > >>>> > > develop it inside Ignite community).
> > > >>>> > >
> > > >>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > > >>>> > > [2] https://github.com/apache/ignite/pull/2742
> > > >>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
> > > >>>> > > f4ff509cef3018e221394474775e0905
> > > >>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
> > > >>>> > > f2b670497d81e780dfd5098c5dd8a89c
> > > >>>> > > [5] http://apache-spark-developers-list.1001551.n3.
> > > >>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> > > >>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
> > > >>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> > > >>>> > >
> > > >>>> > > --
> > > >>>> > > Nikolay Izhikov
> > > >>>> > > NIzhikov.dev@gmail.com
> > > >>>> >
> > > >>>> >
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Nikolay Izhikov
> > > >>> NIzhikov.dev@gmail.com
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Nikolay Izhikov
> > > >> NIzhikov.dev@gmail.com
> > > >>
> > > >
> > > >
> > >
> > >
> > > --
> > > Nikolay Izhikov
> > > NIzhikov.dev@gmail.com
> > >
> >
>
>
>
> --
> Nikolay Izhikov
> NIzhikov.dev@gmail.com
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Anton.

The example you refer to passes a path to a file that is *local* to the master.
These libraries are then added to the classpath of each remote node running
the submitted job.

Please see the documentation:

http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addJar(java.lang.String)
http://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#addFile(java.lang.String)
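
To make this concrete, here is a minimal sketch of the driver-side setup.
The master URL and jar paths below are illustrative placeholders, not part
of the PR: the jars live on the driver machine, and SparkContext.addJar()
makes Spark serve each of them to every executor, so the workers need no
Ignite installation.

import org.apache.spark.sql.SparkSession

// Sketch: ship the Ignite jars from the driver to all executors.
// The master URL and file paths are placeholders for illustration.
object ShipIgniteJars {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-df-example")
      .master("spark://master:7077")
      .getOrCreate()

    // Paths are local to the driver; Spark's internal file server
    // distributes each jar to the executors at task launch time.
    Seq(
      "/opt/libs/ignite-core-2.3.0-SNAPSHOT.jar",
      "/opt/libs/ignite-spring-2.3.0-SNAPSHOT.jar",
      "/opt/libs/ignite-indexing-2.3.0-SNAPSHOT.jar"
    ).foreach(spark.sparkContext.addJar)

    spark.stop()
  }
}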


2017-10-17 13:10 GMT+03:00 Anton Vinogradov <av...@gridgain.com>:

> Nikolay,
>
> > With Data Frame API implementation there are no requirements to have any
> > Ignite files on spark worker nodes.
>
> What do you mean? I see code like:
>
> spark.sparkContext.addJar(MAVEN_HOME +
> "/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-
> core-2.3.0-SNAPSHOT.jar")
>
> On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <ni...@gmail.com>
> wrote:
>
> > Hello, guys.
> >
> > I have created example application to run Ignite Data Frame on standalone
> > Spark cluster.
> > With Data Frame API implementation there are no requirements to have any
> > Ignite files on spark worker nodes.
> >
> > I ran this application on the free dataset: ATP tennis match statistics.
> >
> > data - https://github.com/nizhikov/atp_matches
> > app - https://github.com/nizhikov/ignite-spark-df-example
> >
> > Valentin, do you have a chance to look at my changes?
> >
> >
> > 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> > valentin.kulichenko@gmail.com
> > >:
> >
> > > Hi Nikolay,
> > >
> > > Sorry for delay on this, got a little swamped lately. I will do my best
> > to
> > > review the code this week.
> > >
> > > -Val
> > >
> > > On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <
> nizhikov.dev@gmail.com>
> > > wrote:
> > >
> > >> Hello, Valentin.
> > >>
> > >> Did you have a chance to look at my changes?
> > >>
> > >> Now I think I have done almost all required features.
> > >> I want to make some performance test to ensure my implementation work
> > >> properly with a significant amount of data.
> > >> And I definitely need some feedback for my changes.
> > >>
> > >>
> > >> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
> > >>
> > >>> Hello, guys.
> > >>>
> > >>> Which version of Spark do we want to use?
> > >>>
> > >>> 1. Currently, Ignite depends on Spark 2.1.0.
> > >>>
> > >>>     * Can be run on JDK 7.
> > >>>     * Still supported: 2.1.2 will be released soon.
> > >>>
> > >>> 2. Latest Spark version is 2.2.0.
> > >>>
> > >>>     * Can be run only on JDK 8+
> > >>>     * Released Jul 11, 2017.
> > >>>     * Already supported by huge vendors(Amazon for example).
> > >>>
> > >>> Note that in IGNITE-3084 I implement some internal Spark API.
> > >>> So It will take some effort to switch between Spark 2.1 and 2.2
> > >>>
> > >>>
> > >>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> > >>> valentin.kulichenko@gmail.com>:
> > >>>
> > >>>> I will review in the next few days.
> > >>>>
> > >>>> -Val
> > >>>>
> > >>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org>
> > wrote:
> > >>>>
> > >>>> > Hello Nikolay,
> > >>>> >
> > >>>> > This is good news. Finally this capability is coming to Ignite.
> > >>>> >
> > >>>> > Val, Vladimir, could you do a preliminary review?
> > >>>> >
> > >>>> > Answering on your questions.
> > >>>> >
> > >>>> > 1. Yardstick should be enough for performance measurements. As a
> > Spark
> > >>>> > user, I will be curious to know what’s the point of this
> > integration.
> > >>>> > Probably we need to compare Spark + Ignite and Spark + Hive or
> > Spark +
> > >>>> > RDBMS cases.
> > >>>> >
> > >>>> > 2. If Spark community is reluctant let’s include the module in
> > >>>> > ignite-spark integration.
> > >>>> >
> > >>>> > —
> > >>>> > Denis
> > >>>> >
> > >>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> > >>>> nizhikov.dev@gmail.com>
> > >>>> > wrote:
> > >>>> > >
> > >>>> > > Hello, guys.
> > >>>> > >
> > >>>> > > Currently, I’m working on integration between Spark and Ignite
> > [1].
> > >>>> > >
> > >>>> > > For now, I implement following:
> > >>>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
> > >>>> > >    * DataFrame support for Ignite SQL table.
> > >>>> > >    * IgniteCatalog implementation for a transparent resolving of
> > >>>> ignites
> > >>>> > > SQL tables.
> > >>>> > >
> > >>>> > > Implementation of it can be found in PR [2]
> > >>>> > > It would be great if someone provides feedback for a prototype.
> > >>>> > >
> > >>>> > > I made some examples in PR so you can see how API suppose to be
> > >>>> used [3].
> > >>>> > > [4].
> > >>>> > >
> > >>>> > > I need some advice. Can you help me?
> > >>>> > >
> > >>>> > > 1. How should this PR be tested?
> > >>>> > >
> > >>>> > > Of course, I need to provide some unit tests. But what about
> > >>>> scalability
> > >>>> > > tests, etc.
> > >>>> > > Maybe we need some Yardstick benchmark or similar?
> > >>>> > > What are your thoughts?
> > >>>> > > Which scenarios should I consider in the first place?
> > >>>> > >
> > >>>> > > 2. Should we provide Spark Catalog implementation inside Ignite
> > >>>> codebase?
> > >>>> > >
> > >>>> > > A current implementation of Spark Catalog based on *internal
> Spark
> > >>>> API*.
> > >>>> > > Spark community seems not interested in making Catalog API
> public
> > or
> > >>>> > > including Ignite Catalog in Spark code base [5], [6].
> > >>>> > >
> > >>>> > > *Should we include Spark internal API implementation inside
> Ignite
> > >>>> code
> > >>>> > > base?*
> > >>>> > >
> > >>>> > > Or should we consider to include Catalog implementation in some
> > >>>> external
> > >>>> > > module?
> > >>>> > > That will be created and released outside Ignite?(we still can
> > >>>> support
> > >>>> > and
> > >>>> > > develop it inside Ignite community).
> > >>>> > >
> > >>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > >>>> > > [2] https://github.com/apache/ignite/pull/2742
> > >>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
> > >>>> > > f4ff509cef3018e221394474775e0905
> > >>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
> > >>>> > > f2b670497d81e780dfd5098c5dd8a89c
> > >>>> > > [5] http://apache-spark-developers-list.1001551.n3.
> > >>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> > >>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
> > >>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> > >>>> > >
> > >>>> > > --
> > >>>> > > Nikolay Izhikov
> > >>>> > > NIzhikov.dev@gmail.com
> > >>>> >
> > >>>> >
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Nikolay Izhikov
> > >>> NIzhikov.dev@gmail.com
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >> Nikolay Izhikov
> > >> NIzhikov.dev@gmail.com
> > >>
> > >
> > >
> >
> >
> > --
> > Nikolay Izhikov
> > NIzhikov.dev@gmail.com
> >
>



-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Anton Vinogradov <av...@gridgain.com>.
Nikolay,

> With Data Frame API implementation there are no requirements to have any
> Ignite files on spark worker nodes.

What do you mean? I see code like:

spark.sparkContext.addJar(MAVEN_HOME +
"/org/apache/ignite/ignite-core/2.3.0-SNAPSHOT/ignite-core-2.3.0-SNAPSHOT.jar")

On Mon, Oct 16, 2017 at 5:22 PM, Николай Ижиков <ni...@gmail.com>
wrote:

> Hello, guys.
>
> I have created example application to run Ignite Data Frame on standalone
> Spark cluster.
> With Data Frame API implementation there are no requirements to have any
> Ignite files on spark worker nodes.
>
> I ran this application on the free dataset: ATP tennis match statistics.
>
> data - https://github.com/nizhikov/atp_matches
> app - https://github.com/nizhikov/ignite-spark-df-example
>
> Valentin, do you have a chance to look at my changes?
>
>
> 2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <
> valentin.kulichenko@gmail.com
> >:
>
> > Hi Nikolay,
> >
> > Sorry for delay on this, got a little swamped lately. I will do my best
> to
> > review the code this week.
> >
> > -Val
> >
> > On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <ni...@gmail.com>
> > wrote:
> >
> >> Hello, Valentin.
> >>
> >> Did you have a chance to look at my changes?
> >>
> >> Now I think I have done almost all required features.
> >> I want to make some performance test to ensure my implementation work
> >> properly with a significant amount of data.
> >> And I definitely need some feedback for my changes.
> >>
> >>
> >> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
> >>
> >>> Hello, guys.
> >>>
> >>> Which version of Spark do we want to use?
> >>>
> >>> 1. Currently, Ignite depends on Spark 2.1.0.
> >>>
> >>>     * Can be run on JDK 7.
> >>>     * Still supported: 2.1.2 will be released soon.
> >>>
> >>> 2. Latest Spark version is 2.2.0.
> >>>
> >>>     * Can be run only on JDK 8+
> >>>     * Released Jul 11, 2017.
> >>>     * Already supported by huge vendors(Amazon for example).
> >>>
> >>> Note that in IGNITE-3084 I implement some internal Spark API.
> >>> So It will take some effort to switch between Spark 2.1 and 2.2
> >>>
> >>>
> >>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> >>> valentin.kulichenko@gmail.com>:
> >>>
> >>>> I will review in the next few days.
> >>>>
> >>>> -Val
> >>>>
> >>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org>
> wrote:
> >>>>
> >>>> > Hello Nikolay,
> >>>> >
> >>>> > This is good news. Finally this capability is coming to Ignite.
> >>>> >
> >>>> > Val, Vladimir, could you do a preliminary review?
> >>>> >
> >>>> > Answering on your questions.
> >>>> >
> >>>> > 1. Yardstick should be enough for performance measurements. As a
> Spark
> >>>> > user, I will be curious to know what’s the point of this
> integration.
> >>>> > Probably we need to compare Spark + Ignite and Spark + Hive or
> Spark +
> >>>> > RDBMS cases.
> >>>> >
> >>>> > 2. If Spark community is reluctant let’s include the module in
> >>>> > ignite-spark integration.
> >>>> >
> >>>> > —
> >>>> > Denis
> >>>> >
> >>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
> >>>> nizhikov.dev@gmail.com>
> >>>> > wrote:
> >>>> > >
> >>>> > > Hello, guys.
> >>>> > >
> >>>> > > Currently, I’m working on integration between Spark and Ignite
> [1].
> >>>> > >
> >>>> > > For now, I implement following:
> >>>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
> >>>> > >    * DataFrame support for Ignite SQL table.
> >>>> > >    * IgniteCatalog implementation for a transparent resolving of
> >>>> ignites
> >>>> > > SQL tables.
> >>>> > >
> >>>> > > Implementation of it can be found in PR [2]
> >>>> > > It would be great if someone provides feedback for a prototype.
> >>>> > >
> >>>> > > I made some examples in PR so you can see how API suppose to be
> >>>> used [3].
> >>>> > > [4].
> >>>> > >
> >>>> > > I need some advice. Can you help me?
> >>>> > >
> >>>> > > 1. How should this PR be tested?
> >>>> > >
> >>>> > > Of course, I need to provide some unit tests. But what about
> >>>> scalability
> >>>> > > tests, etc.
> >>>> > > Maybe we need some Yardstick benchmark or similar?
> >>>> > > What are your thoughts?
> >>>> > > Which scenarios should I consider in the first place?
> >>>> > >
> >>>> > > 2. Should we provide Spark Catalog implementation inside Ignite
> >>>> codebase?
> >>>> > >
> >>>> > > A current implementation of Spark Catalog based on *internal Spark
> >>>> API*.
> >>>> > > Spark community seems not interested in making Catalog API public
> or
> >>>> > > including Ignite Catalog in Spark code base [5], [6].
> >>>> > >
> >>>> > > *Should we include Spark internal API implementation inside Ignite
> >>>> code
> >>>> > > base?*
> >>>> > >
> >>>> > > Or should we consider to include Catalog implementation in some
> >>>> external
> >>>> > > module?
> >>>> > > That will be created and released outside Ignite?(we still can
> >>>> support
> >>>> > and
> >>>> > > develop it inside Ignite community).
> >>>> > >
> >>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> >>>> > > [2] https://github.com/apache/ignite/pull/2742
> >>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
> >>>> > > f4ff509cef3018e221394474775e0905
> >>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
> >>>> > > f2b670497d81e780dfd5098c5dd8a89c
> >>>> > > [5] http://apache-spark-developers-list.1001551.n3.
> >>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> >>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
> >>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> >>>> > >
> >>>> > > --
> >>>> > > Nikolay Izhikov
> >>>> > > NIzhikov.dev@gmail.com
> >>>> >
> >>>> >
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Nikolay Izhikov
> >>> NIzhikov.dev@gmail.com
> >>>
> >>
> >>
> >>
> >> --
> >> Nikolay Izhikov
> >> NIzhikov.dev@gmail.com
> >>
> >
> >
>
>
> --
> Nikolay Izhikov
> NIzhikov.dev@gmail.com
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, guys.

I have created an example application that runs Ignite Data Frames on a
standalone Spark cluster.
With the Data Frame API implementation there is no requirement to have any
Ignite files on the Spark worker nodes.

I ran this application on a free dataset: ATP tennis match statistics.

data - https://github.com/nizhikov/atp_matches
app - https://github.com/nizhikov/ignite-spark-df-example
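
The example boils down to roughly the following sketch. The "ignite" format
name and the option keys mirror the prototype (PR 2742) and may still
change; the config path and column names are illustrative placeholders.

import org.apache.spark.sql.SparkSession

// Sketch of the example app: read an Ignite SQL table as a DataFrame
// through IgniteRelationProvider and run a query over it.
object IgniteDataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignite-spark-df-example")
      .getOrCreate()

    val matches = spark.read
      .format("ignite")                      // data source short name (assumed)
      .option("config", "ignite-config.xml") // Ignite client config (assumed key)
      .option("table", "atp_matches")        // Ignite SQL table (assumed key)
      .load()

    matches.createOrReplaceTempView("atp_matches")
    spark.sql(
      "SELECT winner_name, COUNT(*) AS wins " +
      "FROM atp_matches GROUP BY winner_name ORDER BY wins DESC"
    ).show(10)

    spark.stop()
  }
}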

Valentin, do you have a chance to look at my changes?


2017-10-12 6:03 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com
>:

> Hi Nikolay,
>
> Sorry for delay on this, got a little swamped lately. I will do my best to
> review the code this week.
>
> -Val
>
> On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <ni...@gmail.com>
> wrote:
>
>> Hello, Valentin.
>>
>> Did you have a chance to look at my changes?
>>
>> Now I think I have done almost all required features.
>> I want to make some performance test to ensure my implementation work
>> properly with a significant amount of data.
>> And I definitely need some feedback for my changes.
>>
>>
>> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
>>
>>> Hello, guys.
>>>
>>> Which version of Spark do we want to use?
>>>
>>> 1. Currently, Ignite depends on Spark 2.1.0.
>>>
>>>     * Can be run on JDK 7.
>>>     * Still supported: 2.1.2 will be released soon.
>>>
>>> 2. Latest Spark version is 2.2.0.
>>>
>>>     * Can be run only on JDK 8+
>>>     * Released Jul 11, 2017.
>>>     * Already supported by huge vendors(Amazon for example).
>>>
>>> Note that in IGNITE-3084 I implement some internal Spark API.
>>> So It will take some effort to switch between Spark 2.1 and 2.2
>>>
>>>
>>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
>>> valentin.kulichenko@gmail.com>:
>>>
>>>> I will review in the next few days.
>>>>
>>>> -Val
>>>>
>>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org> wrote:
>>>>
>>>> > Hello Nikolay,
>>>> >
>>>> > This is good news. Finally this capability is coming to Ignite.
>>>> >
>>>> > Val, Vladimir, could you do a preliminary review?
>>>> >
>>>> > Answering on your questions.
>>>> >
>>>> > 1. Yardstick should be enough for performance measurements. As a Spark
>>>> > user, I will be curious to know what’s the point of this integration.
>>>> > Probably we need to compare Spark + Ignite and Spark + Hive or Spark +
>>>> > RDBMS cases.
>>>> >
>>>> > 2. If Spark community is reluctant let’s include the module in
>>>> > ignite-spark integration.
>>>> >
>>>> > —
>>>> > Denis
>>>> >
>>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <
>>>> nizhikov.dev@gmail.com>
>>>> > wrote:
>>>> > >
>>>> > > Hello, guys.
>>>> > >
>>>> > > Currently, I’m working on integration between Spark and Ignite [1].
>>>> > >
>>>> > > For now, I implement following:
>>>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
>>>> > >    * DataFrame support for Ignite SQL table.
>>>> > >    * IgniteCatalog implementation for a transparent resolving of
>>>> ignites
>>>> > > SQL tables.
>>>> > >
>>>> > > Implementation of it can be found in PR [2]
>>>> > > It would be great if someone provides feedback for a prototype.
>>>> > >
>>>> > > I made some examples in PR so you can see how API suppose to be
>>>> used [3].
>>>> > > [4].
>>>> > >
>>>> > > I need some advice. Can you help me?
>>>> > >
>>>> > > 1. How should this PR be tested?
>>>> > >
>>>> > > Of course, I need to provide some unit tests. But what about
>>>> scalability
>>>> > > tests, etc.
>>>> > > Maybe we need some Yardstick benchmark or similar?
>>>> > > What are your thoughts?
>>>> > > Which scenarios should I consider in the first place?
>>>> > >
>>>> > > 2. Should we provide Spark Catalog implementation inside Ignite
>>>> codebase?
>>>> > >
>>>> > > A current implementation of Spark Catalog based on *internal Spark
>>>> API*.
>>>> > > Spark community seems not interested in making Catalog API public or
>>>> > > including Ignite Catalog in Spark code base [5], [6].
>>>> > >
>>>> > > *Should we include Spark internal API implementation inside Ignite
>>>> code
>>>> > > base?*
>>>> > >
>>>> > > Or should we consider to include Catalog implementation in some
>>>> external
>>>> > > module?
>>>> > > That will be created and released outside Ignite?(we still can
>>>> support
>>>> > and
>>>> > > develop it inside Ignite community).
>>>> > >
>>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>>> > > [2] https://github.com/apache/ignite/pull/2742
>>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
>>>> > > f4ff509cef3018e221394474775e0905
>>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
>>>> > > f2b670497d81e780dfd5098c5dd8a89c
>>>> > > [5] http://apache-spark-developers-list.1001551.n3.
>>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
>>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
>>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
>>>> > >
>>>> > > --
>>>> > > Nikolay Izhikov
>>>> > > NIzhikov.dev@gmail.com
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Nikolay Izhikov
>>> NIzhikov.dev@gmail.com
>>>
>>
>>
>>
>> --
>> Nikolay Izhikov
>> NIzhikov.dev@gmail.com
>>
>
>


-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
Hi Nikolay,

Sorry for the delay on this; I got a little swamped lately. I will do my best to
review the code this week.

-Val

On Mon, Oct 9, 2017 at 11:48 AM, Николай Ижиков <ni...@gmail.com>
wrote:

> Hello, Valentin.
>
> Did you have a chance to look at my changes?
>
> Now I think I have done almost all required features.
> I want to make some performance test to ensure my implementation work
> properly with a significant amount of data.
> And I definitely need some feedback for my changes.
>
>
> 2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:
>
>> Hello, guys.
>>
>> Which version of Spark do we want to use?
>>
>> 1. Currently, Ignite depends on Spark 2.1.0.
>>
>>     * Can be run on JDK 7.
>>     * Still supported: 2.1.2 will be released soon.
>>
>> 2. Latest Spark version is 2.2.0.
>>
>>     * Can be run only on JDK 8+
>>     * Released Jul 11, 2017.
>>     * Already supported by huge vendors(Amazon for example).
>>
>> Note that in IGNITE-3084 I implement some internal Spark API.
>> So It will take some effort to switch between Spark 2.1 and 2.2
>>
>>
>> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
>> valentin.kulichenko@gmail.com>:
>>
>>> I will review in the next few days.
>>>
>>> -Val
>>>
>>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org> wrote:
>>>
>>> > Hello Nikolay,
>>> >
>>> > This is good news. Finally this capability is coming to Ignite.
>>> >
>>> > Val, Vladimir, could you do a preliminary review?
>>> >
>>> > Answering on your questions.
>>> >
>>> > 1. Yardstick should be enough for performance measurements. As a Spark
>>> > user, I will be curious to know what’s the point of this integration.
>>> > Probably we need to compare Spark + Ignite and Spark + Hive or Spark +
>>> > RDBMS cases.
>>> >
>>> > 2. If Spark community is reluctant let’s include the module in
>>> > ignite-spark integration.
>>> >
>>> > —
>>> > Denis
>>> >
>>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <nizhikov.dev@gmail.com
>>> >
>>> > wrote:
>>> > >
>>> > > Hello, guys.
>>> > >
>>> > > Currently, I’m working on integration between Spark and Ignite [1].
>>> > >
>>> > > For now, I implement following:
>>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
>>> > >    * DataFrame support for Ignite SQL table.
>>> > >    * IgniteCatalog implementation for a transparent resolving of
>>> ignites
>>> > > SQL tables.
>>> > >
>>> > > Implementation of it can be found in PR [2]
>>> > > It would be great if someone provides feedback for a prototype.
>>> > >
>>> > > I made some examples in PR so you can see how API suppose to be used
>>> [3].
>>> > > [4].
>>> > >
>>> > > I need some advice. Can you help me?
>>> > >
>>> > > 1. How should this PR be tested?
>>> > >
>>> > > Of course, I need to provide some unit tests. But what about
>>> scalability
>>> > > tests, etc.
>>> > > Maybe we need some Yardstick benchmark or similar?
>>> > > What are your thoughts?
>>> > > Which scenarios should I consider in the first place?
>>> > >
>>> > > 2. Should we provide Spark Catalog implementation inside Ignite
>>> codebase?
>>> > >
>>> > > A current implementation of Spark Catalog based on *internal Spark
>>> API*.
>>> > > Spark community seems not interested in making Catalog API public or
>>> > > including Ignite Catalog in Spark code base [5], [6].
>>> > >
>>> > > *Should we include Spark internal API implementation inside Ignite
>>> code
>>> > > base?*
>>> > >
>>> > > Or should we consider to include Catalog implementation in some
>>> external
>>> > > module?
>>> > > That will be created and released outside Ignite?(we still can
>>> support
>>> > and
>>> > > develop it inside Ignite community).
>>> > >
>>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
>>> > > [2] https://github.com/apache/ignite/pull/2742
>>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
>>> > > f4ff509cef3018e221394474775e0905
>>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
>>> > > f2b670497d81e780dfd5098c5dd8a89c
>>> > > [5] http://apache-spark-developers-list.1001551.n3.
>>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
>>> > > Apache-Ignite-and-Apache-Spark-td22452.html
>>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
>>> > >
>>> > > --
>>> > > Nikolay Izhikov
>>> > > NIzhikov.dev@gmail.com
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Nikolay Izhikov
>> NIzhikov.dev@gmail.com
>>
>
>
>
> --
> Nikolay Izhikov
> NIzhikov.dev@gmail.com
>

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Valentin.

Did you have a chance to look at my changes?

Now I think I have implemented almost all the required features.
I want to run some performance tests to ensure my implementation works
properly with a significant amount of data.
And I definitely need some feedback on my changes.


2017-10-09 18:45 GMT+03:00 Николай Ижиков <ni...@gmail.com>:

> Hello, guys.
>
> Which version of Spark do we want to use?
>
> 1. Currently, Ignite depends on Spark 2.1.0.
>
>     * Can be run on JDK 7.
>     * Still supported: 2.1.2 will be released soon.
>
> 2. Latest Spark version is 2.2.0.
>
>     * Can be run only on JDK 8+
>     * Released Jul 11, 2017.
>     * Already supported by huge vendors(Amazon for example).
>
> Note that in IGNITE-3084 I implement some internal Spark API.
> So It will take some effort to switch between Spark 2.1 and 2.2
>
>
> 2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <
> valentin.kulichenko@gmail.com>:
>
>> I will review in the next few days.
>>
>> -Val
>>
>> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org> wrote:
>>
>> > Hello Nikolay,
>> >
>> > This is good news. Finally this capability is coming to Ignite.
>> >
>> > Val, Vladimir, could you do a preliminary review?
>> >
>> > Answering on your questions.
>> >
>> > 1. Yardstick should be enough for performance measurements. As a Spark
>> > user, I will be curious to know what’s the point of this integration.
>> > Probably we need to compare Spark + Ignite and Spark + Hive or Spark +
>> > RDBMS cases.
>> >
>> > 2. If Spark community is reluctant let’s include the module in
>> > ignite-spark integration.
>> >
>> > —
>> > Denis
>> >
>> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <ni...@gmail.com>
>> > wrote:
>> > >
>> > > Hello, guys.
>> > >
>> > > Currently, I’m working on integration between Spark and Ignite [1].
>> > >
>> > > For now, I implement following:
>> > >    * Ignite DataSource implementation(IgniteRelationProvider)
>> > >    * DataFrame support for Ignite SQL table.
>> > >    * IgniteCatalog implementation for a transparent resolving of
>> ignites
>> > > SQL tables.
>> > >
>> > > Implementation of it can be found in PR [2]
>> > > It would be great if someone provides feedback for a prototype.
>> > >
>> > > I made some examples in PR so you can see how API suppose to be used
>> [3].
>> > > [4].
>> > >
>> > > I need some advice. Can you help me?
>> > >
>> > > 1. How should this PR be tested?
>> > >
>> > > Of course, I need to provide some unit tests. But what about
>> scalability
>> > > tests, etc.
>> > > Maybe we need some Yardstick benchmark or similar?
>> > > What are your thoughts?
>> > > Which scenarios should I consider in the first place?
>> > >
>> > > 2. Should we provide Spark Catalog implementation inside Ignite
>> codebase?
>> > >
>> > > A current implementation of Spark Catalog based on *internal Spark
>> API*.
>> > > Spark community seems not interested in making Catalog API public or
>> > > including Ignite Catalog in Spark code base [5], [6].
>> > >
>> > > *Should we include Spark internal API implementation inside Ignite
>> code
>> > > base?*
>> > >
>> > > Or should we consider to include Catalog implementation in some
>> external
>> > > module?
>> > > That will be created and released outside Ignite?(we still can support
>> > and
>> > > develop it inside Ignite community).
>> > >
>> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
>> > > [2] https://github.com/apache/ignite/pull/2742
>> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
>> > > f4ff509cef3018e221394474775e0905
>> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
>> > > f2b670497d81e780dfd5098c5dd8a89c
>> > > [5] http://apache-spark-developers-list.1001551.n3.
>> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
>> > > Apache-Ignite-and-Apache-Spark-td22452.html
>> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
>> > >
>> > > --
>> > > Nikolay Izhikov
>> > > NIzhikov.dev@gmail.com
>> >
>> >
>>
>
>
>
> --
> Nikolay Izhikov
> NIzhikov.dev@gmail.com
>



-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, guys.

Which version of Spark do we want to use?

1. Currently, Ignite depends on Spark 2.1.0.

    * Can be run on JDK 7.
    * Still supported: 2.1.2 will be released soon.

2. Latest Spark version is 2.2.0.

    * Can be run only on JDK 8+
    * Released Jul 11, 2017.
    * Already supported by major vendors (Amazon, for example).

Note that in IGNITE-3084 I implement some internal Spark APIs,
so it will take some effort to switch between Spark 2.1 and 2.2.
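
For illustration, the version pin itself is a one-line change in a build
file. The sbt sketch below is purely illustrative (Ignite itself builds
with Maven); the source-level changes for the internal APIs are separate
work.

// build.sbt sketch (illustrative only).
// Moving from Spark 2.1 to 2.2 is a one-line version change here,
// but 2.2.x additionally requires JDK 8+.
val sparkVersion = "2.1.0" // or "2.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)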


2017-09-27 2:20 GMT+03:00 Valentin Kulichenko <valentin.kulichenko@gmail.com
>:

> I will review in the next few days.
>
> -Val
>
> On Tue, Sep 26, 2017 at 2:23 PM, Denis Magda <dm...@apache.org> wrote:
>
> > Hello Nikolay,
> >
> > This is good news. Finally this capability is coming to Ignite.
> >
> > Val, Vladimir, could you do a preliminary review?
> >
> > Answering on your questions.
> >
> > 1. Yardstick should be enough for performance measurements. As a Spark
> > user, I will be curious to know what’s the point of this integration.
> > Probably we need to compare Spark + Ignite and Spark + Hive or Spark +
> > RDBMS cases.
> >
> > 2. If Spark community is reluctant let’s include the module in
> > ignite-spark integration.
> >
> > —
> > Denis
> >
> > > On Sep 25, 2017, at 11:14 AM, Николай Ижиков <ni...@gmail.com>
> > wrote:
> > >
> > > Hello, guys.
> > >
> > > Currently, I’m working on integration between Spark and Ignite [1].
> > >
> > > For now, I implement following:
> > >    * Ignite DataSource implementation(IgniteRelationProvider)
> > >    * DataFrame support for Ignite SQL table.
> > >    * IgniteCatalog implementation for a transparent resolving of
> ignites
> > > SQL tables.
> > >
> > > Implementation of it can be found in PR [2]
> > > It would be great if someone provides feedback for a prototype.
> > >
> > > I made some examples in PR so you can see how API suppose to be used
> [3].
> > > [4].
> > >
> > > I need some advice. Can you help me?
> > >
> > > 1. How should this PR be tested?
> > >
> > > Of course, I need to provide some unit tests. But what about
> scalability
> > > tests, etc.
> > > Maybe we need some Yardstick benchmark or similar?
> > > What are your thoughts?
> > > Which scenarios should I consider in the first place?
> > >
> > > 2. Should we provide Spark Catalog implementation inside Ignite
> codebase?
> > >
> > > A current implementation of Spark Catalog based on *internal Spark
> API*.
> > > Spark community seems not interested in making Catalog API public or
> > > including Ignite Catalog in Spark code base [5], [6].
> > >
> > > *Should we include Spark internal API implementation inside Ignite code
> > > base?*
> > >
> > > Or should we consider to include Catalog implementation in some
> external
> > > module?
> > > That will be created and released outside Ignite?(we still can support
> > and
> > > develop it inside Ignite community).
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-3084
> > > [2] https://github.com/apache/ignite/pull/2742
> > > [3] https://github.com/apache/ignite/pull/2742/files#diff-
> > > f4ff509cef3018e221394474775e0905
> > > [4] https://github.com/apache/ignite/pull/2742/files#diff-
> > > f2b670497d81e780dfd5098c5dd8a89c
> > > [5] http://apache-spark-developers-list.1001551.n3.
> > > nabble.com/Spark-Core-Custom-Catalog-Integration-between-
> > > Apache-Ignite-and-Apache-Spark-td22452.html
> > > [6] https://issues.apache.org/jira/browse/SPARK-17767
> > >
> > > --
> > > Nikolay Izhikov
> > > NIzhikov.dev@gmail.com
> >
> >
>



-- 
Nikolay Izhikov
NIzhikov.dev@gmail.com

Re: Integration of Spark and Ignite. Prototype.

Posted by Valentin Kulichenko <va...@gmail.com>.
I will review in the next few days.

-Val


Re: Integration of Spark and Ignite. Prototype.

Posted by Denis Magda <dm...@apache.org>.
Hello Nikolay,

This is good news. Finally, this capability is coming to Ignite.

Val, Vladimir, could you do a preliminary review?

Answering your questions.

1. Yardstick should be enough for performance measurements. As a Spark user, I would be curious to see what this integration actually buys me, so we should probably compare the Spark + Ignite case against Spark + Hive and Spark + RDBMS.
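For instance, a minimal sketch of such a comparison, running the same
aggregation through Ignite and through a JDBC-backed RDBMS. The "ignite" format
name, the option keys, and the sample schema are assumptions for illustration;
see the examples in the PR for the actual prototype API:

import org.apache.spark.sql.{DataFrame, SparkSession}

object IgniteVsRdbmsSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("ignite-vs-rdbms")
    .getOrCreate()

  // Force execution and print a rough wall-clock timing for a query plan.
  def timed(label: String)(df: => DataFrame): Unit = {
    val start = System.nanoTime()
    df.count()
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  }

  timed("ignite") {
    spark.read.format("ignite")              // assumed short name for IgniteRelationProvider
      .option("config", "ignite-config.xml") // assumed key for the Ignite configuration file
      .option("table", "person")             // hypothetical sample table
      .load()
      .groupBy("city_id").count()
  }

  timed("rdbms") {
    spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost/test") // any RDBMS holding the same data
      .option("dbtable", "person")
      .load()
      .groupBy("city_id").count()
  }
}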

2. If the Spark community is reluctant, let's include the module in the ignite-spark integration.
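That would also keep consumption simple: if the catalog ships with the existing
ignite-spark module, users pick everything up with one dependency, e.g. in sbt
(the version shown is illustrative):

libraryDependencies += "org.apache.ignite" % "ignite-spark" % "2.3.0"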

—
Denis
 