Posted to dev@ignite.apache.org by Nikolay Izhikov <ni...@apache.org> on 2018/10/20 21:09:13 UTC

[DISCUSSION] Spark Data Frame through Thin Client

Hello, Igniters.

Currently, the Spark Data Frame integration is implemented via a client node connection.
Whenever we need to retrieve data into a Spark worker (or master) from Ignite, we start a client node.

It has several major disadvantages:

	1. We have to copy the whole Ignite distribution onto each Spark worker [1].
	2. We have to copy the whole Ignite distribution onto the Spark master to make the catalogue work.
	3. We have to provide the same absolute path to the Ignite configuration file on every worker during data frame construction [2] (see the sketch after this list).
	4. We have to additionally configure the Spark workers' classpath to include the Ignite libraries.
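
To illustrate #3, here is roughly how a data frame is constructed today with the client-node based integration; this is just a sketch, and the config path and table name are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.ignite.spark.IgniteDataFrameSettings._

val spark = SparkSession.builder()
  .appName("ignite-df-example") // placeholder application name
  .getOrCreate()

// The same absolute path must exist on the driver and on every worker.
val personDf = spark.read
  .format(FORMAT_IGNITE)
  .option(OPTION_CONFIG_FILE, "/opt/ignite/config/default-config.xml") // placeholder path
  .option(OPTION_TABLE, "person")                                      // placeholder table
  .load()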

For now, almost every operation we need for the Spark Data Frame integration is supported by the Java Thin Client (a sketch follows this list):
	* obtain the list of caches.
	* get a cache configuration.
	* execute SQL queries.
	* stream data to a table - not supported by the thin client for now, but it can be implemented using simple SQL INSERT statements.
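
For reference, below is a rough sketch of how these operations could look with the Java Thin Client; the address, cache name and table are placeholders, and streaming is emulated with a plain INSERT:

import scala.collection.JavaConverters._
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.ClientConfiguration

// Only the server node addresses are required (placeholder host:port).
val client = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800"))

// Obtain the list of caches.
val caches = client.cacheNames().asScala

// Get a cache configuration (placeholder cache name).
val personCfg = client.cache("Person").getConfiguration

// Execute an SQL query.
val rows = client.query(new SqlFieldsQuery("SELECT id, name FROM Person")).getAll.asScala

// Stream data to the table via a simple SQL INSERT.
client.query(new SqlFieldsQuery("INSERT INTO Person(id, name) VALUES(?, ?)").setArgs(Long.box(1L), "John")).getAll()

client.close()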

Advantages of using the Java Thin Client in the Spark integration (they are all well-known Java Thin Client advantages):
	1. Easy to configure: only the IP addresses of the server nodes are required.
	2. Easy to deploy: only 1 additional jar is required. No server-side (Ignite worker) configuration is required.

I propose implementing the Spark Data Frame integration through the Java Thin Client.

Thoughts?

[1] https://apacheignite-fs.readme.io/docs/installation-deployment
[2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Stephen Darlington <st...@gridgain.com>.
Ignite doesn’t currently support Spark Structured Streaming:

https://issues.apache.org/jira/browse/IGNITE-9357

There’s a working patch associated with it.

Regards,
Stephen

> On 22 Oct 2018, at 10:43, Nikolay Izhikov <ni...@apache.org> wrote:
> 
> Hello, Stephen.
> 
> I suggest thin client deployment as a second option together with existing integration that use Client Node.
> 
>> I’m thinking specifically about better support for Spark Streaming, where the lack  of continuous query support in thin clients removes a significant optimisation option. 
> 
> It's very interesting.
> Can you share you thoughts?
> What can be improved in Spark integration?
> 
> On Mon, 22/10/2018 at 10:22 +0100, Stephen Darlington wrote:
>> Are you suggesting making the Thin Client deployment an option or as a replacement for the thick-client? If the latter, do we risk making future desirable changes more difficult (or impossible)? I’m thinking specifically about better support for Spark Streaming, where the lack  of continuous query support in thin clients removes a significant optimisation option. I’m sure there are other use cases.
>> 
>> Regards,
>> Stephen
>> 
>>> On 21 Oct 2018, at 09:08, Nikolay Izhikov <ni...@apache.org> wrote:
>>> 
>>> Valentin.
>>> 
>>> Seems, You made several suggestions, which is not always true, from my point of view:
>>> 
>>> 1. "We have access to Spark cluster installation to perform deployment steps" - this is not true in cloud or enterprise environment.
>>> 
>>> 2. "Spark cluster is used only for Ignite integration".
>>> From what I know computational resources for big Spark cluster is divided by many business divisions.
>>> And it is not convenient to perform some deployment steps on this cluster.
>>> 
>>> 3. "When Ignite + Spark are used in real production it's OK to have reasonable deployment overhead"
>>> What about developer who want to play with this integration?
>>> And want to do it quickly to see how it works in real life examples.
>>> Can we do his life much easier?
>>> 
>>>> First of all, they will exist with thin client either.
>>> 
>>> Spark have an ability to deploy jars on worker and add it to application tasks classpath.
>>> For 2.6 we must deploy 11 additional jars to start using Ignite.
>>> Please, see my example on the bottom of documentation page [1]
>>> 
>>> Does cache-api-1.0.0.jar and h2-1.4.195.jar seems like obvious dependencies for Ignite integration for you?
>>> And for our users? :)
>>> 
>>> Actually, list of dependencies will be changed in 2.7 - new version of jcache, new version of h2
>>> So user should change it in code or perform additional deployment steps.
>>> 
>>> It overkill for me.
>>> 
>>> On the other hand - thin client requires only 1 jar.
>>> Moreover, thin client protocol have the backward compatibility.
>>> So thin client will perform correctly when Ignite cluster will be updated from 2.6 to 2.7.
>>> So, with Spark integration via thin client we will be able to update Ignite cluster and Spark integration separately.
>>> For now, we should do it in one big step.
>>> 
>>> What do you think?
>>> 
>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>> 
>>> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
>>>> Guys,
>>>> 
>>>> From my experience, Ignite and Spark clusters typically run in the same
>>>> environment, which makes client node a more preferable option. Mainly,
>>>> because of performance. BTW, I doubt partition-awareness on thin client
>>>> will help either, because in dataframes we only run SQL queries and I
>>>> believe thin client will execute them through a proxy anyway. But correct
>>>> me if I’m wrong.
>>>> 
>>>> Either way, it sounds like we just have usability issues with Ignite/Spark
>>>> integration. Why don’t we concentrate on fixing them then? For example, #3
>>>> can be fixed by loading XML content on master and then distributing it to
>>>> workers, instead of loading on every worker independently. Then there are
>>>> certain procedures like deploying JARs, etc. First of all, they will exist
>>>> with thin client either. Second of all, I’m sure there are ways to simplify
>>>> this procedures and make integration easier. My opinion is that working on
>>>> such improvements is going to add more value than another implementation
>>>> based on thin client.
>>>> 
>>>> -Val
>>>> 
>>>> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
>>>> 
>>>>> Hello Nikolay,
>>>>> 
>>>>> Your proposal sounds reasonable. However, I would suggest us to wait while
>>>>> partition-awareness is supported for Java thin client first. With that
>>>>> feature, the client can connect to any node directly while presently all
>>>>> the communication goes through a proxy (a node the client is connected to).
>>>>> All of that is bad for performance.
>>>>> 
>>>>> 
>>>>> Vladimir, how hard would it be to support the partition-awareness for Java
>>>>> client? Probably, Nikolay can take over.
>>>>> 
>>>>> --
>>>>> Denis
>>>>> 
>>>>> 
>>>>> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Hello, Igniters.
>>>>>> 
>>>>>> Currently, Spark Data Frame integration implemented via client node
>>>>>> connection.
>>>>>> Whenever we need to retrieve some data into Spark worker(or master) from
>>>>>> Ignite we start a client node.
>>>>>> 
>>>>>> It has several major disadvantages:
>>>>>> 
>>>>>>       1. We should copy whole Ignite distribution on to each Spark
>>>>>> worker [1]
>>>>>>       2. We should copy whole Ignite distribution on to Spark master to
>>>>>> get catalogue works.
>>>>>>       3. We should have the same absolute path to Ignite configuration
>>>>>> file on every worker and provide it during data frame construction [2]
>>>>>>       4. We should additionally configure Spark workerks classpath to
>>>>>> include Ignite libraries.
>>>>>> 
>>>>>> For now, almost all operation we need to do in Spark Data Frame
>>>>>> integration is supported by Java Thin Client.
>>>>>>       * obtain the list of caches.
>>>>>>       * get cache configuration.
>>>>>>       * execute SQL query.
>>>>>>       * stream data to the table - don't support by the thin client for
>>>>>> now, but can be implemented using simple SQL INSERT statements.
>>>>>> 
>>>>>> Advantages of usage Java Thin Client in Spark integration(they all known
>>>>>> from Java Thin Client advantages):
>>>>>>       1. Easy to configure: only IP addresses of server nodes are
>>>>>> required.
>>>>>>       2. Easy to deploy: only 1 additional jar required. No server
>>>>>> side(Ignite worker) configuration required.
>>>>>> 
>>>>>> I propose to implement Spark Data Frame integration through Java Thin
>>>>>> Client.
>>>>>> 
>>>>>> Thoughts?
>>>>>> 
>>>>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>>>> [2]
>>>>>> 
>>>>> 
>>>>> https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
>>>>>> 
>> 
>> 



Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Nikolay Izhikov <ni...@apache.org>.
Hello, Stephen.

I suggest the thin client deployment as a second option, alongside the existing integration that uses a client node.

> I’m thinking specifically about better support for Spark Streaming, where the lack  of continuous query support in thin clients removes a significant optimisation option. 

That's very interesting.
Can you share your thoughts?
What can be improved in the Spark integration?

On Mon, 22/10/2018 at 10:22 +0100, Stephen Darlington wrote:
> Are you suggesting making the Thin Client deployment an option or as a replacement for the thick-client? If the latter, do we risk making future desirable changes more difficult (or impossible)? I’m thinking specifically about better support for Spark Streaming, where the lack  of continuous query support in thin clients removes a significant optimisation option. I’m sure there are other use cases.
> 
> Regards,
> Stephen
> 
> > On 21 Oct 2018, at 09:08, Nikolay Izhikov <ni...@apache.org> wrote:
> > 
> > Valentin.
> > 
> > Seems, You made several suggestions, which is not always true, from my point of view:
> > 
> > 1. "We have access to Spark cluster installation to perform deployment steps" - this is not true in cloud or enterprise environment.
> > 
> > 2. "Spark cluster is used only for Ignite integration".
> > From what I know computational resources for big Spark cluster is divided by many business divisions.
> > And it is not convenient to perform some deployment steps on this cluster.
> > 
> > 3. "When Ignite + Spark are used in real production it's OK to have reasonable deployment overhead"
> > What about developer who want to play with this integration?
> > And want to do it quickly to see how it works in real life examples.
> > Can we do his life much easier?
> > 
> > > First of all, they will exist with thin client either.
> > 
> > Spark have an ability to deploy jars on worker and add it to application tasks classpath.
> > For 2.6 we must deploy 11 additional jars to start using Ignite.
> > Please, see my example on the bottom of documentation page [1]
> > 
> > Does cache-api-1.0.0.jar and h2-1.4.195.jar seems like obvious dependencies for Ignite integration for you?
> > And for our users? :)
> > 
> > Actually, list of dependencies will be changed in 2.7 - new version of jcache, new version of h2
> > So user should change it in code or perform additional deployment steps.
> > 
> > It overkill for me.
> > 
> > On the other hand - thin client requires only 1 jar.
> > Moreover, thin client protocol have the backward compatibility.
> > So thin client will perform correctly when Ignite cluster will be updated from 2.6 to 2.7.
> > So, with Spark integration via thin client we will be able to update Ignite cluster and Spark integration separately.
> > For now, we should do it in one big step.
> > 
> > What do you think?
> > 
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > 
> > On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> > > Guys,
> > > 
> > > From my experience, Ignite and Spark clusters typically run in the same
> > > environment, which makes client node a more preferable option. Mainly,
> > > because of performance. BTW, I doubt partition-awareness on thin client
> > > will help either, because in dataframes we only run SQL queries and I
> > > believe thin client will execute them through a proxy anyway. But correct
> > > me if I’m wrong.
> > > 
> > > Either way, it sounds like we just have usability issues with Ignite/Spark
> > > integration. Why don’t we concentrate on fixing them then? For example, #3
> > > can be fixed by loading XML content on master and then distributing it to
> > > workers, instead of loading on every worker independently. Then there are
> > > certain procedures like deploying JARs, etc. First of all, they will exist
> > > with thin client either. Second of all, I’m sure there are ways to simplify
> > > this procedures and make integration easier. My opinion is that working on
> > > such improvements is going to add more value than another implementation
> > > based on thin client.
> > > 
> > > -Val
> > > 
> > > On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
> > > 
> > > > Hello Nikolay,
> > > > 
> > > > Your proposal sounds reasonable. However, I would suggest us to wait while
> > > > partition-awareness is supported for Java thin client first. With that
> > > > feature, the client can connect to any node directly while presently all
> > > > the communication goes through a proxy (a node the client is connected to).
> > > > All of that is bad for performance.
> > > > 
> > > > 
> > > > Vladimir, how hard would it be to support the partition-awareness for Java
> > > > client? Probably, Nikolay can take over.
> > > > 
> > > > --
> > > > Denis
> > > > 
> > > > 
> > > > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
> > > > wrote:
> > > > 
> > > > > Hello, Igniters.
> > > > > 
> > > > > Currently, Spark Data Frame integration implemented via client node
> > > > > connection.
> > > > > Whenever we need to retrieve some data into Spark worker(or master) from
> > > > > Ignite we start a client node.
> > > > > 
> > > > > It has several major disadvantages:
> > > > > 
> > > > >        1. We should copy whole Ignite distribution on to each Spark
> > > > > worker [1]
> > > > >        2. We should copy whole Ignite distribution on to Spark master to
> > > > > get catalogue works.
> > > > >        3. We should have the same absolute path to Ignite configuration
> > > > > file on every worker and provide it during data frame construction [2]
> > > > >        4. We should additionally configure Spark workerks classpath to
> > > > > include Ignite libraries.
> > > > > 
> > > > > For now, almost all operation we need to do in Spark Data Frame
> > > > > integration is supported by Java Thin Client.
> > > > >        * obtain the list of caches.
> > > > >        * get cache configuration.
> > > > >        * execute SQL query.
> > > > >        * stream data to the table - don't support by the thin client for
> > > > > now, but can be implemented using simple SQL INSERT statements.
> > > > > 
> > > > > Advantages of usage Java Thin Client in Spark integration(they all known
> > > > > from Java Thin Client advantages):
> > > > >        1. Easy to configure: only IP addresses of server nodes are
> > > > > required.
> > > > >        2. Easy to deploy: only 1 additional jar required. No server
> > > > > side(Ignite worker) configuration required.
> > > > > 
> > > > > I propose to implement Spark Data Frame integration through Java Thin
> > > > > Client.
> > > > > 
> > > > > Thoughts?
> > > > > 
> > > > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > > > [2]
> > > > > 
> > > > 
> > > > https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> > > > > 
> 
> 

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Stephen Darlington <st...@gridgain.com>.
Are you suggesting making the Thin Client deployment an option or a replacement for the thick client? If the latter, do we risk making future desirable changes more difficult (or impossible)? I’m thinking specifically about better support for Spark Streaming, where the lack of continuous query support in thin clients removes a significant optimisation option. I’m sure there are other use cases.
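
To make the continuous query point concrete, here is a rough sketch (cache name, types and config path are placeholders) of the kind of thing only a thick client - a client node - can do today:

import scala.collection.JavaConverters._
import javax.cache.event.{CacheEntryEvent, CacheEntryUpdatedListener}
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.ContinuousQuery

// Start a client node; the Spring config path is a placeholder.
Ignition.setClientMode(true)
val ignite = Ignition.start("/opt/ignite/config/client-config.xml")
val cache = ignite.getOrCreateCache[Long, String]("Person") // placeholder cache

val qry = new ContinuousQuery[Long, String]()
qry.setLocalListener(new CacheEntryUpdatedListener[Long, String] {
  override def onUpdated(evts: java.lang.Iterable[CacheEntryEvent[_ <: Long, _ <: String]]): Unit =
    evts.asScala.foreach(e => println(s"Updated: ${e.getKey} -> ${e.getValue}"))
})

// The cursor stays open and keeps delivering updates until it is closed.
val cursor = cache.query(qry)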

Regards,
Stephen

> On 21 Oct 2018, at 09:08, Nikolay Izhikov <ni...@apache.org> wrote:
> 
> Valentin.
> 
> Seems, You made several suggestions, which is not always true, from my point of view:
> 
> 1. "We have access to Spark cluster installation to perform deployment steps" - this is not true in cloud or enterprise environment.
> 
> 2. "Spark cluster is used only for Ignite integration".
> From what I know computational resources for big Spark cluster is divided by many business divisions.
> And it is not convenient to perform some deployment steps on this cluster.
> 
> 3. "When Ignite + Spark are used in real production it's OK to have reasonable deployment overhead"
> What about developer who want to play with this integration?
> And want to do it quickly to see how it works in real life examples.
> Can we do his life much easier?
> 
>> First of all, they will exist with thin client either.
> 
> Spark have an ability to deploy jars on worker and add it to application tasks classpath.
> For 2.6 we must deploy 11 additional jars to start using Ignite.
> Please, see my example on the bottom of documentation page [1]
> 
> Does cache-api-1.0.0.jar and h2-1.4.195.jar seems like obvious dependencies for Ignite integration for you?
> And for our users? :)
> 
> Actually, list of dependencies will be changed in 2.7 - new version of jcache, new version of h2
> So user should change it in code or perform additional deployment steps.
> 
> It overkill for me.
> 
> On the other hand - thin client requires only 1 jar.
> Moreover, thin client protocol have the backward compatibility.
> So thin client will perform correctly when Ignite cluster will be updated from 2.6 to 2.7.
> So, with Spark integration via thin client we will be able to update Ignite cluster and Spark integration separately.
> For now, we should do it in one big step.
> 
> What do you think?
> 
> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> 
> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
>> Guys,
>> 
>> From my experience, Ignite and Spark clusters typically run in the same
>> environment, which makes client node a more preferable option. Mainly,
>> because of performance. BTW, I doubt partition-awareness on thin client
>> will help either, because in dataframes we only run SQL queries and I
>> believe thin client will execute them through a proxy anyway. But correct
>> me if I’m wrong.
>> 
>> Either way, it sounds like we just have usability issues with Ignite/Spark
>> integration. Why don’t we concentrate on fixing them then? For example, #3
>> can be fixed by loading XML content on master and then distributing it to
>> workers, instead of loading on every worker independently. Then there are
>> certain procedures like deploying JARs, etc. First of all, they will exist
>> with thin client either. Second of all, I’m sure there are ways to simplify
>> this procedures and make integration easier. My opinion is that working on
>> such improvements is going to add more value than another implementation
>> based on thin client.
>> 
>> -Val
>> 
>> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
>> 
>>> Hello Nikolay,
>>> 
>>> Your proposal sounds reasonable. However, I would suggest us to wait while
>>> partition-awareness is supported for Java thin client first. With that
>>> feature, the client can connect to any node directly while presently all
>>> the communication goes through a proxy (a node the client is connected to).
>>> All of that is bad for performance.
>>> 
>>> 
>>> Vladimir, how hard would it be to support the partition-awareness for Java
>>> client? Probably, Nikolay can take over.
>>> 
>>> --
>>> Denis
>>> 
>>> 
>>> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
>>> wrote:
>>> 
>>>> Hello, Igniters.
>>>> 
>>>> Currently, Spark Data Frame integration implemented via client node
>>>> connection.
>>>> Whenever we need to retrieve some data into Spark worker(or master) from
>>>> Ignite we start a client node.
>>>> 
>>>> It has several major disadvantages:
>>>> 
>>>>        1. We should copy whole Ignite distribution on to each Spark
>>>> worker [1]
>>>>        2. We should copy whole Ignite distribution on to Spark master to
>>>> get catalogue works.
>>>>        3. We should have the same absolute path to Ignite configuration
>>>> file on every worker and provide it during data frame construction [2]
>>>>        4. We should additionally configure Spark workerks classpath to
>>>> include Ignite libraries.
>>>> 
>>>> For now, almost all operation we need to do in Spark Data Frame
>>>> integration is supported by Java Thin Client.
>>>>        * obtain the list of caches.
>>>>        * get cache configuration.
>>>>        * execute SQL query.
>>>>        * stream data to the table - don't support by the thin client for
>>>> now, but can be implemented using simple SQL INSERT statements.
>>>> 
>>>> Advantages of usage Java Thin Client in Spark integration(they all known
>>>> from Java Thin Client advantages):
>>>>        1. Easy to configure: only IP addresses of server nodes are
>>>> required.
>>>>        2. Easy to deploy: only 1 additional jar required. No server
>>>> side(Ignite worker) configuration required.
>>>> 
>>>> I propose to implement Spark Data Frame integration through Java Thin
>>>> Client.
>>>> 
>>>> Thoughts?
>>>> 
>>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>> [2]
>>>> 
>>> 
>>> https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
>>>> 



Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Nikolay Izhikov <ni...@apache.org>.
Hello, Valentin.

> What I don't agree with is that replacing thick client with thin client is a way to fix usability issues. 

I think it will fix some of them.

> will potentially compromise the performance

As I mentioned earlier, I want to provide an easy way to play with the integration.
For maximum performance, one should use client nodes.

> What is the difference between thin and thick client from this point of view?

We need only 1 jar file.
The only configuration we need is a list of IP addresses.

> I'm not arguing there are usability issues with thick client. 
> I'm just suggesting to fix those issues first, before we jump reworking the implementation.

> My suggestion is to look at usability issues and try to fix them without getting rid of thick client.

I agree, let's do it!
Can you create some tickets?
I'm ready to look at it and contribute a fix.

On Tue, 23/10/2018 at 19:31 -0700, Valentin Kulichenko wrote:
> Nikolay,
> 
> Please see my comments below. Actually, I haven't made most of the
> assumptions that you mentioned, and I generally agree with you. What I
> don't agree with is that replacing thick client with thin client is a way
> to fix usability issues. Thin client is not going to be issue-free either,
> but will potentially compromise the performance, as well as functionality
> (like streaming, as Stephen mentioned). My suggestion is to look at
> usability issues and try to fix them without getting rid of thick client.
> 
> -Val
> 
> On Sun, Oct 21, 2018 at 1:08 AM Nikolay Izhikov <ni...@apache.org> wrote:
> 
> > Valentin.
> > 
> > Seems, You made several suggestions, which is not always true, from my
> > point of view:
> > 
> > 1. "We have access to Spark cluster installation to perform deployment
> > steps" - this is not true in cloud or enterprise environment.
> > 
> 
> Can you please elaborate on this? What is the difference between thin and
> thick client from this point of view? I understand that the latter would
> generally be more complicated, but how would one use thin client without
> deploying a JAR?
> 
> 
> > 
> > 2. "Spark cluster is used only for Ignite integration".
> > From what I know computational resources for big Spark cluster is divided
> > by many business divisions.
> > And it is not convenient to perform some deployment steps on this cluster.
> > 
> 
> Same as #1. Regardless how we use the Spark cluster, we need to deploy a
> JAR in case of thin client, no?
> 
> 
> > 
> > 3. "When Ignite + Spark are used in real production it's OK to have
> > reasonable deployment overhead"
> > What about developer who want to play with this integration?
> > And want to do it quickly to see how it works in real life examples.
> > Can we do his life much easier?
> > 
> 
> We can and we should :) I'm not arguing there are usability issues with
> thick client. I'm just suggesting to fix those issues first, before we jump
> reworking the implementation.
> 
> 
> > 
> > > First of all, they will exist with thin client either.
> > 
> > Spark have an ability to deploy jars on worker and add it to application
> > tasks classpath.
> > For 2.6 we must deploy 11 additional jars to start using Ignite.
> > Please, see my example on the bottom of documentation page [1]
> > 
> > Does cache-api-1.0.0.jar and h2-1.4.195.jar seems like obvious
> > dependencies for Ignite integration for you?
> > And for our users? :)
> > 
> 
> No, this is not obvious. Absolutely, this is a usability issue and we
> should think how to make user's life easier.
> 
> 
> > 
> > Actually, list of dependencies will be changed in 2.7 - new version of
> > jcache, new version of h2
> > So user should change it in code or perform additional deployment steps.
> > 
> > It overkill for me.
> > 
> > On the other hand - thin client requires only 1 jar.
> > Moreover, thin client protocol have the backward compatibility.
> > So thin client will perform correctly when Ignite cluster will be updated
> > from 2.6 to 2.7.
> > So, with Spark integration via thin client we will be able to update
> > Ignite cluster and Spark integration separately.
> > For now, we should do it in one big step.
> > 
> > What do you think?
> > 
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > 
> > On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> > > Guys,
> > > 
> > > From my experience, Ignite and Spark clusters typically run in the same
> > > environment, which makes client node a more preferable option. Mainly,
> > > because of performance. BTW, I doubt partition-awareness on thin client
> > > will help either, because in dataframes we only run SQL queries and I
> > > believe thin client will execute them through a proxy anyway. But correct
> > > me if I’m wrong.
> > > 
> > > Either way, it sounds like we just have usability issues with
> > 
> > Ignite/Spark
> > > integration. Why don’t we concentrate on fixing them then? For example,
> > 
> > #3
> > > can be fixed by loading XML content on master and then distributing it to
> > > workers, instead of loading on every worker independently. Then there are
> > > certain procedures like deploying JARs, etc. First of all, they will
> > 
> > exist
> > > with thin client either. Second of all, I’m sure there are ways to
> > 
> > simplify
> > > this procedures and make integration easier. My opinion is that working
> > 
> > on
> > > such improvements is going to add more value than another implementation
> > > based on thin client.
> > > 
> > > -Val
> > > 
> > > On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
> > > 
> > > > Hello Nikolay,
> > > > 
> > > > Your proposal sounds reasonable. However, I would suggest us to wait
> > 
> > while
> > > > partition-awareness is supported for Java thin client first. With that
> > > > feature, the client can connect to any node directly while presently
> > 
> > all
> > > > the communication goes through a proxy (a node the client is connected
> > 
> > to).
> > > > All of that is bad for performance.
> > > > 
> > > > 
> > > > Vladimir, how hard would it be to support the partition-awareness for
> > 
> > Java
> > > > client? Probably, Nikolay can take over.
> > > > 
> > > > --
> > > > Denis
> > > > 
> > > > 
> > > > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
> > > > wrote:
> > > > 
> > > > > Hello, Igniters.
> > > > > 
> > > > > Currently, Spark Data Frame integration implemented via client node
> > > > > connection.
> > > > > Whenever we need to retrieve some data into Spark worker(or master)
> > 
> > from
> > > > > Ignite we start a client node.
> > > > > 
> > > > > It has several major disadvantages:
> > > > > 
> > > > >         1. We should copy whole Ignite distribution on to each Spark
> > > > > worker [1]
> > > > >         2. We should copy whole Ignite distribution on to Spark
> > 
> > master to
> > > > > get catalogue works.
> > > > >         3. We should have the same absolute path to Ignite
> > 
> > configuration
> > > > > file on every worker and provide it during data frame construction
> > 
> > [2]
> > > > >         4. We should additionally configure Spark workerks classpath
> > 
> > to
> > > > > include Ignite libraries.
> > > > > 
> > > > > For now, almost all operation we need to do in Spark Data Frame
> > > > > integration is supported by Java Thin Client.
> > > > >         * obtain the list of caches.
> > > > >         * get cache configuration.
> > > > >         * execute SQL query.
> > > > >         * stream data to the table - don't support by the thin
> > 
> > client for
> > > > > now, but can be implemented using simple SQL INSERT statements.
> > > > > 
> > > > > Advantages of usage Java Thin Client in Spark integration(they all
> > 
> > known
> > > > > from Java Thin Client advantages):
> > > > >         1. Easy to configure: only IP addresses of server nodes are
> > > > > required.
> > > > >         2. Easy to deploy: only 1 additional jar required. No server
> > > > > side(Ignite worker) configuration required.
> > > > > 
> > > > > I propose to implement Spark Data Frame integration through Java Thin
> > > > > Client.
> > > > > 
> > > > > Thoughts?
> > > > > 
> > > > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > > > [2]
> > > > > 
> > > > 
> > > > 
> > 
> > https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> > > > > 

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Valentin Kulichenko <va...@gmail.com>.
Nikolay,

Please see my comments below. Actually, I haven't made most of the
assumptions that you mentioned, and I generally agree with you. What I
don't agree with is that replacing thick client with thin client is a way
to fix usability issues. Thin client is not going to be issue-free either,
but will potentially compromise the performance, as well as functionality
(like streaming, as Stephen mentioned). My suggestion is to look at
usability issues and try to fix them without getting rid of thick client.

-Val

On Sun, Oct 21, 2018 at 1:08 AM Nikolay Izhikov <ni...@apache.org> wrote:

> Valentin.
>
> Seems, You made several suggestions, which is not always true, from my
> point of view:
>
> 1. "We have access to Spark cluster installation to perform deployment
> steps" - this is not true in cloud or enterprise environment.
>

Can you please elaborate on this? What is the difference between the thin
and the thick client from this point of view? I understand that the latter
would generally be more complicated, but how would one use the thin client
without deploying a JAR?


>
> 2. "Spark cluster is used only for Ignite integration".
> From what I know computational resources for big Spark cluster is divided
> by many business divisions.
> And it is not convenient to perform some deployment steps on this cluster.
>

Same as #1. Regardless of how we use the Spark cluster, we need to deploy a
JAR in the case of the thin client too, no?


>
> 3. "When Ignite + Spark are used in real production it's OK to have
> reasonable deployment overhead"
> What about developer who want to play with this integration?
> And want to do it quickly to see how it works in real life examples.
> Can we do his life much easier?
>

We can and we should :) I'm not denying there are usability issues with the
thick client. I'm just suggesting that we fix those issues first, before we
jump into reworking the implementation.


>
> > First of all, they will exist with thin client either.
>
> Spark have an ability to deploy jars on worker and add it to application
> tasks classpath.
> For 2.6 we must deploy 11 additional jars to start using Ignite.
> Please, see my example on the bottom of documentation page [1]
>
> Does cache-api-1.0.0.jar and h2-1.4.195.jar seems like obvious
> dependencies for Ignite integration for you?
> And for our users? :)
>

No, this is not obvious. Absolutely, this is a usability issue and we
should think about how to make users' lives easier.


>
> Actually, list of dependencies will be changed in 2.7 - new version of
> jcache, new version of h2
> So user should change it in code or perform additional deployment steps.
>
> It overkill for me.
>
> On the other hand - thin client requires only 1 jar.
> Moreover, thin client protocol have the backward compatibility.
> So thin client will perform correctly when Ignite cluster will be updated
> from 2.6 to 2.7.
> So, with Spark integration via thin client we will be able to update
> Ignite cluster and Spark integration separately.
> For now, we should do it in one big step.
>
> What do you think?
>
> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>
> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> > Guys,
> >
> > From my experience, Ignite and Spark clusters typically run in the same
> > environment, which makes client node a more preferable option. Mainly,
> > because of performance. BTW, I doubt partition-awareness on thin client
> > will help either, because in dataframes we only run SQL queries and I
> > believe thin client will execute them through a proxy anyway. But correct
> > me if I’m wrong.
> >
> > Either way, it sounds like we just have usability issues with
> Ignite/Spark
> > integration. Why don’t we concentrate on fixing them then? For example,
> #3
> > can be fixed by loading XML content on master and then distributing it to
> > workers, instead of loading on every worker independently. Then there are
> > certain procedures like deploying JARs, etc. First of all, they will
> exist
> > with thin client either. Second of all, I’m sure there are ways to
> simplify
> > this procedures and make integration easier. My opinion is that working
> on
> > such improvements is going to add more value than another implementation
> > based on thin client.
> >
> > -Val
> >
> > On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
> >
> > > Hello Nikolay,
> > >
> > > Your proposal sounds reasonable. However, I would suggest us to wait
> while
> > > partition-awareness is supported for Java thin client first. With that
> > > feature, the client can connect to any node directly while presently
> all
> > > the communication goes through a proxy (a node the client is connected
> to).
> > > All of that is bad for performance.
> > >
> > >
> > > Vladimir, how hard would it be to support the partition-awareness for
> Java
> > > client? Probably, Nikolay can take over.
> > >
> > > --
> > > Denis
> > >
> > >
> > > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
> > > wrote:
> > >
> > > > Hello, Igniters.
> > > >
> > > > Currently, Spark Data Frame integration implemented via client node
> > > > connection.
> > > > Whenever we need to retrieve some data into Spark worker(or master)
> from
> > > > Ignite we start a client node.
> > > >
> > > > It has several major disadvantages:
> > > >
> > > >         1. We should copy whole Ignite distribution on to each Spark
> > > > worker [1]
> > > >         2. We should copy whole Ignite distribution on to Spark
> master to
> > > > get catalogue works.
> > > >         3. We should have the same absolute path to Ignite
> configuration
> > > > file on every worker and provide it during data frame construction
> [2]
> > > >         4. We should additionally configure Spark workerks classpath
> to
> > > > include Ignite libraries.
> > > >
> > > > For now, almost all operation we need to do in Spark Data Frame
> > > > integration is supported by Java Thin Client.
> > > >         * obtain the list of caches.
> > > >         * get cache configuration.
> > > >         * execute SQL query.
> > > >         * stream data to the table - don't support by the thin
> client for
> > > > now, but can be implemented using simple SQL INSERT statements.
> > > >
> > > > Advantages of usage Java Thin Client in Spark integration(they all
> known
> > > > from Java Thin Client advantages):
> > > >         1. Easy to configure: only IP addresses of server nodes are
> > > > required.
> > > >         2. Easy to deploy: only 1 additional jar required. No server
> > > > side(Ignite worker) configuration required.
> > > >
> > > > I propose to implement Spark Data Frame integration through Java Thin
> > > > Client.
> > > >
> > > > Thoughts?
> > > >
> > > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > > [2]
> > > >
> > >
> > >
> https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> > > >
>

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Nikolay Izhikov <ni...@apache.org>.
Valentin.

It seems you made several assumptions that are not always true, from my point of view:

1. "We have access to Spark cluster installation to perform deployment steps" - this is not true in cloud or enterprise environment.

2. "Spark cluster is used only for Ignite integration".
From what I know computational resources for big Spark cluster is divided by many business divisions.
And it is not convenient to perform some deployment steps on this cluster.

3. "When Ignite + Spark are used in real production it's OK to have reasonable deployment overhead"
What about developer who want to play with this integration?
And want to do it quickly to see how it works in real life examples.
Can we do his life much easier?

> First of all, they will exist with thin client either.

Spark has the ability to deploy jars to workers and add them to the application tasks' classpath.
For 2.6 we must deploy 11 additional jars to start using Ignite.
Please see my example at the bottom of the documentation page [1].
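
Just to illustrate (the paths and jar names below are examples only, not the full list of 11 jars - see [1]):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignite-df-example") // placeholder application name
  .config("spark.jars",
    "/opt/libs/ignite-core-2.6.0.jar," +
    "/opt/libs/ignite-spark-2.6.0.jar," +
    "/opt/libs/ignite-spring-2.6.0.jar," +
    "/opt/libs/cache-api-1.0.0.jar," +
    "/opt/libs/h2-1.4.195.jar") // ...plus the remaining jars from [1]
  .getOrCreate()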

Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies for the Ignite integration to you?
And to our users? :)

Actually, the list of dependencies will change in 2.7 - a new version of JCache, a new version of H2.
So users will have to change it in code or perform additional deployment steps.

That is overkill to me.

On the other hand, the thin client requires only 1 jar.
Moreover, the thin client protocol is backward compatible.
So the thin client will keep working correctly when the Ignite cluster is updated from 2.6 to 2.7.
So, with a Spark integration based on the thin client, we will be able to update the Ignite cluster and the Spark integration separately.
For now, we have to do it in one big step.

What do you think?

[1] https://apacheignite-fs.readme.io/docs/installation-deployment

On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> Guys,
> 
> From my experience, Ignite and Spark clusters typically run in the same
> environment, which makes client node a more preferable option. Mainly,
> because of performance. BTW, I doubt partition-awareness on thin client
> will help either, because in dataframes we only run SQL queries and I
> believe thin client will execute them through a proxy anyway. But correct
> me if I’m wrong.
> 
> Either way, it sounds like we just have usability issues with Ignite/Spark
> integration. Why don’t we concentrate on fixing them then? For example, #3
> can be fixed by loading XML content on master and then distributing it to
> workers, instead of loading on every worker independently. Then there are
> certain procedures like deploying JARs, etc. First of all, they will exist
> with thin client either. Second of all, I’m sure there are ways to simplify
> this procedures and make integration easier. My opinion is that working on
> such improvements is going to add more value than another implementation
> based on thin client.
> 
> -Val
> 
> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:
> 
> > Hello Nikolay,
> > 
> > Your proposal sounds reasonable. However, I would suggest us to wait while
> > partition-awareness is supported for Java thin client first. With that
> > feature, the client can connect to any node directly while presently all
> > the communication goes through a proxy (a node the client is connected to).
> > All of that is bad for performance.
> > 
> > 
> > Vladimir, how hard would it be to support the partition-awareness for Java
> > client? Probably, Nikolay can take over.
> > 
> > --
> > Denis
> > 
> > 
> > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
> > wrote:
> > 
> > > Hello, Igniters.
> > > 
> > > Currently, Spark Data Frame integration implemented via client node
> > > connection.
> > > Whenever we need to retrieve some data into Spark worker(or master) from
> > > Ignite we start a client node.
> > > 
> > > It has several major disadvantages:
> > > 
> > >         1. We should copy whole Ignite distribution on to each Spark
> > > worker [1]
> > >         2. We should copy whole Ignite distribution on to Spark master to
> > > get catalogue works.
> > >         3. We should have the same absolute path to Ignite configuration
> > > file on every worker and provide it during data frame construction [2]
> > >         4. We should additionally configure Spark workerks classpath to
> > > include Ignite libraries.
> > > 
> > > For now, almost all operation we need to do in Spark Data Frame
> > > integration is supported by Java Thin Client.
> > >         * obtain the list of caches.
> > >         * get cache configuration.
> > >         * execute SQL query.
> > >         * stream data to the table - don't support by the thin client for
> > > now, but can be implemented using simple SQL INSERT statements.
> > > 
> > > Advantages of usage Java Thin Client in Spark integration(they all known
> > > from Java Thin Client advantages):
> > >         1. Easy to configure: only IP addresses of server nodes are
> > > required.
> > >         2. Easy to deploy: only 1 additional jar required. No server
> > > side(Ignite worker) configuration required.
> > > 
> > > I propose to implement Spark Data Frame integration through Java Thin
> > > Client.
> > > 
> > > Thoughts?
> > > 
> > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > [2]
> > > 
> > 
> > https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> > > 

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Valentin Kulichenko <va...@gmail.com>.
Guys,

From my experience, Ignite and Spark clusters typically run in the same
environment, which makes client node a more preferable option. Mainly,
because of performance. BTW, I doubt partition-awareness on thin client
will help either, because in dataframes we only run SQL queries and I
believe thin client will execute them through a proxy anyway. But correct
me if I’m wrong.

Either way, it sounds like we just have usability issues with the
Ignite/Spark integration. Why don't we concentrate on fixing them then?
For example, #3 can be fixed by loading the XML content on the master and
then distributing it to the workers, instead of loading it on every worker
independently. Then there are certain procedures like deploying JARs, etc.
First of all, they will exist with the thin client as well. Second of all,
I'm sure there are ways to simplify these procedures and make the
integration easier. My opinion is that working on such improvements is
going to add more value than another implementation based on the thin
client.
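
Just to sketch the idea for #3 (paths are placeholders, and this is not a complete implementation):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the Ignite XML configuration once, on the master/driver.
val cfgContent = new String(Files.readAllBytes(Paths.get("/opt/ignite/config/default-config.xml")))

// Ship the content to the workers instead of requiring the same absolute path everywhere.
val cfgBroadcast = spark.sparkContext.broadcast(cfgContent)

spark.sparkContext.parallelize(0 until 4, 4).foreachPartition { _ =>
  // On each worker: materialize the config into a local temporary file and start the client node from it.
  val localCfg = Files.createTempFile("ignite-cfg", ".xml")
  Files.write(localCfg, cfgBroadcast.value.getBytes)
  // org.apache.ignite.Ignition.start(localCfg.toString)
}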

-Val

On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dm...@apache.org> wrote:

> Hello Nikolay,
>
> Your proposal sounds reasonable. However, I would suggest us to wait while
> partition-awareness is supported for Java thin client first. With that
> feature, the client can connect to any node directly while presently all
> the communication goes through a proxy (a node the client is connected to).
> All of that is bad for performance.
>
>
> Vladimir, how hard would it be to support the partition-awareness for Java
> client? Probably, Nikolay can take over.
>
> --
> Denis
>
>
> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org>
> wrote:
>
> > Hello, Igniters.
> >
> > Currently, Spark Data Frame integration implemented via client node
> > connection.
> > Whenever we need to retrieve some data into Spark worker(or master) from
> > Ignite we start a client node.
> >
> > It has several major disadvantages:
> >
> >         1. We should copy whole Ignite distribution on to each Spark
> > worker [1]
> >         2. We should copy whole Ignite distribution on to Spark master to
> > get catalogue works.
> >         3. We should have the same absolute path to Ignite configuration
> > file on every worker and provide it during data frame construction [2]
> >         4. We should additionally configure Spark workerks classpath to
> > include Ignite libraries.
> >
> > For now, almost all operation we need to do in Spark Data Frame
> > integration is supported by Java Thin Client.
> >         * obtain the list of caches.
> >         * get cache configuration.
> >         * execute SQL query.
> >         * stream data to the table - don't support by the thin client for
> > now, but can be implemented using simple SQL INSERT statements.
> >
> > Advantages of usage Java Thin Client in Spark integration(they all known
> > from Java Thin Client advantages):
> >         1. Easy to configure: only IP addresses of server nodes are
> > required.
> >         2. Easy to deploy: only 1 additional jar required. No server
> > side(Ignite worker) configuration required.
> >
> > I propose to implement Spark Data Frame integration through Java Thin
> > Client.
> >
> > Thoughts?
> >
> > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > [2]
> >
> https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
> >
>

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Denis Magda <dm...@apache.org>.
Hello Nikolay,

Your proposal sounds reasonable. However, I would suggest that we wait
until partition-awareness is supported for the Java thin client first. With that
feature, the client can connect to any node directly while presently all
the communication goes through a proxy (a node the client is connected to).
All of that is bad for performance.


Vladimir, how hard would it be to support the partition-awareness for Java
client? Probably, Nikolay can take over.

--
Denis


On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <ni...@apache.org> wrote:

> Hello, Igniters.
>
> Currently, Spark Data Frame integration implemented via client node
> connection.
> Whenever we need to retrieve some data into Spark worker(or master) from
> Ignite we start a client node.
>
> It has several major disadvantages:
>
>         1. We should copy whole Ignite distribution on to each Spark
> worker [1]
>         2. We should copy whole Ignite distribution on to Spark master to
> get catalogue works.
>         3. We should have the same absolute path to Ignite configuration
> file on every worker and provide it during data frame construction [2]
>         4. We should additionally configure Spark workerks classpath to
> include Ignite libraries.
>
> For now, almost all operation we need to do in Spark Data Frame
> integration is supported by Java Thin Client.
>         * obtain the list of caches.
>         * get cache configuration.
>         * execute SQL query.
>         * stream data to the table - don't support by the thin client for
> now, but can be implemented using simple SQL INSERT statements.
>
> Advantages of usage Java Thin Client in Spark integration(they all known
> from Java Thin Client advantages):
>         1. Easy to configure: only IP addresses of server nodes are
> required.
>         2. Easy to deploy: only 1 additional jar required. No server
> side(Ignite worker) configuration required.
>
> I propose to implement Spark Data Frame integration through Java Thin
> Client.
>
> Thoughts?
>
> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> [2]
> https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
>

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Nikolay Izhikov <ni...@apache.org>.
IGNITE-10325 created.

Wed, 7 Nov 2018, 11:42 Ray rayliu@cisco.com:

> From my past experience with Spark Data Frame API, the thick client
> approach
> leads to many usability problems.
>
> Ex.
>
> http://apache-ignite-users.70518.x6.nabble.com/Local-node-SEGMENTED-error-causing-node-goes-down-for-no-obvious-reason-td25061.html
>
> I think it makes a lot of sense to change to thin client.
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>

Re: [DISCUSSION] Spark Data Frame through Thin Client

Posted by Ray <ra...@cisco.com>.
From my past experience with Spark Data Frame API, the thick client approach
leads to many usability problems.

Ex.
http://apache-ignite-users.70518.x6.nabble.com/Local-node-SEGMENTED-error-causing-node-goes-down-for-no-obvious-reason-td25061.html

I think it makes a lot of sense to change to thin client.



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/