Posted to user@spark.apache.org by Amin Borjian <bo...@outlook.com> on 2021/11/24 07:14:43 UTC

[Spark] Does Spark support backward and forward compatibility?

I have a simple question about using Spark. Most tools answer this kind of question explicitly (in prominent documentation, such as a dedicated page), but I did not find it anywhere for Spark. Maybe my search was not thorough enough, but I thought it would be good to ask here, in the hope that the answer benefits other people as well.

The Spark binary is usually downloaded from the following link and then installed and configured on the cluster: Download Apache Spark<https://spark.apache.org/downloads.html>

If, for example, we use Java for programming (although any of the other supported languages could be used), we need the following dependencies to communicate with Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

As is clear, both the Spark cluster (the Spark binary) and the dependencies used on the application side have a specific version. Obviously, if the same version is used on both the application side and the server side, everything will most likely work as intended without any problems.

But what if the two versions are not the same? Is compatibility between the server and the application guaranteed under certain conditions (such as staying within the same major version)? For example, is it a problem if the client is always ahead? Or if the server is always ahead?

The motivation is that there may be a library I did not write that is pinned to an old version, while I want to update my cluster (the server version). Or it may not be possible to update the server version and all the application versions at the same time, so I want to update each one separately. As a result, the application and server versions will differ for a period of time (short or long). I want to know exactly how Spark behaves in this situation.

Re: [Spark] Does Spark support backward and forward compatibility?

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
One thing to point out is that you never bundle the Spark client with your code. You compile against a Spark version, bundle your code (without the Spark jars) into an uber jar, and deploy that uber jar to Spark. Spark already ships with the jars required to send jobs to the scheduler. At runtime, your code uses the jars bundled with the instance of Spark that your application is running in.
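For example, in a Maven build like the one in the original question, this usually means marking the Spark artifacts as provided, so they are available at compile time but left out of the uber jar (a sketch; the artifact names and versions are simply the ones quoted below):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
    <scope>provided</scope> <!-- compiled against, but not packaged into the uber jar -->
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
    <scope>provided</scope>
</dependency>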

Spark is backward compatible; i.e., a jar compiled against 3.1.x will run in a Spark 3.2.0 cluster.
As Sean mentioned, Spark is not guaranteed to be forward compatible; i.e., a jar compiled against 3.2.1 may not run in a Spark 2.4.0 cluster. It might work if the functions called from your code are available in 2.4.0, but it will fail if you call an API that was introduced after 2.4.0.
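To make that concrete, something like the following (a hypothetical sketch, not code from this thread) compiles fine against 3.2.0, but, if I remember correctly, the unionByName overload it calls was only added in Spark 3.1.0, so on a cluster older than that it would fail at runtime with a NoSuchMethodError:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForwardCompatSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("forward-compat-sketch").getOrCreate();

        Dataset<Row> a = spark.range(5).toDF("id");
        Dataset<Row> b = spark.range(5).toDF("id");

        // unionByName(Dataset, boolean) exists in 3.1.0+ (to the best of my knowledge).
        // The call compiles against 3.2.0, but on an older runtime the method is missing
        // and the job dies with java.lang.NoSuchMethodError.
        Dataset<Row> merged = a.unionByName(b, true);
        merged.show();

        spark.stop();
    }
}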

So the question of “Can I use an older version of the client to submit jobs to a newer version of Spark?” is moot. You never do that.

From: Amin Borjian <bo...@outlook.com>
Date: Wednesday, November 24, 2021 at 2:44 PM
To: Sean Owen <sr...@gmail.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: RE: [EXTERNAL] [Spark] Does Spark support backward and forward compatibility?




Thanks again for the reply.

Personally, I think the whole cluster should run a single version. What mattered most to me was how much the version of the client that submits jobs to the scheduler matters, and it sounds like we can expect everything to work well across minor version changes (i.e., anything below a major version change).

From: Sean Owen<ma...@gmail.com>
Sent: Wednesday, November 24, 2021 10:48 PM
To: Amin Borjian<ma...@outlook.com>
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

I think/hope that it goes without saying you can't mix Spark versions within a cluster.
Forwards compatibility is something you don't generally expect as a default from any piece of software, so I'm not sure there is something to document explicitly.
Backwards compatibility is important, and this is documented extensively where it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian <bo...@outlook.com> wrote:
Thank you very much for the reply. It would be great if these points were mentioned in the Spark documentation (for example, on the download page or somewhere else).

If I understand correctly, we can compile the client (for example, in Java) against a newer version (for example 3.2.0) within the same major version and run it against an older server (for example 3.1.x) without problems in most cases. Am I right? (Because backward compatibility can be described from either the server's or the client's point of view, I am restating it to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still on 3.1.x? Can the client work with the newer cluster version because it only uses old features of the server? (Maybe this is what you meant, and my previous sentence was wrong and I misunderstood.)

From: Sean Owen<ma...@gmail.com>
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian<ma...@outlook.com>
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? No.
Can you compile against a different version of Spark than you run on? That typically works within a major release, though forwards compatibility may not work (you can't use a feature that doesn't exist in the version on the cluster). Compiling against 3.2.0 and running on 3.1.x, for example, should work fine in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <bo...@outlook.com> wrote:

I have a simple question about using Spark. Most tools answer this kind of question explicitly (in prominent documentation, such as a dedicated page), but I did not find it anywhere for Spark. Maybe my search was not thorough enough, but I thought it would be good to ask here, in the hope that the answer benefits other people as well.

The Spark binary is usually downloaded from the following link and then installed and configured on the cluster: Download Apache Spark<https://spark.apache.org/downloads.html>

If, for example, we use Java for programming (although any of the other supported languages could be used), we need the following dependencies to communicate with Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

As is clear, both the Spark cluster (the Spark binary) and the dependencies used on the application side have a specific version. Obviously, if the same version is used on both the application side and the server side, everything will most likely work as intended without any problems.

But what if the two versions are not the same? Is compatibility between the server and the application guaranteed under certain conditions (such as staying within the same major version)? For example, is it a problem if the client is always ahead? Or if the server is always ahead?

The motivation is that there may be a library I did not write that is pinned to an old version, while I want to update my cluster (the server version). Or it may not be possible to update the server version and all the application versions at the same time, so I want to update each one separately. As a result, the application and server versions will differ for a period of time (short or long). I want to know exactly how Spark behaves in this situation.



RE: [Spark] Does Spark support backward and forward compatibility?

Posted by Amin Borjian <bo...@outlook.com>.
Thanks again for the reply.

Personally, I think the whole cluster should run a single version. What mattered most to me was how much the version of the client that submits jobs to the scheduler matters, and it sounds like we can expect everything to work well across minor version changes (i.e., anything below a major version change).

From: Sean Owen<ma...@gmail.com>
Sent: Wednesday, November 24, 2021 10:48 PM
To: Amin Borjian<ma...@outlook.com>
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

I think/hope that it goes without saying you can't mix Spark versions within a cluster.
Forwards compatibility is something you don't generally expect as a default from any piece of software, so I'm not sure there is something to document explicitly.
Backwards compatibility is important, and this is documented extensively where it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian <bo...@outlook.com> wrote:
Thank you very much for the reply. It would be great if these points were mentioned in the Spark documentation (for example, on the download page or somewhere else).

If I understand correctly, we can compile the client (for example, in Java) against a newer version (for example 3.2.0) within the same major version and run it against an older server (for example 3.1.x) without problems in most cases. Am I right? (Because backward compatibility can be described from either the server's or the client's point of view, I am restating it to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still on 3.1.x? Can the client work with the newer cluster version because it only uses old features of the server? (Maybe this is what you meant, and my previous sentence was wrong and I misunderstood.)

From: Sean Owen<ma...@gmail.com>
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian<ma...@outlook.com>
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? No.
Can you compile against a different version of Spark than you run on? That typically works within a major release, though forwards compatibility may not work (you can't use a feature that doesn't exist in the version on the cluster). Compiling against 3.2.0 and running on 3.1.x, for example, should work fine in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <bo...@outlook.com> wrote:

I have a simple question about using Spark. Most tools answer this kind of question explicitly (in prominent documentation, such as a dedicated page), but I did not find it anywhere for Spark. Maybe my search was not thorough enough, but I thought it would be good to ask here, in the hope that the answer benefits other people as well.

The Spark binary is usually downloaded from the following link and then installed and configured on the cluster: Download Apache Spark<https://spark.apache.org/downloads.html>

If, for example, we use Java for programming (although any of the other supported languages could be used), we need the following dependencies to communicate with Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

As is clear, both the Spark cluster (the Spark binary) and the dependencies used on the application side have a specific version. Obviously, if the same version is used on both the application side and the server side, everything will most likely work as intended without any problems.

But what if the two versions are not the same? Is compatibility between the server and the application guaranteed under certain conditions (such as staying within the same major version)? For example, is it a problem if the client is always ahead? Or if the server is always ahead?

The motivation is that there may be a library I did not write that is pinned to an old version, while I want to update my cluster (the server version). Or it may not be possible to update the server version and all the application versions at the same time, so I want to update each one separately. As a result, the application and server versions will differ for a period of time (short or long). I want to know exactly how Spark behaves in this situation.



Re: [Spark] Does Spark support backward and forward compatibility?

Posted by Martin Wunderlich <ma...@wunderlich.com>.
Hi Amin,

This might be only marginally relevant to your question, but in my 
project I also noticed the following: The trained and exported Spark 
models (i.e. pipelines saved to binary files) are also not compatible 
between versions, at least between major versions. I noticed this when 
trying to load a model built with Spark 2.4.4 after updating to 3.2.0. 
This didn't work.
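
For reference, the round trip that failed for me looks roughly like this (a sketch; the path and input data are placeholders, and the model at that path had been written by a Spark 2.4.4 job):

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.SparkSession;

public class ModelRoundTripSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("model-round-trip").getOrCreate();

        // Placeholder path to a pipeline model saved by an older (2.4.4) Spark job.
        String modelPath = "/models/my-pipeline";

        // Loading this with a 3.2.0 runtime is the step that did not work for me.
        PipelineModel model = PipelineModel.load(modelPath);
        model.transform(spark.read().parquet("/data/input")).show();

        spark.stop();
    }
}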

Cheers,

Martin

On 24.11.21 at 20:18, Sean Owen wrote:
> I think/hope that it goes without saying you can't mix Spark versions 
> within a cluster.
> Forwards compatibility is something you don't generally expect as a 
> default from any piece of software, so not sure there is something to 
> document explicitly.
> Backwards compatibility is important, and this is documented 
> extensively where it doesn't hold in the Spark docs and release notes.
>
>
> On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian 
> <bo...@outlook.com> wrote:
>
>     Thank you very much for the reply you sent. It would be great if
>     these items were mentioned in the Spark document (for example, the
>     download page or something else)
>
>     If I understand correctly, it means that we can compile the client
>     (for example Java, etc.) with a newer version (for example 3.2.0)
>     within the range of a major version against older server (for
>     example 3.1.x) and do not see any problem in most cases. Am I
>     right?(Because the issue of backward-compatibility can be
>     expressed from both the server and the client view, I repeated the
>     sentence to make sure I got it right.)
>
>     But what happened if we update server to 3.2.x and our client was
>     in version 3.1.x? Does it client can work with newer cluster
>     version because it uses just old feature of severs? (Maybe you
>     mean this and in fact my previous sentence was wrong and I
>     misunderstood)
>
>     *From: *Sean Owen <ma...@gmail.com>
>     *Sent: *Wednesday, November 24, 2021 5:38 PM
>     *To: *Amin Borjian <ma...@outlook.com>
>     *Cc: *user@spark.apache.org
>     *Subject: *Re: [Spark] Does Spark support backward and forward
>     compatibility?
>
>     Can you mix different Spark versions on driver and executor? no.
>
>     Can you compile against a different version of Spark than you run
>     on? That typically works within a major release, though forwards
>     compatibility may not work (you can't use a feature that doesn't
>     exist in the version on the cluster). Compiling vs 3.2.0 and
>     running on 3.1.x for example should work fine in 99% of cases.
>
>     On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian
>     <bo...@outlook.com> wrote:
>
>         I have a simple question about using Spark, which although
>         most tools usually explain this question explicitly (in
>         important text, such as a specific format or a separate page),
>         I did not find it anywhere. Maybe my search was not enough,
>         but I thought it was good that I ask this question in the hope
>         that maybe the answer will benefit other people as well.
>
>         Spark binary is usually downloaded from the following link and
>         installed and configured on the cluster: Download Apache Spark
>         <https://spark.apache.org/downloads.html>
>
>         If, for example, we use the Java language for programming
>         (although it can be other supported languages), we need the
>         following dependencies to communicate with Spark:
>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-core_2.12</artifactId>
>             <version>3.2.0</version>
>         </dependency>
>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-sql_2.12</artifactId>
>             <version>3.2.0</version>
>         </dependency>
>
>         As is clear, both the Spark cluster (binary of Spark) and the
>         dependencies used on the application side have a specific
>         version. In my opinion, it is obvious that if the version used
>         is the same on both the application side and the server side,
>         everything will most likely work in its ideal state without
>         any problems.
>
>         But the question is, what if the two versions are not the
>         same? Is it possible to have compatibility between the server
>         and the application in specific number of conditions (such as
>         not changing major version)? Or, for example, if the client is
>         always ahead, is it not a problem? Or if the server is always
>         ahead, is it not a problem?
>
>     The argument is that there may be a library that I did not write
>     and it is an old version, but I want to update my cluster (server
>     version). Or it may not be possible for me to update the server
>     version and all the applications version at the same time, so I
>     want to update each one separately. As a result, the
>     application-server version differs in a period of time. (maybe
>     short or long period) I want to know exactly how Spark works in
>     this situation.
>

Re: [Spark] Does Spark support backward and forward compatibility?

Posted by Sean Owen <sr...@gmail.com>.
I think/hope that it goes without saying you can't mix Spark versions
within a cluster.
Forwards compatibility is something you don't generally expect as a default
from any piece of software, so I'm not sure there is something to document
explicitly.
Backwards compatibility is important, and this is documented extensively
where it doesn't hold in the Spark docs and release notes.


On Wed, Nov 24, 2021 at 1:16 PM Amin Borjian <bo...@outlook.com>
wrote:

> Thank you very much for the reply you sent. It would be great if these
> items were mentioned in the Spark document (for example, the download page
> or something else)
>
>
>
> If I understand correctly, it means that we can compile the client (for
> example Java, etc.) with a newer version (for example 3.2.0) within the
> range of a major version against older server (for example 3.1.x) and do
> not see any problem in most cases. Am I right? (Because the issue of
> backward-compatibility can be expressed from both the server and the client
> view, I repeated the sentence to make sure I got it right.)
>
>
>
> But what happened if we update server to 3.2.x and our client was in
> version 3.1.x? Does it client can work with newer cluster version because
> it uses just old feature of severs? (Maybe you mean this and in fact my
> previous sentence was wrong and I misunderstood)
>
>
>
> *From: *Sean Owen <sr...@gmail.com>
> *Sent: *Wednesday, November 24, 2021 5:38 PM
> *To: *Amin Borjian <bo...@outlook.com>
> *Cc: *user@spark.apache.org
> *Subject: *Re: [Spark] Does Spark support backward and forward
> compatibility?
>
>
>
> Can you mix different Spark versions on driver and executor? no.
>
> Can you compile against a different version of Spark than you run on? That
> typically works within a major release, though forwards compatibility may
> not work (you can't use a feature that doesn't exist in the version on the
> cluster). Compiling vs 3.2.0 and running on 3.1.x for example should work
> fine in 99% of cases.
>
>
>
> On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <bo...@outlook.com>
> wrote:
>
> I have a simple question about using Spark, which although most tools
> usually explain this question explicitly (in important text, such as a
> specific format or a separate page), I did not find it anywhere. Maybe my
> search was not enough, but I thought it was good that I ask this question
> in the hope that maybe the answer will benefit other people as well.
>
> Spark binary is usually downloaded from the following link and installed
> and configured on the cluster: Download Apache Spark
> <https://spark.apache.org/downloads.html>
>
> If, for example, we use the Java language for programming (although it can
> be other supported languages), we need the following dependencies to
> communicate with Spark:
>
> <dependency>
>
>     <groupId>org.apache.spark</groupId>
>
>     <artifactId>spark-core_2.12</artifactId>
>
>     <version>3.2.0</version>
>
> </dependency>
>
> <dependency>
>
>     <groupId>org.apache.spark</groupId>
>
>     <artifactId>spark-sql_2.12</artifactId>
>
>     <version>3.2.0</version>
>
> </dependency>
>
> As is clear, both the Spark cluster (binary of Spark) and the dependencies
> used on the application side have a specific version. In my opinion, it is
> obvious that if the version used is the same on both the application side
> and the server side, everything will most likely work in its ideal state
> without any problems.
>
> But the question is, what if the two versions are not the same? Is it
> possible to have compatibility between the server and the application in
> specific number of conditions (such as not changing major version)? Or, for
> example, if the client is always ahead, is it not a problem? Or if the
> server is always ahead, is it not a problem?
>
> The argument is that there may be a library that I did not write and it is
> an old version, but I want to update my cluster (server version). Or it may
> not be possible for me to update the server version and all the
> applications version at the same time, so I want to update each one
> separately. As a result, the application-server version differs in a period
> of time. (maybe short or long period) I want to know exactly how Spark
> works in this situation.
>
>
>

RE: [Spark] Does Spark support backward and forward compatibility?

Posted by Amin Borjian <bo...@outlook.com>.
Thank you very much for the reply. It would be great if these points were mentioned in the Spark documentation (for example, on the download page or somewhere else).

If I understand correctly, we can compile the client (for example, in Java) against a newer version (for example 3.2.0) within the same major version and run it against an older server (for example 3.1.x) without problems in most cases. Am I right? (Because backward compatibility can be described from either the server's or the client's point of view, I am restating it to make sure I got it right.)

But what happens if we update the server to 3.2.x while our client is still on 3.1.x? Can the client work with the newer cluster version because it only uses old features of the server? (Maybe this is what you meant, and my previous sentence was wrong and I misunderstood.)

From: Sean Owen<ma...@gmail.com>
Sent: Wednesday, November 24, 2021 5:38 PM
To: Amin Borjian<ma...@outlook.com>
Cc: user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: [Spark] Does Spark support backward and forward compatibility?

Can you mix different Spark versions on driver and executor? No.
Can you compile against a different version of Spark than you run on? That typically works within a major release, though forwards compatibility may not work (you can't use a feature that doesn't exist in the version on the cluster). Compiling against 3.2.0 and running on 3.1.x, for example, should work fine in 99% of cases.

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <bo...@outlook.com> wrote:

I have a simple question about using Spark. Most tools answer this kind of question explicitly (in prominent documentation, such as a dedicated page), but I did not find it anywhere for Spark. Maybe my search was not thorough enough, but I thought it would be good to ask here, in the hope that the answer benefits other people as well.

The Spark binary is usually downloaded from the following link and then installed and configured on the cluster: Download Apache Spark<https://spark.apache.org/downloads.html>

If, for example, we use Java for programming (although any of the other supported languages could be used), we need the following dependencies to communicate with Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.2.0</version>
</dependency>

As is clear, both the Spark cluster (the Spark binary) and the dependencies used on the application side have a specific version. Obviously, if the same version is used on both the application side and the server side, everything will most likely work as intended without any problems.

But what if the two versions are not the same? Is compatibility between the server and the application guaranteed under certain conditions (such as staying within the same major version)? For example, is it a problem if the client is always ahead? Or if the server is always ahead?

The motivation is that there may be a library I did not write that is pinned to an old version, while I want to update my cluster (the server version). Or it may not be possible to update the server version and all the application versions at the same time, so I want to update each one separately. As a result, the application and server versions will differ for a period of time (short or long). I want to know exactly how Spark behaves in this situation.


Re: [Spark] Does Spark support backward and forward compatibility?

Posted by Sean Owen <sr...@gmail.com>.
Can you mix different Spark versions on driver and executor? No.
Can you compile against a different version of Spark than you run on? That
typically works within a major release, though forwards compatibility may
not work (you can't use a feature that doesn't exist in the version on the
cluster). Compiling against 3.2.0 and running on 3.1.x, for example, should
work fine in 99% of cases.
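
If you're ever unsure what the cluster is actually running, you can log it from inside the job and compare it with the version you compiled against, e.g. a minimal sketch like:

import org.apache.spark.sql.SparkSession;

public class VersionCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("version-check").getOrCreate();

        // Version of the Spark runtime the application is actually running on,
        // which may differ from the version the jar was compiled against.
        System.out.println("Running on Spark " + spark.version());

        spark.stop();
    }
}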

On Wed, Nov 24, 2021 at 8:04 AM Amin Borjian <bo...@outlook.com>
wrote:

> I have a simple question about using Spark, which although most tools
> usually explain this question explicitly (in important text, such as a
> specific format or a separate page), I did not find it anywhere. Maybe my
> search was not enough, but I thought it was good that I ask this question
> in the hope that maybe the answer will benefit other people as well.
>
> Spark binary is usually downloaded from the following link and installed
> and configured on the cluster: Download Apache Spark
> <https://spark.apache.org/downloads.html>
>
> If, for example, we use the Java language for programming (although it can
> be other supported languages), we need the following dependencies to
> communicate with Spark:
>
> <dependency>
>
>     <groupId>org.apache.spark</groupId>
>
>     <artifactId>spark-core_2.12</artifactId>
>
>     <version>3.2.0</version>
>
> </dependency>
>
> <dependency>
>
>     <groupId>org.apache.spark</groupId>
>
>     <artifactId>spark-sql_2.12</artifactId>
>
>     <version>3.2.0</version>
>
> </dependency>
>
> As is clear, both the Spark cluster (binary of Spark) and the dependencies
> used on the application side have a specific version. In my opinion, it is
> obvious that if the version used is the same on both the application side
> and the server side, everything will most likely work in its ideal state
> without any problems.
>
> But the question is, what if the two versions are not the same? Is it
> possible to have compatibility between the server and the application in
> specific number of conditions (such as not changing major version)? Or, for
> example, if the client is always ahead, is it not a problem? Or if the
> server is always ahead, is it not a problem?
>
> The argument is that there may be a library that I did not write and it is
> an old version, but I want to update my cluster (server version). Or it may
> not be possible for me to update the server version and all the
> applications version at the same time, so I want to update each one
> separately. As a result, the application-server version differs in a period
> of time. (maybe short or long period) I want to know exactly how Spark
> works in this situation.
>