Posted to dev@spark.apache.org by Marc Le Bihan <ml...@gmail.com> on 2023/12/03 17:29:46 UTC

Should Spark 4.x use Java modules (those you define with module-info.java sources)?

Hello,

     Last month, I attempted to upgrade my Spring-Boot 2 Java project, 
which relies heavily on Spark 3.4.2, to Spring-Boot 3. It hasn't 
succeeded yet, but it was informative.

     Spring-Boot 2 → 3 especially means javax.* becoming jakarta.*: 
javax.activation, javax.ws.rs, javax.persistence, javax.validation, 
javax.servlet... all of these have to change their packages and 
dependencies.
     Apart from that, there was some trouble with ANTLR 4 versus ANTLR 3, 
and a few things with SLF4J and Log4j.

     It was not easy, and I guessed that moving to modules could be 
key. But when I get near the Spark submodules of my project, it fails 
with messages such as:
         package org.apache.spark.sql.types is declared in the unnamed 
module, but module fr.ecoemploi.outbound.spark.core does not read it
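That error means the consuming module would have to "requires" whatever module 
contains org.apache.spark.sql.types. A hypothetical sketch of the consumer's 
module-info.java (the name "spark.sql" is an assumption here, since the Spark 
jars declare no module name at all):

```java
// Hypothetical module-info.java for the consuming project.
// "spark.sql" is an assumed automatic module name; in practice the JDK
// cannot even derive a valid one from spark-sql_2.13-3.4.2.jar.
module fr.ecoemploi.outbound.spark.core {
    requires spark.sql;  // would make org.apache.spark.sql.types readable
}
```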

     But I can't handle the Spark dependencies easily, because they have 
an "invalid name" for Java: the automatic module name derived from the 
jar file name is rejected because of the "_2.13" suffix of the jars.
         [WARNING] Can't extract module name from 
breeze-macros_2.13-2.1.0.jar: breeze.macros.2.13: Invalid module name: 
'2' is not a Java identifier
         [WARNING] Can't extract module name from 
spark-tags_2.13-3.4.2.jar: spark.tags.2.13: Invalid module name: '2' is 
not a Java identifier
         [WARNING] Can't extract module name from 
spark-unsafe_2.13-3.4.2.jar: spark.unsafe.2.13: Invalid module name: '2' 
is not a Java identifier
         [WARNING] Can't extract module name from 
spark-mllib_2.13-3.4.2.jar: spark.mllib.2.13: Invalid module name: '2' 
is not a Java identifier
         [... around 30 ...]
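Those warnings come from the JDK's automatic-module-name derivation. A rough 
sketch of the rule (simplified from the java.lang.module.ModuleFinder javadoc; 
this is not the JDK's exact code) shows why the "_2.13" suffix breaks it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AutoModuleName {

    // Simplified sketch of the automatic-module-name derivation described
    // in the java.lang.module.ModuleFinder javadoc.
    static String derive(String jarFileName) {
        String name = jarFileName.replaceAll("\\.jar$", "");
        // The part from the first "-<digits>" on is treated as a version
        // and dropped; "_2.13" does not match, so it stays in the name.
        Matcher m = Pattern.compile("-(\\d+(\\.|$))").matcher(name);
        if (m.find()) {
            name = name.substring(0, m.start());
        }
        // Non-alphanumeric characters become dots; runs of dots collapse;
        // leading and trailing dots are removed.
        return name.replaceAll("[^A-Za-z0-9]", ".")
                   .replaceAll("\\.{2,}", ".")
                   .replaceAll("^\\.|\\.$", "");
    }

    // Simplified validity check: every dot-separated segment must start
    // like a Java identifier, so a segment such as "2" is rejected.
    static boolean isValid(String moduleName) {
        for (String part : moduleName.split("\\.")) {
            if (part.isEmpty() || !Character.isJavaIdentifierStart(part.charAt(0))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String derived = derive("spark-tags_2.13-3.4.2.jar");
        System.out.println(derived);           // spark.tags.2.13
        System.out.println(isValid(derived));  // false
    }
}
```

So spark-tags_2.13-3.4.2.jar yields "spark.tags.2.13", whose segment "2" is not 
a legal Java identifier, hence the warnings above.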

     I think that changing the naming pattern of the Spark jars for 
4.x could be a good idea,
     but beyond that, what about integrating Spark into modules, with 
its submodules defining module-info.java?

     Is this something that you think [must | should | might | should 
not | must not] be done?

Regards,

Marc Le Bihan

Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

Posted by Sean Owen <sr...@gmail.com>.
It already does. I think that's not the same idea?


Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

Posted by Almog Tavor <al...@gmail.com>.
I think Spark should start shading its problematic deps, similar to how
it's done in Flink.


Re: Should Spark 4.x use Java modules (those you define with module-info.java sources)?

Posted by Sean Owen <sr...@gmail.com>.
I am not sure we can control that - the Scala _x.y suffix has particular
meaning in the Scala ecosystem for artifacts and thus the naming of .jar
files. And we need to work with the Scala ecosystem.

What can't handle these files, Spring Boot? Does it somehow assume the .jar
file name relates to Java modules?

By the by, Spark 4 is already moving to the jakarta.* packages for similar
reasons.

I don't think Spark does or can really leverage Java modules. It started
waaay before that, and I expect it has some structural issues that are
incompatible with Java modules, like multiple places declaring code in the
same Java package.

As in all things, if there's a change that doesn't harm anything else and
helps support for Java modules, sure, suggest it. If it has the conflicts I
think it will, it's probably not possible, and not really a goal in my view.

