Posted to user@spark.apache.org by Bartosz Konieczny <ba...@gmail.com> on 2019/11/09 17:46:56 UTC

Why Spark generates Java code and not Scala?

Hi there,

A few days ago I got an intriguing but hard-to-answer question:
"Why Spark generates Java code and not Scala code?"
(https://github.com/bartosz25/spark-scala-playground/issues/18)

Since I'm not sure about the exact answer, I'd like to ask you to confirm
or correct my thinking. I looked for the reasons in JIRA and in the
research paper "Spark SQL: Relational Data Processing in Spark" (
http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
found nothing explaining why Java was chosen over Scala. The only ticket I
found discusses why Scala and not Java, but it concerns data types (
https://issues.apache.org/jira/browse/SPARK-5193). That's why I'm writing
here.

My guesses about why Java was chosen are:
- Java runtime compiler libraries are more mature and production-ready than
Scala's, or at least they were at implementation time
- the Scala compiler tends to be slower than Java's
https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
- the Scala compiler seems to be more complex, so debugging & maintaining it
would be harder
- it was easier to represent a pure Java OO design than a mixed FP/OO one in
Scala?

Thank you for your help.

-- 
Bartosz Konieczny
data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode

Re: Why Spark generates Java code and not Scala?

Posted by Marcin Tustin <ma...@bluevoyant.com.INVALID>.
Well TIL.

For those also newly informed:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-whole-stage-codegen.html
https://mail-archives.apache.org/mod_mbox/spark-dev/201911.mbox/browser


On Sun, Nov 10, 2019 at 7:57 AM Holden Karau <ho...@pigscanfly.ca> wrote:

> If you look inside the code generation, we generate Java code and compile
> it with Janino. For interested folks, the conversation has moved over to
> the dev@ list.

Re: Why Spark generates Java code and not Scala?

Posted by Holden Karau <ho...@pigscanfly.ca>.
If you look inside the code generation, we generate Java code and compile it
with Janino. For interested folks, the conversation has moved over to the
dev@ list.
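
As a rough sketch of the generate-then-compile flow mentioned above: the
following builds Java source as a string, compiles it at runtime, and calls
the result through reflection. Janino is not part of the JDK, so to stay
self-contained this uses the JDK's javax.tools compiler (i.e. javac, the
slower option) purely as a stand-in. The names RuntimeCodegenSketch,
generateSource, compileAndSum and the generated sum method are made up for
illustration and are not Spark's actual code.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: generate Java source text, compile it at runtime, run it.
// Uses javax.tools (javac) as a stand-in for Janino; requires a JDK, not a bare JRE.
final class RuntimeCodegenSketch {

    // Builds source for a class with one static method that sums (value + addend)
    // over an int array. The constant is baked directly into the generated code.
    static String generateSource(String className, int addend) {
        return "public class " + className + " {\n"
             + "  public static long sum(int[] values) {\n"
             + "    long total = 0L;\n"
             + "    for (int i = 0; i < values.length; i++) {\n"
             + "      total += values[i] + " + addend + ";\n"
             + "    }\n"
             + "    return total;\n"
             + "  }\n"
             + "}\n";
    }

    // Writes the source to a temp directory, compiles it, loads the class,
    // and invokes its static sum(int[]) method via reflection.
    static long compileAndSum(String className, String source, int[] values) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run on a JDK");
        }
        try {
            Path dir = Files.createTempDirectory("codegen-sketch");
            Path sourceFile = dir.resolve(className + ".java");
            Files.write(sourceFile, source.getBytes(StandardCharsets.UTF_8));

            // A return value of 0 means the compilation succeeded.
            int status = compiler.run(null, null, null, "-d", dir.toString(), sourceFile.toString());
            if (status != 0) {
                throw new IllegalStateException("Compilation of generated code failed");
            }

            try (URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()})) {
                Class<?> generated = loader.loadClass(className);
                Method sum = generated.getMethod("sum", int[].class);
                return (Long) sum.invoke(null, (Object) values);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("Could not compile or run generated code", e);
        }
    }

    public static void main(String[] args) {
        String source = generateSource("GeneratedSum", 10);
        // (1 + 10) + (2 + 10) + (3 + 10) = 36
        System.out.println(compileAndSum("GeneratedSum", source, new int[] {1, 2, 3}));
    }
}
```

With Janino the overall shape would be similar (source text in, loadable
class out), but the compilation step would go through Janino's own API
instead of javax.tools.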

On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin
<ma...@bluevoyant.com.invalid> wrote:

> What do you mean by this? Spark is written in a combination of Scala and
> Java, and then compiled to Java bytecode, as is typical for both Scala and
> Java. If there's additional bytecode generation happening, it's Java
> bytecode, because the platform runs on the JVM.

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: Why Spark generates Java code and not Scala?

Posted by Marcin Tustin <ma...@bluevoyant.com.INVALID>.
What do you mean by this? Spark is written in a combination of Scala and
Java, and then compiled to Java bytecode, as is typical for both Scala and
Java. If there's additional bytecode generation happening, it's Java
bytecode, because the platform runs on the JVM.

Re: Why Spark generates Java code and not Scala?

Posted by Reynold Xin <rx...@databricks.com>.
It’s mainly due to compilation speed. The Scala compiler is known to be slow,
and even javac is quite slow. We use Janino, which is a simpler compiler, to
get faster compilation at runtime.

Also, for low-level code we can’t use (due to perf concerns) any of the
advantages Scala has over Java, e.g. the Scala collections library,
functional programming, or map/flatMap. So using Scala wouldn’t really buy
us anything even if there were no compilation speed concerns.
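
To make the "low-level code" point concrete, here is a small sketch (not
Spark's actual generated code; the class and method names are invented for
illustration) of the same computation written twice in Java: once in a
collection/functional style with boxed values and lambdas, and once as the
kind of plain loop over primitives that performance-sensitive generated
code tends to look like. Once code is restricted to the second style,
Scala's higher-level features have little left to contribute.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of two coding styles for the same computation:
// the sum of 2 * v over all even values v.
final class LowLevelLoopSketch {

    // Collection/functional style: boxed Integers, lambdas, a stream pipeline.
    static long sumOfDoubledEvensFunctional(List<Integer> values) {
        return values.stream()
                .filter(v -> v % 2 == 0)
                .mapToLong(v -> 2L * v)
                .sum();
    }

    // Low-level style: a primitive array and a plain indexed loop,
    // with no boxing and no lambdas in the source.
    static long sumOfDoubledEvensLoop(int[] values) {
        long sum = 0L;
        for (int i = 0; i < values.length; i++) {
            int v = values[i];
            if (v % 2 == 0) {
                sum += 2L * v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> boxed = Arrays.asList(1, 2, 3, 4, 5, 6);
        int[] raw = {1, 2, 3, 4, 5, 6};
        // Both calls print 24 (4 + 8 + 12).
        System.out.println(sumOfDoubledEvensFunctional(boxed));
        System.out.println(sumOfDoubledEvensLoop(raw));
    }
}
```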

Re: Why Spark generates Java code and not Scala?

Posted by Holden Karau <ho...@pigscanfly.ca>.
Switching this from user to dev

On Sat, Nov 9, 2019 at 9:47 AM Bartosz Konieczny <ba...@gmail.com>
wrote:

> Hi there,
>
> Few days ago I got an intriguing but hard to answer question:
> "Why Spark generates Java code and not Scala code?"
> (https://github.com/bartosz25/spark-scala-playground/issues/18)
>
> Since I'm not sure about the exact answer, I'd like to ask you to confirm
> or not my thinking. I was looking for the reasons in the JIRA and the
> research paper "Spark SQL: Relational Data Processing in Spark" (
> http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf) but
> found nothing explaining why Java over Scala. The single task I found was
> about why Scala and not Java but concerning data types (
> https://issues.apache.org/jira/browse/SPARK-5193) That's why I'm writing
> here.
>
> My guesses about choosing Java code are:
> - Java runtime compiler libs are more mature and prod-ready than the
> Scala's - or at least, they were at the implementation time
> - Scala compiler tends to be slower than the Java's
> https://stackoverflow.com/questions/3490383/java-compile-speed-vs-scala-compile-speed
>
From the discussions when I was doing some code gen (in MLlib, not SQL), I
think this is the primary reason.

> - Scala compiler seems to be more complex, so debugging & maintaining it
> would be harder
>
This was also given as a secondary reason.

> - it was easier to represent a pure Java OO design than mixed FP/OO in
> Scala
>
No one brought up this point. Maybe it was a consideration and it just
wasn’t raised.
