Posted to user@spark.apache.org by Steven Van Ingelgem <st...@kbc.be.INVALID> on 2020/05/13 11:29:02 UTC

Huge difference in speed between pyspark and scalaspark

Public

Hello all,


We noticed a HUGE difference between using pyspark and spark in scala.
Pyspark runs:

  *   on my work computer in +- 350 seconds
  *   on my home computer in +- 130 seconds (Windows defender enabled)
  *   on my home computer in +- 105 seconds (Windows defender disabled)
  *   on my home computer as Scala code in +- 7 seconds

The script:
def setUp(self):
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')

    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.parallelize([
            ('Wim', 46),
            ('Klaas', 18),
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int')
    )
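
(For reference: `parallelize` and `assertDatasetEquals` are helpers from our test base class, which isn't shown here. A minimal sketch of what such a base class could look like follows — illustrative only, assuming a shared local SparkSession; the real harness may differ:)

# Illustrative sketch of a test base class providing the helpers used above.
# Not the actual harness from this mail; assumes pyspark with a local master.
import unittest
from pyspark.sql import SparkSession


class SparkTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = (SparkSession.builder
                     .master('local[*]')
                     .appName('Testing')
                     .getOrCreate())

    def parallelize(self, data):
        # Builds an RDD from local Python objects; the test then calls
        # .toDF('name: string, age: int') on it. This is the step that
        # ships Python objects into the JVM.
        return self.spark.sparkContext.parallelize(data)

    def assertDatasetEquals(self, actual, expected):
        # Collect both DataFrames and compare their rows, ignoring order.
        self.assertEqual(sorted(actual.collect()), sorted(expected.collect()))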


Re: Huge difference in speed between pyspark and scalaspark

Posted by Russell Spitzer <ru...@gmail.com>.
Well, right off the bat, the Python version is going to have to be slower,
because the dataset is being parallelized from Python types.

While DataFrames are usually identical in performance between different
languages, the beginning of your script starts off the initial data outside
of DataFrames. This move from native-language objects to Java objects is
going to be basically instantaneous for Java, but will carry a performance
penalty in Python.

I haven't looked at this code recently, but worst case you would end up
serializing the Python objects to the executors, and each executor would
then start a local Python process where the serialization would have to
happen. Best case, the conversion is done on the driver prior to
parallelizing, so at least you wouldn't have to spin up Python interpreters
on the executors. I can't remember which it actually is in the code. Either
way, this is a big hit compared to Java, which can just encode the Java
objects to Spark internal rows immediately.

Although that is expensive, I wouldn't imagine it would be as expensive as
the results you are seeing (350 seconds is pretty ridiculous). One way to
eliminate this difference between the code samples would be to have your
script read its data from a CSV file (or another data source). If you avoid
calling parallelize, you will at least remove the serialization costs for
Python, and the two code samples would be much closer to identical. I would
try doing that before looking into anything else.
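
For example, something along these lines lets you compare the two paths directly (a rough sketch — 'people.csv' is just a placeholder file containing the same name/age data):

# Rough timing sketch: DataFrame built via parallelize (Python-object
# serialization path) vs. the same data read from a CSV file.
# Note: whichever job runs first also pays any one-time session warm-up cost.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('timing').getOrCreate()

t0 = time.perf_counter()
spark.sparkContext.parallelize([('Wim', 46), ('Klaas', 18)]) \
     .toDF('name: string, age: int') \
     .count()                          # force evaluation
t1 = time.perf_counter()

spark.read.csv('people.csv', header=True, inferSchema=True).count()
t2 = time.perf_counter()

print(f'parallelize path: {t1 - t0:.2f}s, csv path: {t2 - t1:.2f}s')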



Re: Huge difference in speed between pyspark and scalaspark

Posted by Russell Spitzer <ru...@gmail.com>.
Like I wrote before: when you mix in Python, Spark needs a Python process to
deal with Python objects, which means you end up booting up Python processes
for every executor. My guess would be that on your system Python takes a
while to start up, so before any work can be done you are waiting on this
slow startup. My guess is that with a much larger dataset the difference
would not be as great, since it would amortize the Python startup costs. The
serialization costs are also non-trivial, but if you don't use any Python
lambdas I don't think that cost would scale that dramatically with your data
size. Check out some of the original DataFrame articles for more perf
comparisons and explanations.
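
One rough way to check whether Python worker startup is what you are paying for is to run the same tiny Python job twice in one session and compare the two timings (illustrative sketch; spark.python.worker.reuse already defaults to true):

# If the first run is much slower than the second, per-executor Python
# worker startup is likely the dominant cost.
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .config('spark.python.worker.reuse', 'true')
         .getOrCreate())
sc = spark.sparkContext

for attempt in (1, 2):
    t0 = time.perf_counter()
    # The lambda forces real Python worker processes on the executors.
    sc.parallelize(range(100), 4).map(lambda x: x + 1).count()
    print(f'attempt {attempt}: {time.perf_counter() - t0:.2f}s')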


RE: Huge difference in speed between pyspark and scalaspark

Posted by Steven Van Ingelgem <st...@kbc.be.INVALID>.
Internal

Wow!
I changed the simple test I ran yesterday to a CSV version:


def setUp(self):
    self.left = self.spark.read.format("csv").options(header='true', inferSchema='true').load("test_left.csv")
    self.right = self.spark.read.format("csv").options(header='true', inferSchema='true').load("test_right.csv")

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.spark.read.format("csv").options(header='true', inferSchema='true').load("expected.csv"))


Scala parallelize: 19.956 s
Python parallelize: 61.184 s
Scala CSV: 18.394 s
Python CSV: reported as 3.747 s by unittest (but the total system time was 16.291 s, so I assume 12.544 s spinning up Spark)

What I don't understand, though, is why the parallelize version is so much faster in Scala than in Python, whereas the CSV versions are similar?
What exactly makes this huge difference in pyspark between parallelize & csv, when there is no such difference in Scala between parallelize & csv?


Thanks!



Re: Huge difference in speed between pyspark and scalaspark

Posted by Russell Spitzer <ru...@gmail.com>.
It could also be the cost of spinning up interpreters, but there is an easy
way to find out. Remove the serialization from the equation.


Re: Huge difference in speed between pyspark and scalaspark

Posted by Wim Van Leuven <wi...@highestpoint.biz>.
Yes it is ... That's why it is strange. But never mind the 350 secs, that is
not relevant; that's probably the security layer...

What is relevant is that the same pyspark code takes 135s on a plain Win10
PC vs only 8 secs on WSL Debian on the same Win10 PC, also 8 on a Linux
server and also 8 on a Mac. Thát difference is the question.

That's why we added the same implementation in Scala Spark, which also takes
just 8 secs on Win10. The difference is just insane... That's not a simple
ser/deser of these few strings, right?

Thoughts?
-wim




Re: Huge difference in speed between pyspark and scalaspark

Posted by Sean Owen <sr...@gmail.com>.
That code is barely doing anything; there is no action. I can't imagine it
taking 350s. Is this really all that is run?
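
For example, nothing actually executes until an action runs (illustrative sketch, assuming DataFrames like the left/right ones from your snippet):

sut = left.union(right)   # transformation only: lazy, returns immediately
rows = sut.collect()      # action: this is where Spark does the actual work
print(len(rows))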


Re: Huge difference in speed between pyspark and scalaspark

Posted by Sriram Bhamidipati <sr...@gmail.com>.
PySpark and Scala versions, along with the Spark version, are also needed to
conclude anything.
PySpark had a performance lag in the early days in comparison to Scala,
which is native for Spark.

Also: the video referenced is 5 years old.
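
For the record, something like this prints the relevant version numbers from a PySpark session (illustrative sketch):

import sys
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print('Spark version:  ', spark.version)
print('PySpark version:', pyspark.__version__)
print('Python version: ', sys.version.split()[0])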



-- 
-Sriram

Re: Huge difference in speed between pyspark and scalaspark

Posted by Mich Talebzadeh <mi...@gmail.com>.
This comparison comes up time and time again. Spark is written in Scala and
provides APIs in Scala, Java, Python, and R.

However, its primary focus has been on Scala. In generic terms this means
that Python, Java etc. are add-ons, and I suspect if you look under the
bonnet they interface with Scala.

Hence that would be a driver for Spark on Scala being fastest. The thing is,
it is what it is: if you are going to use Python, then expect that behaviour
to materialise.

HTH,


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Huge difference in speed between pyspark and scalaspark

Posted by Gerard Maas <ge...@gmail.com>.
Steven,

I'm not sure what the goals of the comparison are.

The reason behind the difference is that `parallelize` is an RDD-based
operation. When called from PySpark, the Python-to-JVM call incurs
additional overhead transferring that array.
These early-days differences [1] between the Scala and the Python APIs are
no longer an issue when you use the SparkSQL-based Dataframe/Dataset API.

Given that the source of your data is unlikely to be synthetically-generated
in-memory records, a more interesting test would be to load the same dataset
(text, CSV, Parquet, ...) from both Spark/Scala and PySpark.

I hope this helps.

Kind regards,
- Gerard.



[1] https://databricks.com/session/getting-the-best-performance-with-pyspark


RE: Huge difference in speed between pyspark and scalaspark

Posted by Steven Van Ingelgem <st...@kbc.be.INVALID>.
Public

Hello all,


We noticed a HUGE difference between using pyspark and spark in scala.
Pyspark runs:

  *   on my work computer in +- 350 seconds
  *   on my home computer in +- 130 seconds (Windows defender enabled)
  *   on my home computer in +- 105 seconds (Windows defender disabled)
  *   on my home computer as Scala code in +- 7 seconds

What we already investigated:

  *   memory is correct (and enough) in both scala & pyspark. 1G was defined, and it was never reaching the garbage collecting limit (which was also 1G).
  *   There are a lot of threads in the spark process, is this normal? Unknown... There are 300 more in the Spark session under pyspark than under scala-spark.
  *   Scala code timings are consistent on any platform used (windows, linux, mac, WSL)
  *   Disabling the antivirus makes a bit of a difference, but not enough to justify the huge differences in timings.
  *   Under Debian WSL on the same windows hosts, the pyspark code runs in the same time as the scala code.
  *   Timings for pyspark on Linux/Mac/WSL are similar to the timings for scala. (so virtualized it's also around 7s, but not virtualized it's 130s!)


What could we continue to investigate to figure out what the difference in time could be?
And/or did someone encounter this same behavior and could point us to a possible solution?


Thanks,
Steven





The pyspark script:
The spark session is created via:
SparkSession
                      .builder
                      .appName('Testing')
                      .config('spark.driver.extraJavaOptions', '-Xms1g')
                      .getOrCreate()

This is the part of the unittest:
def setUp(self):
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')

    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.parallelize([
            ('Wim', 46),
            ('Klaas', 18),
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int')
    )

VisualVM from pyspark script:
[VisualVM screenshot attachment not included in the plain-text archive]

The same Scala script:
class GlowPerformanceSpec extends SparkFunSuite
                                  with Matchers
                                  with DatasetComparer {

  test("test simple union") {
    val data = testData()
    val sut = data._1.union(data._2)
    assertDatasetEquality(sut,
                          parallelize(Seq(("Wim", 46),
                                          ("Klaas", 18),
                                          ("Jiri", 25),
                                          ("Tomasz", 23)),
                                      "name",
                                      "age"))
  }

  private def testData(): (DataFrame, DataFrame) = {
    val left = parallelize(Seq(("Wim", 46),
                               ("Klaas", 18)),
                           "name",
                           "age")
    val right = parallelize(Seq(("Jiri", 25),
                                ("Tomasz", 23)),
                            "name",
                            "age")
    (left, right)
  }

  import scala.reflect.runtime.universe._
  private def parallelize[T <: Product : ClassTag : TypeTag](data: Seq[T],
                                                             colNames: String*): DataFrame = {
    import spark.implicits._
    spark.sparkContext.parallelize(data).toDF(colNames:_*)
  }
}

VisualVM from scala:
[VisualVM screenshot attachment not included in the plain-text archive]
