Posted to user@spark.apache.org by Yeoul Na <ye...@uci.edu> on 2017/03/08 02:18:12 UTC

PySpark Serialization/Deserialization (Pickling) Overhead

Hi all,

I am trying to analyze PySpark's performance overhead. People commonly say
PySpark is slower than Scala because of serialization/deserialization
overhead. I tried the example in this post:
https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This post, like
many other articles, says the straightforward Python implementation is the
slowest due to the serialization/deserialization overhead.

However, when I actually looked at the logs in the Web UI, PySpark's
serialization and deserialization times did not seem to be any larger than
Scala's. The main contributor was "Executor Computing Time". So we cannot
tell whether the slowdown comes from serialization or simply from Python
code being slower than Scala code.

So my question is: does "Task Deserialization Time" in the Spark Web UI
actually include PySpark's serialization/deserialization time? If not, how
can I actually measure the serialization/deserialization overhead?

Thanks,
Yeoul



Re: PySpark Serialization/Deserialization (Pickling) Overhead

Posted by rok <ro...@gmail.com>.
My guess is that the serialization times in the UI cover the Java side
only. To get a feel for the Python pickling/unpickling, use the
show_profiles() method of the SparkContext instance:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.show_profiles

That will show you how much of the execution time is taken up by the
cPickle load() and dump() calls.
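
For instance, a minimal sketch (assuming profiling is switched on with
spark.python.profile before the SparkContext is created):

    from pyspark import SparkConf, SparkContext

    # Profiling must be enabled before the context is created.
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    # Any Python-side lambda forces every row through pickling
    # between the JVM and the Python worker.
    rdd = sc.parallelize(range(1000000)).map(lambda x: x * 2)
    rdd.count()

    # Dumps cProfile stats per RDD; look for the pickle load/dumps
    # entries to see the ser/de share of the runtime.
    sc.show_profiles()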

Hope that helps,

Rok


Re: PySpark Serialization/Deserialization (Pickling) Overhead

Posted by Li Jin <ic...@gmail.com>.
Yeoul,

I think one way to microbenchmark PySpark serialization/deserialization
would be to run withColumn with a Python UDF that returns a constant, and
compare that against similar code in Scala; a rough sketch is below.
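
Something like this, for example (a sketch only; the DataFrame size and
the sum aggregation, which forces the UDF to actually run, are my own
choices):

    import time
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10 * 1000 * 1000)

    # The UDF does no real work, so its cost is dominated by shipping
    # rows to the Python worker and back (pickling/unpickling).
    const_udf = F.udf(lambda x: 1, IntegerType())

    start = time.time()
    df.withColumn("c", const_udf("id")).agg(F.sum("c")).collect()
    print("Python UDF: %.2fs" % (time.time() - start))

    # JVM-only baseline: same plan shape, no Python round trip.
    start = time.time()
    df.withColumn("c", F.lit(1)).agg(F.sum("c")).collect()
    print("lit(1):     %.2fs" % (time.time() - start))

The gap between the two timings gives a rough estimate of the Python
round-trip (ser/de plus worker) cost.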

I am not sure there is a way to measure just the serialization code,
because the PySpark API only lets you apply a Python function over the
data frame, and that always involves looping over the data in Python. You
would probably need to do some hacking to make it do just the
serialization.
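
One crude way to look at the pickling cost in isolation (outside Spark
entirely, so only a rough proxy for what the workers actually do) is to
time pickle dumps/loads on a comparable batch of rows:

    import pickle
    import time

    # Rows shaped roughly like the DataFrame under test.
    rows = [(i, float(i), "x" * 20) for i in range(1000000)]

    start = time.time()
    blob = pickle.dumps(rows, protocol=2)  # protocol 2, as in PySpark
    mid = time.time()
    pickle.loads(blob)
    end = time.time()

    print("dumps: %.2fs  loads: %.2fs" % (mid - start, end - mid))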

Maybe other people have more insights?

Li

