Posted to issues@spark.apache.org by "Albertus Kelvin (JIRA)" <ji...@apache.org> on 2019/07/30 09:40:00 UTC

[jira] [Created] (SPARK-28562) PySpark profiling is not understandable

Albertus Kelvin created SPARK-28562:
---------------------------------------

             Summary: PySpark profiling is not understandable
                 Key: SPARK-28562
                 URL: https://issues.apache.org/jira/browse/SPARK-28562
             Project: Spark
          Issue Type: Question
          Components: Optimizer
    Affects Versions: 2.4.0
            Reporter: Albertus Kelvin


I was profiling code in PySpark. I set "spark.python.profile" to "true" in the Spark config and wrote a simple method consisting of several DataFrame operations, such as "withColumn" and "join". Here's the code sample:


{code:python}
import pyspark.sql.functions as F

def join_df(df, df1):
    df = df.withColumn('rowa', F.lit(100))
    df = df.withColumn('rowb', df['rowa'] * F.lit(100))

    joined_df = df.join(df1, 'rowid', how='left')
    return joined_df
{code}
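
For context, profiling was enabled and the method exercised roughly like this (a minimal sketch; the app name and the sample dataframes are placeholders, not my actual job):

{code:python}
from pyspark.sql import SparkSession

# "spark.python.profile" has to be set before the SparkContext is created
spark = (SparkSession.builder
         .appName("profiling-test")  # placeholder app name
         .config("spark.python.profile", "true")
         .getOrCreate())

# placeholder dataframes just to drive join_df
df = spark.createDataFrame([(1,), (2,)], ['rowid'])
df1 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['rowid', 'label'])

joined = join_df(df, df1)
joined.collect()  # trigger execution so the profiler records something
{code}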

However, after the driver exited, the profiler output was not understandable because it did not contain my filename or my methods. All that appeared were Spark's built-in files and methods, such as "rdd.py", "worker.py", and "serializers.py".

The question is: how can I see which of my methods are the bottlenecks? For example, with the code sample above, I'd like to know the time taken by the "withColumn" and "join" operations.
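
For reference, this is how I currently read the profiler output after the job finishes (a short sketch; the dump path is a placeholder):

{code:python}
# print the accumulated per-stage profile stats to stdout,
# or dump them to a directory for later inspection
spark.sparkContext.show_profiles()
spark.sparkContext.dump_profiles("/tmp/pyspark-profile")  # placeholder path
{code}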

Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org