Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/11/02 07:58:00 UTC

[jira] [Comment Edited] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.

    [ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437184#comment-17437184 ] 

Hyukjin Kwon edited comment on SPARK-37174 at 11/2/21, 7:57 AM:
----------------------------------------------------------------

This is related to the default index; see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type.

Spark 3.3 aims to remove such warnings by natively supporting global windows. That is slightly orthogonal to this issue, though.
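
For intuition, here is a rough pure-Python sketch (not actual pandas-on-Spark code; the partition layout is illustrative) of why the default "sequence" index needs the single-partition window that WindowExec warns about, while the "distributed-sequence" index, selectable via ps.set_option("compute.default_index_type", "distributed-sequence"), does not:

```python
# Rows as they might sit in three Spark partitions (illustrative data).
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

# "sequence" index: a global 0, 1, 2, ... numbering needs a total ordering
# over ALL rows, which Spark implements as a window with no partitioning,
# i.e. it moves everything into one partition -- hence the warning.
sequence_index = list(range(sum(len(p) for p in partitions)))

# "distributed-sequence" index: each partition only needs its own length
# plus the lengths of the partitions before it, so no global collapse.
offsets, running = [], 0
for p in partitions:
    offsets.append(running)
    running += len(p)
distributed_index = [off + i
                     for p, off in zip(partitions, offsets)
                     for i in range(len(p))]

print(sequence_index == distributed_index)  # prints True: same index, cheaper plan
```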


was (Author: hyukjin.kwon):
This is related to default index, see also https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type.
Spark 3.3 targets to remove such warnings.

> WARN WindowExec: No Partition Defined is being printed 4 times. 
> ----------------------------------------------------------------
>
>                 Key: SPARK-37174
>                 URL: https://issues.apache.org/jira/browse/SPARK-37174
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>
> Hi, I use this code:
> {code:python}
> f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json")
> pf01 = f01.to_pandas_on_spark()
> pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x))
> pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"])
> pf01.info(){code}
>  
>  Sometimes it prints:
> {code}
>  21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  21/10/31 20:38:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
>  21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
>    df[column_name] = series
>  /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
>    warnings.warn(message, UserWarning)
>  21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}
>  
>  And other times it "just" prints:
> {code}
>  21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>  21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}
> Why does it print df[column_name] = series?
>  Can we remove the /opt/spark/python/pyspark/pandas/utils.py:967: warning?
>  And the warnings.warn(message, UserWarning) line?
>  And three of the four "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." lines?
>  
>  
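
As an interim workaround for the Python-level warnings quoted above, they can be filtered on the user side; a minimal sketch (the message pattern is an assumption based on the quoted log text):

```python
import warnings

# The PerformanceWarning / UserWarning lines in the log are ordinary Python
# warnings, so they can be suppressed with a filter. Note this does NOT
# silence the "WARN WindowExec" lines -- those come from Spark's JVM-side
# log4j logger and need a log-level change instead, e.g.
#   spark.sparkContext.setLogLevel("ERROR")

warnings.filterwarnings(
    "ignore",
    message=r".*loads all data into the driver's memory.*",
    category=UserWarning,
)

# Demonstration that the filter swallows the warning:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")      # record everything...
    warnings.filterwarnings(             # ...except the filtered message
        "ignore",
        message=r".*loads all data into the driver's memory.*",
        category=UserWarning,
    )
    warnings.warn(
        "`to_pandas` loads all data into the driver's memory.",
        UserWarning,
    )

print(len(caught))  # prints 0: the warning was filtered out
```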



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
