Posted to issues@spark.apache.org by "Bjørn Jørgensen (Jira)" <ji...@apache.org> on 2021/10/31 21:56:00 UTC

[jira] [Updated] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.

     [ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bjørn Jørgensen updated SPARK-37174:
------------------------------------
    Description: 
Hi, I use this code:


{code:python}
import re

import pyspark.pandas as ps

# spark is an existing SparkSession (e.g. from the pyspark shell).
f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json")
pf01 = f01.to_pandas_on_spark()
pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x))
pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"])
pf01.info()
{code}
 

Sometimes it prints:
{code}
21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/10/31 20:38:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
/opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  df[column_name] = series
/opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas Series is expected to be small.
  warnings.warn(message, UserWarning)
21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}

  
And other times it "just" prints:
{code}
21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.{code}

Why does it print the source line df[column_name] = series?

Can we remove the /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning and its warnings.warn(message, UserWarning) echo? And can three of the four identical "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." messages be dropped?
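
For reference, a minimal sketch that silences these messages (assuming an active SparkSession named spark, and assuming the unpartitioned window comes from the default sequence index that to_pandas_on_spark() attaches; neither is confirmed here):

{code:python}
import warnings

import pyspark.pandas as ps
from pandas.errors import PerformanceWarning

# Quiet the JVM-side WARN lines (WindowExec, package) for this session.
spark.sparkContext.setLogLevel("ERROR")

# Quiet the two Python-side warnings quoted above.
warnings.filterwarnings("ignore", category=PerformanceWarning)
warnings.filterwarnings("ignore", category=UserWarning, module="pyspark.pandas.utils")

# Assumption: the unpartitioned window is used to number rows for the
# default "sequence" index. A distributed default index skips that window,
# at the cost of non-consecutive row ids.
ps.set_option("compute.default_index_type", "distributed")
{code}

That only hides the output, though; it does not explain why the identical WindowExec warning is logged four times for a single info() call.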


> WARN WindowExec: No Partition Defined is being printed 4 times. 
> ----------------------------------------------------------------
>
>                 Key: SPARK-37174
>                 URL: https://issues.apache.org/jira/browse/SPARK-37174
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>


