Posted to reviews@spark.apache.org by "WweiL (via GitHub)" <gi...@apache.org> on 2024/03/06 00:30:33 UTC

[PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

WweiL opened a new pull request, #45395:
URL: https://github.com/apache/spark/pull/45395

### What changes were proposed in this pull request?

Backport https://github.com/apache/spark/pull/45380 to branch-3.5

This handy util function should not support streaming DataFrames. Currently, if you call it on streaming DataFrames, it throws a relatively hard-to-understand error:
```
>>> df1 = spark.readStream.format("rate").load()
>>> df2 = spark.readStream.format("rate").load()
>>> from pyspark.testing.utils import QuietTest, assertDataFrameEqual
>>> assertDataFrameEqual(df1, df2)
/Users/wei.liu/oss-spark/python/pyspark/pandas/__init__.py:43: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wei.liu/oss-spark/python/pyspark/testing/utils.py", line 936, in assertDataFrameEqual
    actual_list = actual.collect()
  File "/Users/wei.liu/oss-spark/python/pyspark/sql/dataframe.py", line 1453, in collect
    sock_info = self._jdf.collectToPython()
  File "/Users/wei.liu/oss-spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/Users/wei.liu/oss-spark/python/pyspark/errors/exceptions/captured.py", line 221, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
rate
```
This happens because the function calls `collect`, which is not supported on streaming DataFrames. It would be better to catch this earlier and raise a clearer error.
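
For illustration only, an early check of this kind could look roughly like the sketch below. It is not the code from this PR: `reject_streaming`, `UnsupportedOperationError`, and `FakeDataFrame` are hypothetical names, and the fake class exists only so the snippet runs without a Spark session. The one real API it relies on is the boolean `isStreaming` property of `pyspark.sql.DataFrame`.

```python
class UnsupportedOperationError(Exception):
    """Hypothetical stand-in for an error reported under UNSUPPORTED_OPERATION."""


def reject_streaming(df, arg_name):
    """Fail fast with a readable message if `df` is a streaming DataFrame."""
    # pyspark.sql.DataFrame exposes the boolean property `isStreaming`;
    # getattr with a default keeps the sketch usable with stand-in objects.
    if getattr(df, "isStreaming", False):
        raise UnsupportedOperationError(
            "assertDataFrameEqual does not support streaming DataFrames, "
            f"but `{arg_name}` is streaming."
        )


class FakeDataFrame:
    """Hypothetical stand-in so the sketch runs without Spark installed."""

    def __init__(self, is_streaming):
        self.isStreaming = is_streaming


# A batch DataFrame passes the check silently.
reject_streaming(FakeDataFrame(False), "actual")

# A streaming DataFrame is rejected before any call to collect().
try:
    reject_streaming(FakeDataFrame(True), "actual")
except UnsupportedOperationError as exc:
    message = str(exc)
```

Running the check on both arguments before calling `collect` would replace the `AnalysisException` above with a message that names the actual problem.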

### Why are the changes needed?

Improve usability

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot
It helped me pick the error class UNSUPPORTED_OPERATION.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #45395: [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF
URL: https://github.com/apache/spark/pull/45395


Re: [PR] [SPARK-47277][3.5] PySpark util function assertDataFrameEqual should not support streaming DF [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #45395:
URL: https://github.com/apache/spark/pull/45395#issuecomment-1980009033

   Merged to branch-3.5.
