Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2024/02/16 01:24:49 UTC

[PR] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

HyukjinKwon opened a new pull request, #45132:
URL: https://github.com/apache/spark/pull/45132

   ### What changes were proposed in this pull request?
   
   This PR fixes the regression introduced by https://github.com/apache/spark/pull/36683.
   
   ```python
   import pandas as pd
   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 0)
   spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", False)
   spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
   
   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", -1)
   spark.createDataFrame(pd.DataFrame({'a': [123]})).toPandas()
   ```
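   For context, the failure comes from slicing the pandas DataFrame into Arrow batches with a step taken from the config: `range(0, len(pdf), step)` raises when the step is zero and misbehaves when it is negative. The sketch below illustrates the problem and one possible guard; the function and variable names are illustrative, not necessarily those used in the actual fix.
   
   ```python
   import pandas as pd
   
   def slice_pdf(pdf: pd.DataFrame, max_records_per_batch: int):
       # A non-positive setting is documented to mean "no limit",
       # so put the whole DataFrame into a single batch.
       step = len(pdf) if max_records_per_batch <= 0 else max_records_per_batch
       step = max(step, 1)  # also guard an empty DataFrame, which would give step=0
       # Without the guard, range(0, len(pdf), 0) raises
       # "range() arg 3 must not be zero" -- the error in the "Before" traceback.
       return [pdf.iloc[start : start + step] for start in range(0, len(pdf), step)]
   
   pdf = pd.DataFrame({"a": [123]})
   print(slice_pdf(pdf, 0))   # one batch containing the single row
   print(slice_pdf(pdf, -1))  # same: -1 also means "no limit"
   ```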
   
   **Before**
   
   ```
   /.../spark/python/pyspark/sql/pandas/conversion.py:371: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.pyspark.fallback.enabled' has been set to false.
     range() arg 3 must not be zero
     warn(msg)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/.../spark/python/pyspark/sql/session.py", line 1483, in createDataFrame
       return super(SparkSession, self).createDataFrame(  # type: ignore[call-overload]
     File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 351, in createDataFrame
       return self._create_from_pandas_with_arrow(data, schema, timezone)
     File "/.../spark/python/pyspark/sql/pandas/conversion.py", line 633, in _create_from_pandas_with_arrow
       pdf_slices = (pdf.iloc[start : start + step] for start in range(0, len(pdf), step))
   ValueError: range() arg 3 must not be zero
   ```
   ```
   Empty DataFrame
   Columns: [a]
   Index: []
   ```
   
   **After**
   
   ```
        a
   0  123
   ```
   
   ```
        a
   0  123
   ```
   
   ### Why are the changes needed?
   
   It fixes a regression. Setting `spark.sql.execution.arrow.maxRecordsPerBatch` to 0 or a negative value is documented behaviour, so this should be backported to branch-3.4 and branch-3.5.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it fixes a regression as described above.
   
   ### How was this patch tested?
   
   A unit test was added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #45132: [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch
URL: https://github.com/apache/spark/pull/45132




Re: [PR] [SPARK-47068][PYTHON][TESTS] Recover -1 and 0 case for spark.sql.execution.arrow.maxRecordsPerBatch [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #45132:
URL: https://github.com/apache/spark/pull/45132#issuecomment-1947709204

   Merged to master, branch-3.5 and branch-3.4.

