Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/25 11:15:58 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

HyukjinKwon opened a new pull request #28928:
URL: https://github.com/apache/spark/pull/28928


   ### What changes were proposed in this pull request?
   
   When a pandas DataFrame uses floats as its index, `createDataFrame` produces a Spark DataFrame with wrong results when Arrow is enabled, as shown below:
   
   ```bash
   ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
   ```
   
   ```python
   >>> import pandas as pd
   >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
   +---+
   |  a|
   +---+
   |  1|
   |  1|
   |  2|
   +---+
   ```
   
   This is because direct slicing treats the slice bounds as index labels rather than positions when the index contains floats:
   
   ```python
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
        a
   2.0  1
   3.0  2
   4.0  3
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
        a
   4.0  3
   >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
      a
   4  3
   ```
   
   This PR proposes to explicitly use `iloc` to slice positionally when we create a DataFrame from a pandas DataFrame with Arrow enabled.
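
As a rough illustration of the positional batching described above, here is a minimal pandas-only sketch. The helper name `slice_positionally` and its `num_slices` parameter are hypothetical and stand in for the slicing step and `defaultParallelism`; this is not the actual Spark code. Direct slicing such as `pdf[2:]` is deliberately left out because its result on a float index may depend on the pandas version.

```python
import pandas as pd


def slice_positionally(pdf, num_slices):
    """Hypothetical helper: split pdf into row batches by position."""
    if len(pdf) == 0:
        # Avoid a zero step, which range() would reject.
        return [pdf]
    # Round len(pdf) / num_slices up to get the batch size.
    step = -(-len(pdf) // num_slices)
    # .iloc slices by row position regardless of the index dtype.
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]


# Same data as the reproduction above: a float index starting at 2.0.
pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
batches = slice_positionally(pdf, 3)
rows = [value for batch in batches for value in batch['a'].tolist()]
print(rows)
```

Each row lands in exactly one batch here, so concatenating the batches gives back the original rows in order.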
   
   FWIW, I was trying to investigate why direct slicing sometimes refers to the index value and sometimes to the positional index, but I stopped investigating further after reading this: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection
   
   > While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.
   
   ### Why are the changes needed?
   
   To create the correct Spark DataFrame from a pandas DataFrame without data loss.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it is a bug fix. 
   
   ```bash
   ./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
   ```
   ```python
   import pandas as pd
   spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
   ```
   
   Before:
   
   ```python
   +---+
   |  a|
   +---+
   |  1|
   |  1|
   |  2|
   +---+
   ```
   
   After:
   
   ```python
   +---+
   |  a|
   +---+
   |  1|
   |  2|
   |  3|
   +---+
   ```
   
   ### How was this patch tested?
   
   Tested manually, and unit tests were added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649883425


   Thank you @BryanCutler and @ueshin!




[GitHub] [spark] SparkQA removed a comment on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649477489


   **[Test build #124513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124513/testReport)** for PR 28928 at commit [`612426e`](https://github.com/apache/spark/commit/612426e34e29229aa2187e8775ee0e453288c75d).




[GitHub] [spark] SparkQA commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649491191


   **[Test build #124513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124513/testReport)** for PR 28928 at commit [`612426e`](https://github.com/apache/spark/commit/612426e34e29229aa2187e8775ee0e453288c75d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649478115








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649491706








[GitHub] [spark] gatorsmile commented on a change in pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on a change in pull request #28928:
URL: https://github.com/apache/spark/pull/28928#discussion_r445976618



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -413,7 +413,7 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Slice the DataFrame to be batched
         step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
-        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+        pdf_slices = (pdf.iloc[start:start + step] for start in xrange(0, len(pdf), step))

Review comment:
       Thank you for fixing this! 
   
   > While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
   
   Is it the only place? 






[GitHub] [spark] BryanCutler commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
BryanCutler commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649741883


   merged to master, branch-3.0 and branch-2.4




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649478115








[GitHub] [spark] BryanCutler closed pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
BryanCutler closed pull request #28928:
URL: https://github.com/apache/spark/pull/28928


   




[GitHub] [spark] HyukjinKwon commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649476145


   I think this should be ported back through branch-2.4 ...




[GitHub] [spark] SparkQA commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649477489


   **[Test build #124513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124513/testReport)** for PR 28928 at commit [`612426e`](https://github.com/apache/spark/commit/612426e34e29229aa2187e8775ee0e453288c75d).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #28928:
URL: https://github.com/apache/spark/pull/28928#discussion_r445977049



##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -413,7 +413,7 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
 
         # Slice the DataFrame to be batched
         step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
-        pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+        pdf_slices = (pdf.iloc[start:start + step] for start in xrange(0, len(pdf), step))

Review comment:
       As far as I can tell, yes.






[GitHub] [spark] AmplabJenkins commented on pull request #28928: [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28928:
URL: https://github.com/apache/spark/pull/28928#issuecomment-649491706





