Posted to reviews@spark.apache.org by "itholic (via GitHub)" <gi...@apache.org> on 2023/09/05 02:16:12 UTC

[GitHub] [spark] itholic commented on pull request #42798: [SPARK-43295][PS] Support string type columns for `DataFrameGroupBy.sum`

itholic commented on PR #42798:
URL: https://github.com/apache/spark/pull/42798#issuecomment-1705851281

   @zhengruifeng I think the problem is that Pandas computes the concatenation without sorting, so the result can differ when the index is not sorted, as below:
   ## Problem
   
   **Pandas**
   ```python
   >>> pdf
      A  B
   4  a  1
   3  b  2
   2  c  3
   >>> pdf.sum()
   A    abc
   B      6
   dtype: object
   ```
   
   **Pandas API on Spark**
   ```python
   >>> psdf
      A  B
   4  a  1
   3  b  2
   2  c  3
   >>> psdf.sum()
   A    cba  # we internally sort the index, so the result differs from Pandas
   B      6
   dtype: object
   ```
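   The discrepancy can be reproduced in plain Pandas (a minimal sketch; the values and index match the example above):

   ```python
   import pandas as pd

   pdf = pd.DataFrame({"A": ["a", "b", "c"], "B": [1, 2, 3]}, index=[4, 3, 2])

   # Pandas concatenates string values in row order, ignoring the index:
   print(pdf["A"].sum())               # 'abc'

   # Sorting the index first, as pandas-on-Spark does internally,
   # changes the concatenation order:
   print(pdf.sort_index()["A"].sum())  # 'cba'
   ```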
   
   ## Solution
   I think for now we can pick one of the three options below:
   1. We can document a warning note as below:
       ```
       The result for string type columns is non-deterministic, since the implementation depends on PySpark's `collect_list` API, which is itself non-deterministic.
       ```
   2. We can `collect_list` both the values and the indices, sort by the indices before `concat_ws` as you suggested, and document a warning note as below:
       ```
       The result for string type columns can differ from Pandas when the index is not sorted, because we always sort the index before computing; the implementation depends on PySpark's `collect_list` API, which is non-deterministic.
       ```
   3. We can keep string type columns unsupported, as they are now, and add a note explaining why, as below:
       ```
       String type columns are not supported for now, because they might yield non-deterministic results, unlike in Pandas.
       ```
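   The core of option 2 can be sketched in plain Python (the PySpark version would `collect_list` a struct of index and value, sort, then `concat_ws`); `sum_strings_sorted_by_index` is a hypothetical helper name, not part of any existing API:

   ```python
   def sum_strings_sorted_by_index(pairs):
       """Deterministically concatenate string values by sorting on index first.

       `pairs` is a list of (index, value) tuples that may arrive in any
       order, as collect_list would return them.
       """
       return "".join(value for _, value in sorted(pairs))

   # Regardless of arrival order, the result is always the index-sorted one:
   print(sum_strings_sorted_by_index([(4, "a"), (3, "b"), (2, "c")]))  # 'cba'
   print(sum_strings_sorted_by_index([(2, "c"), (4, "a"), (3, "b")]))  # 'cba'
   ```

   This makes the result deterministic, but still different from Pandas whenever the index is not already sorted, which is why option 2 pairs the change with a warning note.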
   
   WDYT? Also cc @HyukjinKwon, @ueshin, @xinrong-meng: what strategy should we take for this situation? I believe the same rule should apply to similar cases that already exist or may arise in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
