Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/26 02:28:32 UTC

[GitHub] [spark] zhengruifeng commented on a diff in pull request #36648: [SPARK-39268][SQL][WIP] AttachDistributedSequenceExec do not checkpoint childRDD with single partition

zhengruifeng commented on code in PR #36648:
URL: https://github.com/apache/spark/pull/36648#discussion_r882261957


##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -2256,9 +2256,12 @@ def sum_with_acc_frame(x) -> ps.DataFrame[np.float64, np.float64]:
             acc += 1
             return np.sum(x)
 
-        actual = psdf.groupby("d").apply(sum_with_acc_frame).sort_index()

Review Comment:
   The reason is:
   
   1. After this PR, the DataFrame is no longer cached, since it contains only one partition.
   2. `sort_index` performs a global sort, which includes a sampling step that triggers an action. This sampling causes the accumulator to be computed twice, which is a known issue (see https://issues.apache.org/jira/browse/SPARK-37487).
   
   There may be room for an optimization that converts a global sort on a single partition into a local sort on that partition, but I am not sure whether it is worthwhile.
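
To illustrate the double computation described above, here is a minimal pure-Python sketch. It is not Spark code: `Accumulator`, `LazyPartition`, `global_sort`, and `run` are hypothetical stand-ins, and the two-pass sort is only an assumed simplification of a sample-then-sort global sort. It shows why a side effect in an uncached upstream computation is counted twice, and once when the data is cached.

```python
class Accumulator:
    """Hypothetical stand-in for a Spark accumulator: a plain counter."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


class LazyPartition:
    """Hypothetical stand-in for a lazily evaluated single partition."""

    def __init__(self, compute, cached=False):
        self._compute = compute
        self._cached = cached
        self._rows = None

    def rows(self):
        # Without caching, every access re-runs the upstream computation.
        if self._cached and self._rows is not None:
            return self._rows
        rows = self._compute()
        if self._cached:
            self._rows = rows
        return rows


def global_sort(partition):
    # Pass 1: sample the data to choose range boundaries (the extra action).
    _sample = partition.rows()[:1]
    # Pass 2: the sort itself, which evaluates the partition again.
    return sorted(partition.rows())


def run(cached):
    acc = Accumulator()

    def compute():
        data = [3, 1, 2]
        acc.add(len(data))  # side effect, analogous to `acc += 1` in the test
        return data

    result = global_sort(LazyPartition(compute, cached=cached))
    return result, acc.value


print(run(cached=False))  # ([1, 2, 3], 6): side effect counted twice
print(run(cached=True))   # ([1, 2, 3], 3): side effect counted once
```

In this sketch, a local sort on the single partition would need only one pass, so the side effect would run once even without caching.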



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org