Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/05 04:03:39 UTC

[GitHub] [spark] JoshRosen edited a comment on pull request #34158: [SPARK-36705][FOLLOW-UP] Support the case when user's classes need to register for Kryo serialization

JoshRosen edited a comment on pull request #34158:
URL: https://github.com/apache/spark/pull/34158#issuecomment-934039150


   Stepping back a bit, I have a more fundamental conceptual question about the push-based shuffle configurations: could the decision of whether to use push-based shuffle be performed on a per-ShuffleDependency basis rather than a per-app basis?
   
   The `spark.serializer` configuration controls the default serializer, but under the hood `ShuffleDependency` instances are sometimes automatically configured with non-default serializers (see the sketch after this list):
   
   - Pure SQL / DataFrame / Dataset code [uses `UnsafeRowSerializer`](https://github.com/apache/spark/blob/41a16ebf1196bec86aec104e72fd7fb1597c0073/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L132-L133).
   - The RDD shuffle path [will automatically use Kryo](https://github.com/apache/spark/blame/41a16ebf1196bec86aec104e72fd7fb1597c0073/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L99-L102) when shuffling RDDs containing only primitive types.
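   
   To make the auto-pick concrete, here's a minimal, self-contained sketch of that per-shuffle serializer choice. It only imitates the linked `SerializerManager` logic; the object name, the string return type, and the exact set of class tags are stand-ins, not Spark's API. The idea is that Kryo is known to be safe for primitives, primitive arrays, and Strings even without user registration, so those shuffles get it automatically:
   
   ```scala
   import scala.reflect.ClassTag
   
   // Sketch only: imitates SerializerManager's auto-pick, not Spark's code.
   object SerializerAutoPickSketch {
     // Class tags for which Kryo is safe without any user registration.
     private val kryoSafeTags: Set[ClassTag[_]] = Set(
       ClassTag.Int, ClassTag.Long, ClassTag.Double,
       ClassTag(classOf[Array[Int]]), ClassTag(classOf[Array[Long]]),
       ClassTag(classOf[String]))
   
     private def canUseKryo(ct: ClassTag[_]): Boolean = kryoSafeTags.contains(ct)
   
     // Loosely mirrors SerializerManager.getSerializer(keyTag, valueTag):
     // auto-pick Kryo only when both key and value types are known-safe.
     def pickSerializer[K: ClassTag, V: ClassTag](default: String): String =
       if (canUseKryo(implicitly[ClassTag[K]]) && canUseKryo(implicitly[ClassTag[V]]))
         "KryoSerializer"
       else default
   
     def main(args: Array[String]): Unit = {
       println(pickSerializer[Int, Long]("JavaSerializer"))    // KryoSerializer
       println(pickSerializer[Int, AnyRef]("JavaSerializer"))  // JavaSerializer
     }
   }
   ```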
   
   Spark's shuffle code contains a [serialized sorting mode](https://github.com/apache/spark/blame/fa1805db48ca53ece4cbbe42ebb2a9811a142ed2/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L37-L71) which can only be used when `serializer.supportsRelocationOfSerializedObjects == true`. The decision of whether to use serialized sorting mode is made on a per-ShuffleDependency (not per-app) basis, by [using a different ShuffleHandle](https://github.com/apache/spark/blame/fa1805db48ca53ece4cbbe42ebb2a9811a142ed2/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L106-L109).
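   
   For reference, that per-dependency decision looks roughly like the sketch below. It is simplified from `SortShuffleManager.registerShuffle`: the `Sketch` names are mine, and the real code also has a bypass-merge-sort handle that I've omitted:
   
   ```scala
   // Simplified per-ShuffleDependency handle selection. Each dependency
   // carries its own serializer, so the serialized (relocation-based)
   // path is chosen per shuffle, not per application.
   sealed trait HandleSketch
   case object SerializedShuffleHandleSketch extends HandleSketch
   case object BaseShuffleHandleSketch extends HandleSketch
   
   // Stand-in for the fields of ShuffleDependency that the decision reads.
   final case class DependencySketch(
       serializerSupportsRelocation: Boolean, // serializer.supportsRelocationOfSerializedObjects
       mapSideCombine: Boolean,
       numPartitions: Int)
   
   object SortShuffleManagerSketch {
     // Serialized mode packs the partition id into 24 bits of the record
     // pointer, hence the 2^24 partition cap.
     val MaxPartitionsForSerializedMode: Int = 1 << 24
   
     def registerShuffle(dep: DependencySketch): HandleSketch =
       if (dep.serializerSupportsRelocation &&
           !dep.mapSideCombine &&
           dep.numPartitions <= MaxPartitionsForSerializedMode)
         SerializedShuffleHandleSketch
       else
         BaseShuffleHandleSketch
   }
   ```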
   
   If we did something similar for push-based shuffle, then users could enable PBS and still get some benefit even if they couldn't (or didn't want to) use Kryo as the default serializer. With that approach we wouldn't need to construct a default serializer instance: at `SparkEnv` creation or executor startup we'd only need to know whether the other prerequisites of PBS were met (configuration enabled, shuffle service enabled, IO encryption disabled, etc.). We'd still need to check the serializer's properties, but that check would happen on the driver and its result would somehow be encoded in the `ShuffleHandle`; a strawman sketch follows.
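   
   Purely as a strawman, the driver-side version could look something like this. All of these names are hypothetical; nothing below exists in Spark today:
   
   ```scala
   // Hypothetical: the driver checks the dependency's serializer once at
   // shuffle registration and records the PBS decision in the handle, so
   // executors read the handle instead of re-deriving it from app config.
   final case class PbsShuffleHandleSketch(shuffleId: Int, shuffleMergeEnabled: Boolean)
   
   object PbsRegistrationSketch {
     // App-wide prerequisites (PBS config flag, shuffle service, IO
     // encryption disabled, ...) checked once at SparkEnv creation time;
     // assumed already validated here.
     val appLevelPrereqsMet: Boolean = true
   
     // Per-dependency check, run on the driver, where this shuffle's
     // actual serializer instance is available.
     def registerShuffle(
         shuffleId: Int,
         serializerSupportsRelocation: Boolean): PbsShuffleHandleSketch =
       PbsShuffleHandleSketch(
         shuffleId,
         shuffleMergeEnabled = appLevelPrereqsMet && serializerSupportsRelocation)
   }
   ```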
   
   Is that possible given the architecture of push-based shuffle?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


