Posted to issues@uniffle.apache.org by "zjf2012 (via GitHub)" <gi...@apache.org> on 2023/03/06 02:06:23 UTC

[GitHub] [incubator-uniffle] zjf2012 commented on pull request #637: [#615] improvement: Reduce task binary by removing 'partitionToServers' from RssShuffleHandle

zjf2012 commented on PR #637:
URL: https://github.com/apache/incubator-uniffle/pull/637#issuecomment-1455318109

   > > @advancedxy @xianjingfeng Do you have another suggestion?
   > 
   > I'm not sure, this pr introduced some quite complex logic to broadcast shuffle handle info.
   > 
   > If I was implementing this feature, I would just use Kryo by default, and in RSSShuffleManager indicating users to either turn off `spark.kryo.registerRequired` (which is explicitly set by user) or manually register RssShuffleHandle.
   
   Just to make sure we are on the same page: registering RssShuffleHandle with the Kryo serializer alone doesn't resolve this issue. For a job with 10000 partitions, each task will still have a binary of more than 670KB. Each task will also spend quite noticeable time repeatedly deserializing the partition -> shuffle server mappings, as shown in https://docs.google.com/document/d/1TZ-3Mgwj9j7n1mMyCrS3sskFv_uOtUtXl1oF9Y_oMOw/edit?usp=sharing.
   
   RssShuffleHandle is a field of ShuffleDependency, which is serialized into the task binary by the code below in DAGScheduler. Without broadcasting ShuffleHandleInfo, it's hard to pull it out and avoid duplicating ShuffleHandleInfo.
   
   ```scala
   RDDCheckpointData.synchronized {
     taskBinaryBytes = stage match {
       case stage: ShuffleMapStage =>
         JavaUtils.bufferToArray(
           closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
       case stage: ResultStage =>
         JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
     }

     partitions = stage.rdd.partitions
   }
   ```
   
   I think broadcast itself is quite simple and efficient. I don't see any better alternatives.
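
To illustrate the size argument, here is a minimal sketch that compares a handle embedding the whole partition -> shuffle server mapping with a handle that carries only a small reference to a mapping shipped separately (for example through a broadcast). `ServerInfo`, `FatHandle`, `SlimHandle` and `HandleSizeSketch` are hypothetical stand-ins, not the real Uniffle or Spark classes. The sketch uses plain Java serialization rather than Spark's closure serializer, and the server names and port are made up, so the absolute numbers it prints are not comparable to the 670KB figure above. Only the trend is meant to carry over.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a shuffle server address.
final case class ServerInfo(host: String, port: Int)

// Hypothetical handle that embeds the whole partition -> servers mapping,
// so the mapping is serialized along with the handle.
final case class FatHandle(shuffleId: Int, partitionToServers: Map[Int, List[ServerInfo]])

// Hypothetical handle that only carries an id for a mapping shipped separately.
final case class SlimHandle(shuffleId: Int, mappingRefId: Long)

object HandleSizeSketch {
  // Serialized size in bytes using plain Java serialization (an assumption;
  // Spark's closure serializer is not used here).
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  // Builds a made-up mapping with one server per partition, drawn from 50 servers.
  def buildMapping(numPartitions: Int): Map[Int, List[ServerInfo]] =
    (0 until numPartitions).map { p =>
      p -> List(ServerInfo(s"server-${p % 50}", 19999))
    }.toMap

  def main(args: Array[String]): Unit = {
    val fat = FatHandle(0, buildMapping(10000))
    val slim = SlimHandle(0, 42L)
    println(s"fat handle:  ${serializedSize(fat)} bytes")
    println(s"slim handle: ${serializedSize(slim)} bytes")
  }
}
```

The embedded mapping grows with the number of partitions, while the slim handle stays the same size regardless of partition count, which is the property the broadcast approach relies on.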


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

