Posted to reviews@spark.apache.org by djvulee <gi...@git.apache.org> on 2017/05/16 16:06:22 UTC

[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...

Github user djvulee commented on the issue:

    https://github.com/apache/spark/pull/15505
  
    >I agree with Kay that putting in a smaller change first is better, assuming it still has the performance gains. That doesn't preclude any further optimizations that are bigger changes.
    
    >I'm a little surprised that serializing tasks has much of an impact, given how little data is getting serialized. But if it really does, I feel like there is a much bigger optimization we're completely missing. Why are we repeating the work of serialization for each task in a taskset? The serialized data is almost exactly the same for every task. They only differ in the partition id (an int) and the preferred locations (which aren't even used by the executor at all).
    
    >Task serialization already leverages the idea of having info across all the tasks in the Broadcast for the task binary. We just need to use that same idea for all the rest of the task data that is sent to the executor. Then the only difference between the serialized task data sent to executors is the int for the partitionId. You'd serialize into a bytebuffer once, and then your per-task "serialization" becomes copying the buffer and modifying that int directly.
    
    @squito  I like this idea very much. I've encountered deserialization times that are too long (more than 10s for some tasks). Is there any PR that tries to solve this?
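
    The buffer-copy idea quoted above could be sketched roughly as follows. This is a minimal illustration, not Spark's actual TaskDescription encoding: the object and method names are hypothetical, and it assumes the partition id is written at a fixed, known offset (here, offset 0) in the serialized payload.

    ```scala
    import java.nio.ByteBuffer

    object TaskSerSketch {
      // Serialize the data shared by all tasks in the taskset ONCE,
      // reserving 4 bytes at offset 0 as a placeholder for partitionId.
      def serializeOnce(shared: Array[Byte]): Array[Byte] = {
        val buf = ByteBuffer.allocate(4 + shared.length)
        buf.putInt(0)   // placeholder for partitionId
        buf.put(shared)
        buf.array()
      }

      // Per-task "serialization": copy the template and patch the int
      // in place with an absolute put, instead of re-serializing.
      def forPartition(template: Array[Byte], partitionId: Int): Array[Byte] = {
        val copy = template.clone()
        ByteBuffer.wrap(copy).putInt(0, partitionId)
        copy
      }

      // What the executor side would read back out.
      def readPartitionId(bytes: Array[Byte]): Int =
        ByteBuffer.wrap(bytes).getInt(0)
    }
    ```

    With this shape, building the task bytes for each of N partitions costs one array copy plus one 4-byte write, rather than N full serializations of nearly identical data.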


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org