Posted to dev@spark.apache.org by Sandy Ryza <sa...@cloudera.com> on 2014/11/07 10:05:34 UTC

proposal / discuss: multiple Serializers within a SparkContext?

Hey all,

Was messing around with Spark and Google FlatBuffers for fun, and it got me
thinking about Spark and serialization.  I know there's been work / talk
about in-memory columnar formats for Spark SQL, so maybe there are ways to
provide this flexibility already that I've missed?  Either way, my thoughts:

Java and Kryo serialization are really nice in that they require almost no
extra work on the part of the user.  They can also represent complex object
graphs with cycles etc.

There are situations where other serialization frameworks are more
efficient:
* A Hadoop Writable-style format that delineates key-value boundaries and
allows for raw comparisons can greatly speed up some shuffle operations by
entirely avoiding deserialization until the object hits user code (a rough
sketch of such a raw comparison follows this list).  Writables also probably
ser / deser faster than Kryo.
* "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
tradeoff between (1) Java objects that offer fast access but take lots of
space and stress GC and (2) Kryo-serialized buffers that are more compact
but take time to deserialize.
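
To make the "raw comparison" point concrete, here is a small, self-contained
Scala sketch (illustrative only, not Hadoop or Spark code) of ordering records
by comparing their serialized bytes directly, the way a Writable-style raw
comparator lets a sort avoid deserializing keys:

    object RawComparison {
      // Lexicographic, unsigned comparison of two serialized keys.
      def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
        val len = math.min(a.length, b.length)
        var i = 0
        while (i < len) {
          val cmp = (a(i) & 0xff) - (b(i) & 0xff)
          if (cmp != 0) return cmp
          i += 1
        }
        a.length - b.length
      }

      def main(args: Array[String]): Unit = {
        // Keys encoded so that byte order matches logical order
        // (e.g. big-endian fixed-width integers), as Writables often are.
        val serializedKeys = Seq(
          Array[Byte](0, 0, 0, 42),
          Array[Byte](0, 0, 0, 7),
          Array[Byte](0, 0, 1, 0))
        val sorted = serializedKeys.sortWith((a, b) => compareBytes(a, b) < 0)
        sorted.foreach(k => println(k.mkString(",")))
      }
    }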

The drawbacks of these frameworks are that they require more work from the
user to define types, and that they're more restrictive in the reference
graphs they can represent.

In large applications, there are probably a few points where a
"specialized" serialization format is useful. But requiring Writables
everywhere because they're needed in a particularly intense shuffle is
cumbersome.

In light of that, would it make sense to enable varying Serializers within
an app? It could make sense to choose a serialization framework both based
on the objects being serialized and what they're being serialized for
(caching vs. shuffle).  It might be possible to implement this underneath
the Serializer interface with some sort of multiplexing serializer that
chooses between subserializers.
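
For what it's worth, here is a toy sketch of that multiplexing idea. It uses a
simplified stand-in trait rather than Spark's real Serializer /
SerializerInstance interface (which also has stream and ClassLoader variants);
the only point is to show dispatch on the runtime class, plus a tag byte so
deserialization can route back to the right sub-serializer:

    trait SimpleSerializer {
      def serialize(obj: AnyRef): Array[Byte]
      def deserialize(bytes: Array[Byte]): AnyRef
    }

    // Routes each object to a sub-serializer chosen by its runtime class and
    // prefixes the payload with that sub-serializer's index.
    class MultiplexingSerializer(
        subs: IndexedSeq[(Class[_], SimpleSerializer)],
        default: SimpleSerializer) extends SimpleSerializer {

      private def indexFor(obj: AnyRef): Int =
        subs.indexWhere { case (cls, _) => cls.isInstance(obj) }

      override def serialize(obj: AnyRef): Array[Byte] = {
        val idx = indexFor(obj)  // -1 becomes tag 0xff, meaning "use default"
        val payload =
          if (idx >= 0) subs(idx)._2.serialize(obj) else default.serialize(obj)
        (idx & 0xff).toByte +: payload
      }

      override def deserialize(bytes: Array[Byte]): AnyRef = {
        val tag = bytes(0) & 0xff
        val payload = bytes.drop(1)
        if (tag == 0xff) default.deserialize(payload)
        else subs(tag)._2.deserialize(payload)
      }
    }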

Nothing urgent here, but curious to hear others' opinions.

-Sandy

Re: proposal / discuss: multiple Serializers within a SparkContext?

Posted by Sandy Ryza <sa...@cloudera.com>.
Ah awesome.  Passing custom serializers when persisting an RDD is exactly
one of the things I was thinking of.

-Sandy

Re: proposal / discuss: multiple Serializers within a SparkContext?

Posted by Matei Zaharia <ma...@gmail.com>.
Yup, the JIRA for this was https://issues.apache.org/jira/browse/SPARK-540 (one of our older JIRAs). I think it would be interesting to explore this further. Basically, the way to add it to the API would be to add a version of persist() that takes a class other than StorageLevel, say StorageStrategy, which allows specifying a custom serializer or perhaps even a transformation to turn each partition into another representation before saving it. It would also be interesting if this could work directly on an InputStream or ByteBuffer to deal with off-heap data.
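
To make that concrete, here is a purely hypothetical sketch of what such a
StorageStrategy could look like; none of these names exist in Spark's API
today. It only illustrates persist() taking a richer description than a
StorageLevel: a level, plus an optional per-RDD serializer, plus an optional
per-partition re-encoding step.

    import org.apache.spark.serializer.Serializer
    import org.apache.spark.storage.StorageLevel

    case class StorageStrategy(
        level: StorageLevel = StorageLevel.MEMORY_ONLY_SER,
        serializer: Option[Serializer] = None,
        // Re-encode a partition (e.g. into a columnar or FlatBuffers layout)
        // before it is cached.
        transform: Option[Iterator[Any] => Iterator[Any]] = None)

    // Imagined RDD method, by analogy with the existing persist(StorageLevel):
    //   def persist(strategy: StorageStrategy): this.type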

One issue we've found with our current Serializer interface, by the way, is that a lot of type information is lost when you pass data to it, so the serializers spend a fair bit of time figuring out what class each written object is. With this model, it would be possible for a serializer to know that all its data is of one type, which is pretty cool, but we might also consider ways of expanding the current Serializer interface to take more info.
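
As a toy illustration of that last point (not Spark code): when the element
type is known up front, a stream can record the class descriptor once instead
of tagging every object, and reads need no per-record class lookup.

    import java.io.DataOutputStream
    import scala.reflect.ClassTag

    object TypedStreamSketch {
      // Generic stream: every record carries its class name.
      def writeUntyped(out: DataOutputStream, records: Seq[AnyRef]): Unit =
        records.foreach { r =>
          out.writeUTF(r.getClass.getName)  // paid once per record
          out.writeUTF(r.toString)          // stand-in for the real payload
        }

      // Typed stream: the class is written once, up front.
      def writeTyped[T](out: DataOutputStream, records: Seq[T])
                       (implicit ct: ClassTag[T]): Unit = {
        out.writeUTF(ct.runtimeClass.getName)  // paid once per stream
        records.foreach(r => out.writeUTF(r.toString))
      }
    }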

Matei

Re: proposal / discuss: multiple Serializers within a SparkContext?

Posted by Reynold Xin <rx...@databricks.com>.
Technically you can already use a custom serializer for each shuffle operation
(it is part of the ShuffledRDD). I've seen Matei suggest in the past, on JIRA
issues (or GitHub), a "storage policy" in which you can specify how data
should be stored. I think that would be a great API to have in the long run.
Designing it won't be trivial though.
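
For reference, a minimal sketch of that per-shuffle hook: ShuffledRDD exposes
setSerializer(), so a single shuffle can use a different serializer than the
SparkContext-wide default. Signatures below follow the Spark 1.x API as I
remember it, so treat the exact shapes as an assumption.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.ShuffledRDD
    import org.apache.spark.serializer.KryoSerializer

    object PerShuffleSerializer {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("per-shuffle-serializer").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i.toLong))

        // Build the shuffle dependency explicitly and override its serializer.
        val shuffled = new ShuffledRDD[Int, Long, Long](pairs, new HashPartitioner(4))
          .setSerializer(new KryoSerializer(conf))

        println(shuffled.count())
        sc.stop()
      }
    }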

