You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ConeyLiu <gi...@git.apache.org> on 2017/10/27 08:02:10 UTC

[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

GitHub user ConeyLiu opened a pull request:

    https://github.com/apache/spark/pull/19586

    [SPARK-22367][CORE] Separate the serialization of class and object for iteraor

    ## What changes were proposed in this pull request?
    
    Becuase they are all the same class for an iterator.  So there is no need write class information for every record in the iterator. We only need write the class information once at the serialization beginning, also only need read the class information once for deserialization.
    
    In this patch, we separate the serialization of class and object for an iterator serialized by Kryo. This can improve the performance of the serialization and deserialization, and save the space.
    
    Test case:
    ```scala
        val conf = new SparkConf().setAppName("Test for serialization")
        val sc = new SparkContext(conf)
    
        val random = new Random(1)
        val data = sc.parallelize(1 to 1000000000).map { i =>
          Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
        }.persist(StorageLevel.OFF_HEAP)
    
        var start = System.currentTimeMillis()
        data.count()
        println("First time: " + (System.currentTimeMillis() - start))
    
        start = System.currentTimeMillis()
        data.count()
        println("Second time: " + (System.currentTimeMillis() - start))
    
    ```
    
    Test result:
    
    The size of serialized:
    before: 34.3GB
    after: 17.5GB
    
    | before(cal+serialization)| before(deserialization)| after(cal+serialization)| after(deserialization) |
    | ------| ------ | ------ | ------ | 
    | 63869| 21882|  45513| 15158|
    | 59368| 21507|  51683| 15524|
    | 66230| 21481|  62163| 14903|
    | 62399| 22529|  52400| 16255|
    
    | 137564.2 | 136990.8 | 1.004186 | 
    
    ## How was this patch tested?
    
    Existing UT.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ConeyLiu/spark kryo

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19586
    
----
commit c681e81f9d49b3558c91a3b981504159bbeff910
Author: Xianyang Liu <xi...@intel.com>
Date:   2017-10-26T06:37:04Z

    serialize object and class seperately for iterator

commit 640ad5e1d12d1137f4c979a1e75dbdbd713e14de
Author: Xianyang Liu <xi...@intel.com>
Date:   2017-10-26T06:42:58Z

    Merge remote-tracking branch 'spark/master' into kryo

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    and in `ml`, if we want to register class before running algos, Some other classes like `LabeledPoint`, `Instance` also need registered. 
    and there're some class temporary defined in some ml algos (when using RDD).
    Maybe better way is to register classes automatically if possible. E.g. when using kyro, auto detecting which classes are used frequently, and register them?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi @cloud-fan, thanks for reviewing. There are some errors about `UnsafeShuffleWrite` need further fixed. I am not familiar with this code, so I need some time.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147371549
  
    --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
    @@ -205,11 +205,45 @@ class KryoSerializationStream(
     
       private[this] var kryo: Kryo = serInstance.borrowKryo()
     
    +  // This is only used when we write object and class separately.
    +  var classWrote = false
    --- End diff --
    
    Why not write that state as an iterator of stuff, if that's how it behaves? rather than duplicate code. 'values' is already an iterator there.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][CORE] Separate the serialization of class ...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    @srowen Thanks for the reviewing. 
    What do you meaning here?
    > I'm trying to think if there's any case where we intend to support kryo/java serialized objects from 2.x in 2.y.
    
    After you registered. It still writes the class (not class full name but just a class ID) if you call `writeObjectAndClass`. In order to get the class id, there is also need some calculation. And then write the class ID and object.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147565429
  
    --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
    @@ -205,11 +205,45 @@ class KryoSerializationStream(
     
       private[this] var kryo: Kryo = serInstance.borrowKryo()
     
    +  // This is only used when we write object and class separately.
    +  var classWrote = false
    +
       override def writeObject[T: ClassTag](t: T): SerializationStream = {
         kryo.writeClassAndObject(output, t)
    --- End diff --
    
    I was expecting kryo to buffer the distinct classes and only store an identifier/pointer for duplicated classes. Even if we write object and class every time, the overhead should be small. This is not true? 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    also cc @WeichenXu123 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    We can config the class to register by config `spark.kryo.classesToRegister`, does it need to add into spark code ?
    



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Currently, I use it directly. Maybe this is suitable for some special case which has same type data, such as ml or else. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147361808
  
    --- Diff: pom.xml ---
    @@ -133,7 +133,7 @@
         <hive.parquet.version>1.6.0</hive.parquet.version>
         <jetty.version>9.3.20.v20170531</jetty.version>
         <javaxservlet.version>3.1.0</javaxservlet.version>
    -    <chill.version>0.8.4</chill.version>
    +    <chill.version>0.9.2</chill.version>
    --- End diff --
    
    Why do you need to update it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147367241
  
    --- Diff: pom.xml ---
    @@ -133,7 +133,7 @@
         <hive.parquet.version>1.6.0</hive.parquet.version>
         <jetty.version>9.3.20.v20170531</jetty.version>
         <javaxservlet.version>3.1.0</javaxservlet.version>
    -    <chill.version>0.8.4</chill.version>
    +    <chill.version>0.9.2</chill.version>
    --- End diff --
    
    Not necessary. Chill 0.9.2 uses kryo 4.0. I can change it back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi @cloud-fan, for most case the data type should be same. So I think this optimization is valuable, because it can save the space and cpu resource considerable. What about setting a flag for the RDD, which indicates whether the RDD only has the same types. If it'st not valid, could we putting it to the ml package for special serializer, then user could configure it. But for this case, there must be provided the exactly classtag of the RDD for serialization due to the relocation of unsafeshufflewrite.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    OK, I can understand your concern. There is huge gc problem for K-means workload, it occupied about 10-20% percent. The source data is cached in memory, there is even worse performance when the source data can't be cached in memory. So we try the source data to off-heap. However, the training time even worse after using the off-heap memory.  Because the gc only occupied about 10-20% with on-heap memory, while deserialization occupied about 30-40% with off-heap memory even if the gc problem solved.  
    <img width="960" alt="capture" src="https://user-images.githubusercontent.com/12733256/32313752-5dbec220-bfdf-11e7-8b49-d5daa47cd50f.PNG">
    
    <img width="960" alt="capture" src="https://user-images.githubusercontent.com/12733256/32313788-824b8470-bfdf-11e7-9b59-aea26e9c6c0a.PNG">
    
    You can see the pic, the `readClass` occupied about 13% .  So I opened this pr.  With this path test result, the total time (loading data + training kmeans model) saved about 10% time.  The above picture is only about training phase, not include the loading source data phase,  so the improvement should be larger as we expected. And I plan to optimize the `readObjectOrNull` after this.
    
    Also, I found the `Vector` is not registered, so I will test the performance with the registered vector. This maybe can reduce the cpu occupied, but can't save the serialized memory.
     


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    @ConeyLiu what about the below example, does your implementation support this?
    
    ```scala
    
    trait Base { val name: String }
    case class A(name: String) extends Base
    case class B(name: String) extends Base
    
    sc.parallelize(Seq(A("a"), B("b"))).map { i => (i, 1) }.reduceByKey(_ + _).collect()
    ```
    
    Here not all the elements have same class type, does your PR here support such scenario?



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][WIP][CORE] Separate the serializati...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu closed the pull request at:

    https://github.com/apache/spark/pull/19586


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    OK to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][CORE] Separate the serialization of class ...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    One executor, the configuration as follows:
    the script:
    ```shell
    ${SPARK_HOME}/bin/spark-submit \
            --class com.intel.KryoTest  \
            --master yarn                   \
            --deploy-mode  cluster           \
            --conf spark.memory.offHeap.enabled=true   \
            --conf spark.memory.offHeap.size=50g       \
            --conf spark.serializer=org.apache.spark.serializer.KryoSerializer  \
            --driver-memory 5G         \
            --driver-cores  10        \
            --executor-memory  40G          \
            --executor-cores  20            \
            --num-executors 1               \
    
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147361751
  
    --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
    @@ -205,11 +205,45 @@ class KryoSerializationStream(
     
       private[this] var kryo: Kryo = serInstance.borrowKryo()
     
    +  // This is only used when we write object and class separately.
    +  var classWrote = false
    --- End diff --
    
    I don't see why you need this state and need to repeat the logic about writing / not writing classes everywhere. Surely this just goes in one writeAll / asIterator pair?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    For these cases, they can write their own serializer and set it via `spark.serializer`. I don't think Spark should have built-in support for them because it's not general.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147371400
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
    @@ -376,7 +382,17 @@ private[spark] class MemoryStore(
     
         // Unroll this block safely, checking whether we have exceeded our threshold
         while (values.hasNext && keepUnrolling) {
    -      serializationStream.writeObject(values.next())(classTag)
    +      val value = values.next()
    +      if (kryoSerializationStream != null) {
    +        if (!kryoSerializationStream.classWrote) {
    +          kryoSerializationStream.writeClass(value.getClass)
    --- End diff --
    
    @srowen you can see here. Here we don't use the writeAll, because we need acquire memory according to the written size.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147565323
  
    --- Diff: pom.xml ---
    @@ -133,7 +133,7 @@
         <hive.parquet.version>1.6.0</hive.parquet.version>
         <jetty.version>9.3.20.v20170531</jetty.version>
         <javaxservlet.version>3.1.0</javaxservlet.version>
    -    <chill.version>0.8.4</chill.version>
    +    <chill.version>0.9.2</chill.version>
    --- End diff --
    
    please change it back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][WIP][CORE] Separate the serializati...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147709649
  
    --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
    @@ -205,11 +205,45 @@ class KryoSerializationStream(
     
       private[this] var kryo: Kryo = serInstance.borrowKryo()
     
    +  // This is only used when we write object and class separately.
    +  var classWrote = false
    +
       override def writeObject[T: ClassTag](t: T): SerializationStream = {
         kryo.writeClassAndObject(output, t)
    --- End diff --
    
    From the code, it just write a `varInt` if the class have been registered. And also there need some calculation for getting the `varInt`. But from the test, the overhead looks more serious than I expected.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi @jerryshao, Thanks for the reminder, it doesn't support it. I'm sorry I did not take that into account.  How about using configuration to determine whether we should use `SerializerInstance#serializeStreamForClass[T]`. For most case the data type should be same.
    
    Can you give some advice? Also cc @cloud-fan @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Using configurations seems not so elegant, also configuration is application based, how would you turn off/on this feature in the runtime? Sorry I cannot give you a good advice, maybe kryo's solution is the best option for general case.  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    You can call `SparkConf#registerKryoClasses` manually, maybe we can also register these ml classes automatically in `KryoSerializer.newKryo` via reflection.
    
    cc @yanboliang @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][CORE] Separate the serialization of class ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    looking at the `SerializationStream` interface, I think it's designed for read/write objects of different classes, so your optimization should not be applied there.
    
    Instead, I think we should introduce `SerializerInstance#serializeStreamForClass[T]`, which returns `ClassSpecificSerializationStream[T]` that is designed for writing objects of same class.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][CORE] Separate the serialization of class ...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi, @cloud-fan @jiangxb1987 @chenghao-intel. Would you mind take a look? Thanks a lot. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147368368
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
    @@ -860,9 +876,26 @@ private[storage] class PartiallySerializedBlock[T](
         ByteStreams.copy(unrolledBuffer.toInputStream(dispose = true), os)
         memoryStore.releaseUnrollMemoryForThisTask(memoryMode, unrollMemory)
         redirectableOutputStream.setOutputStream(os)
    +
    +    // Whether we use Kryo serialization
    +    var kryoSerializationStream: KryoSerializationStream = null
    +    if (serializationStream.isInstanceOf[KryoSerializationStream]) {
    +      kryoSerializationStream = serializationStream.asInstanceOf[KryoSerializationStream]
    +    }
    +
         while (rest.hasNext) {
    -      serializationStream.writeObject(rest.next())(classTag)
    +      val value = rest.next()
    +      if (kryoSerializationStream != null) {
    +        if (!kryoSerializationStream.classWrote) {
    +          kryoSerializationStream.writeClass(value.getClass)
    --- End diff --
    
    @srowen you can see here. Here we don't use the writeAll, because we need acquire memory according to the written size.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    I tend to agree with @cloud-fan , I think you can implement your own serializer out of Spark to be more specialized for your application, that will definitely be more efficient than the built-in one. But for the Spark's default solution, it should be general enough to cover all cases. Setting a flag or a configuration is not intuitive enough from my understanding.
    
    And for ML, can you please provide an example about how this could be improved with your approach. From my understanding you approach is more useful when leverage custom class definition, like `Person` in your example. But for ML/SQL cases, all the types should be predefined or primitives, will that improved a lot?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147346131
  
    --- Diff: pom.xml ---
    @@ -133,7 +133,7 @@
         <hive.parquet.version>1.6.0</hive.parquet.version>
         <jetty.version>9.3.20.v20170531</jetty.version>
         <javaxservlet.version>3.1.0</javaxservlet.version>
    -    <chill.version>0.8.4</chill.version>
    +    <chill.version>0.9.2</chill.version>
    --- End diff --
    
    I am not sure whether this is should be changed. If it is unreasonable, I can change it back.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Hi @cloud-fan, @jerryshao. The problem of `writeClass` and `readClass` can be solved by register the class: Vector, DenseVector, SparseVector.  The follow is the test results:
    ```scala
    val conf = new SparkConf().setAppName("Vector Register Test")
        conf.registerKryoClasses(Array(classOf[Vector], classOf[DenseVector], classOf[SparseVector]))
        val sc = new SparkContext(conf)
    
        val sourceData = sc.sequenceFile[LongWritable, VectorWritable](args(0))
          .map { case (k, v) =>
            val vector = v.get()
            val tmpVector = new Array[Double](v.get().size())
            for (i <- 0 until vector.size()) {
              tmpVector(i) = vector.get(i)
            }
            Vectors.dense(tmpVector)
          }
    
        sourceData.persist(StorageLevel.OFF_HEAP)
        var start = System.currentTimeMillis()
        sourceData.count()
        println("First: " + (System.currentTimeMillis() - start))
        start = System.currentTimeMillis()
        sourceData.count()
        println("Second: " + (System.currentTimeMillis() - start))
    
        sc.stop()
    ```
    
    
    Results:
    serialized size:  before 38.4GB after: 30.5GB
    First time: before 93318ms    after:  80708ms
    Second time:  before: 5870ms    after: 3382ms
    
    Those classes are very common for ML,  and also `Matrix`, `DenseMatrix` and `SparseMatrix` too. I'm not sure whether we should register those classes in core directly,  because this could introduce extra jar dependency.  So could you give some advice? Or else we just remind in the ml doc?
    
    The reason shoule be the problem of kryo, it  will write the full class name instead of the classID if the class is not registered.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147368002
  
    --- Diff: core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
    @@ -205,11 +205,45 @@ class KryoSerializationStream(
     
       private[this] var kryo: Kryo = serInstance.borrowKryo()
     
    +  // This is only used when we write object and class separately.
    +  var classWrote = false
    --- End diff --
    
    Yeah, it used for `writeAll / asIterator`.  But for `MemoryStorea.putIteratorAsBytes`, we don't use the writeAll, we use this state to indicate whether we have written the class first.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by ConeyLiu <gi...@git.apache.org>.
Github user ConeyLiu commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Thanks for the suggestion, I re-raised a pr to solve this problem. Close it now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][CORE] Separate the serialization of class ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19586: [SPARK-22367][WIP][CORE] Separate the serialization of c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19586
  
    I think this problem will go away after mllib migrate to Spark SQL completely. For now I think we can make the serializer config job-wise and set this special serializer for ml jobs.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19586: [SPARK-22367][CORE] Separate the serialization of...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19586#discussion_r147565331
  
    --- Diff: pom.xml ---
    @@ -133,7 +133,7 @@
         <hive.parquet.version>1.6.0</hive.parquet.version>
         <jetty.version>9.3.20.v20170531</jetty.version>
         <javaxservlet.version>3.1.0</javaxservlet.version>
    -    <chill.version>0.8.4</chill.version>
    +    <chill.version>0.9.2</chill.version>
    --- End diff --
    
    library upgrading deserves a separated PR.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org