You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/18 14:29:22 UTC

[GitHub] [hudi] wzx140 commented on a diff in pull request #6977: [MINOR] Make sure all `HoodieRecord`s are appropriately serializable by Kryo

wzx140 commented on code in PR #6977:
URL: https://github.com/apache/hudi/pull/6977#discussion_r998304783


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/commmon/model/HoodieSparkRecord.java:
##########
@@ -299,6 +319,42 @@ public Comparable<?> getOrderingValue(Schema recordSchema, Properties props) {
     }
   }
 
+  /**
+   * NOTE: This method is declared final to make sure there's no polymorphism and therefore
+   *       JIT compiler could perform more aggressive optimizations
+   */
+  @Override
+  protected final void writeRecordPayload(InternalRow payload, Kryo kryo, Output output) {
+    // NOTE: [[payload]] could be null if record has already been deflated
+    UnsafeRow unsafeRow = convertToUnsafeRow(payload, schema);

Review Comment:
   HoodieInternalRow must hold UnsafeRow. Does it need to unsafe projection? We use HoodieInternalRow because we want to modify metaField.
   I don't really understand why we should make sure that SparkRecord only hold UnsafeRow. HoodieInternalRow hold UnsafeRow too. They basically make no difference.



##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieKey.java:
##########
@@ -27,13 +32,13 @@
  * - recordKey : a recordKey that acts as primary key for a record.
  * - partitionPath : the partition path of a record.
  */
-public class HoodieKey implements Serializable {
+public class HoodieKey implements Serializable, KryoSerializable {

Review Comment:
   We have used kryo as the serializer of shuffle. I wonder that Why we explicitly specify KryoSerializable? Same to HoodieRecord.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/commmon/model/HoodieSparkRecord.java:
##########
@@ -70,51 +74,67 @@
  *       need to be updated (ie serving as an overlay layer on top of [[UnsafeRow]])</li>
  * </ul>
  *
-
  */
-public class HoodieSparkRecord extends HoodieRecord<InternalRow> {
+public class HoodieSparkRecord extends HoodieRecord<InternalRow> implements KryoSerializable {
 
   /**
    * Record copy operation to avoid double copying. InternalRow do not need to copy twice.
    */
   private boolean copy;
 
   /**
-   * We should use this construction method when we read internalRow from file.
-   * The record constructed by this method must be used in iter.
+   * NOTE: {@code HoodieSparkRecord} is holding the schema only in cases when it would have
+   *       to execute {@link UnsafeProjection} so that the {@link InternalRow} it's holding to
+   *       could be projected into {@link UnsafeRow} and be efficiently serialized subsequently
+   *       (by Kryo)
    */
-  public HoodieSparkRecord(InternalRow data) {
+  private final transient StructType schema;

Review Comment:
   I think this is unnecessary. The schema is only used in construction.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org