You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@thrift.apache.org by GitBox <gi...@apache.org> on 2021/08/16 19:32:48 UTC

[GitHub] [thrift] bhalchandrap opened a new pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

bhalchandrap opened a new pull request #2439:
URL: https://github.com/apache/thrift/pull/2439


   Client: java
   
   <!-- Explain the changes in the pull request below: -->
   Adds support for partial Thrift deserialization in Java.
   More information is available at these locations:
   blog: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
   in source: lib/java/src/org/apache/thrift/partial/README.md  
   
   <!-- We recommend you review the checklist/tips before submitting a pull request. -->
   
   - [ ] Did you create an [Apache Jira](https://issues.apache.org/jira/projects/THRIFT/issues/) ticket?  (not required for trivial changes)
   - [ ] If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
   - [ ] Did you squash your changes to a single commit?  (not required, but preferred)
   - [ ] Did you do your best to avoid breaking changes?  If one was needed, did you label the Jira ticket with "Breaking-Change"?
   - [ ] If your change does not involve any code, include `[skip ci]` anywhere in the commit message to free up build resources.
   
   <!--
     The Contributing Guide at:
     https://github.com/apache/thrift/blob/master/CONTRIBUTING.md
     has more details and tips for committing properly.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703055792



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftCompactProtocol.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.TException;
+import org.apache.thrift.protocol.TCompactProtocol;
+import org.apache.thrift.protocol.TField;
+import org.apache.thrift.protocol.TProtocol;
+
+import java.io.Serializable;
+
+/**
+ * Enables partial deserialization of compact-encoded thrift objects.
+ *
+ * This class is meant to be a helper class for {@link PartialThriftDeserializer}.
+ * It cannot be used separately on its own.
+ */
+public class PartialThriftCompactProtocol extends PartialThriftProtocol implements Serializable {
+
+  public PartialThriftCompactProtocol() {
+  }
+
+  @Override
+  protected TProtocol createProtocol() {
+    return new TCompactProtocol(transport);
+  }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  @Override
+  public int readFieldBeginData() throws TException {
+    // Having to call readFieldBegin() to compute TFieldData really results in lower
+    // performance. However, readFieldBegin() accesses some private vars that this method
+    // does not have access to. We could make it more performant when contributing to
+    // origianl source code.

Review comment:
       added a TODO comment for now. That way we can better separate original contribution from subsequent performance improvement changes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-917089186


   @fishy @Jens-G do you have additional comments? do you know a Java dev who can review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r726694222



##########
File path: lib/java/src/org/apache/thrift/protocol/TProtocol.java
##########
@@ -180,4 +181,145 @@ public void reset() {}
   public Class<? extends IScheme> getScheme() {
     return StandardScheme.class;
   }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  public int readFieldBeginData() throws TException {
+    // Derived classes should provide a more efficient version of this
+    // method if allowed by the encoding used by that protocol.
+    TField tfield = this.readFieldBegin();
+    return TFieldData.encode(tfield.type, tfield.id);
+  }
+
+  public void skip(byte fieldType) throws TException {
+    this.skip(fieldType, Integer.MAX_VALUE);
+  }
+
+  public void skip(byte fieldType, int maxDepth) throws TException {
+    if (maxDepth <= 0) {
+      throw new TException("Maximum skip depth exceeded");
+    }
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.skipBool();
+        break;
+
+      case TType.BYTE:
+        this.skipByte();
+        break;
+
+      case TType.I16:
+        this.skipI16();
+        break;
+
+      case TType.I32:
+        this.skipI32();
+        break;
+
+      case TType.I64:
+        this.skipI64();
+        break;
+
+      case TType.DOUBLE:
+        this.skipDouble();
+        break;
+
+      case TType.STRING:
+        this.skipBinary();
+        break;
+
+      case TType.STRUCT:
+        this.readStructBegin();
+        while (true) {
+          int tfieldData = this.readFieldBeginData();
+          byte tfieldType = TFieldData.getType(tfieldData);
+          if (tfieldType == TType.STOP) {
+            break;
+          }
+          this.skip(tfieldType, maxDepth - 1);
+          this.readFieldEnd();
+        }
+        this.readStructEnd();
+        break;
+
+      case TType.MAP:
+        TMap map = this.readMapBegin();
+        for (int i = 0; i < map.size; i++) {
+          this.skip(map.keyType, maxDepth - 1);
+          this.skip(map.valueType, maxDepth - 1);
+        }
+        this.readMapEnd();
+        break;
+
+      case TType.SET:
+        TSet set = this.readSetBegin();
+        for (int i = 0; i < set.size; i++) {
+          this.skip(set.elemType, maxDepth - 1);
+        }
+        this.readSetEnd();
+        break;
+
+      case TType.LIST:
+        TList list = this.readListBegin();
+        for (int i = 0; i < list.size; i++) {
+          this.skip(list.elemType, maxDepth - 1);
+        }
+        this.readListEnd();
+        break;
+
+      default:
+        throw new TProtocolException(
+            TProtocolException.INVALID_DATA, "Unrecognized type " + fieldType);
+    }
+  }
+
+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }

Review comment:
       Good catch! I did not know about additional protocols. I updated default implementation of all skip methods to call read and moved specific byte skipping to binary protocol. It'd be great if someone who understands the other protocols better provide more efficient skip implementation for them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703058445



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.

Review comment:
       Yes, it is very useful in certain cases (see earlier comment). ThriftStructProcessor enables direct deserialization into TBase.
   
   What are your thoughts on providing a simpler wrapper over ThriftStructProcessor and PartialThriftDeserializer that directly supports deserializing into TBase? That way, not every language would need to expose the ability to deserialize into non-TBase.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r706631147



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {

Review comment:
       I'm still not convinced that this API need to be so dramatically different from the existing TDeserializer API (https://www.javadoc.io/doc/org.apache.thrift/libthrift/latest/org/apache/thrift/TDeserializer.html), can you try to match that as much as possible? The current TDeserializer API in Java already provided `partiallyDeserialize*` functions.
   
   I'm really worried about how this API will map to other languages, even if it makes sense for the Java case.

##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java
+
+This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.

Review comment:
       The current `TProtocolUtil.skip` requires a `TProtocol` arg (https://www.javadoc.io/static/org.apache.thrift/libthrift/0.11.0/org/apache/thrift/protocol/TProtocolUtil.html), so you can totally then delegate to the concrete TProtocol to decide how to skip that field in the most efficient way. (for example, in Go we already have `Skip` function defined in the `TProtocol` interface, so it can be different from `read*`. we can probably do something similar in java.)

##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    this.protocol.reset(bytes, offset, length);
+    return this.deserializeStruct(this.protocol, this.metadata);
+  }
+
+  private Object deserialize(
+      PartialThriftProtocol tprot,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    Object value;
+    byte fieldType = data.data.valueMetaData.type;
+    switch (fieldType) {
+      case TType.STRUCT:
+        return this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+
+      case TType.LIST:
+        return this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+
+      case TType.MAP:
+        return this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+
+      case TType.SET:
+        return this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+
+      case TType.ENUM:
+        return this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+
+      case TType.BOOL:
+        return tprot.readBool();
+
+      case TType.BYTE:
+        return tprot.readByte();
+
+      case TType.I16:
+        return tprot.readI16();
+
+      case TType.I32:
+        return tprot.readI32();
+
+      case TType.I64:
+        return tprot.readI64();
+
+      case TType.DOUBLE:
+        return tprot.readDouble();
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          return this.processor.prepareBinary(tprot.readBinary());
+        } else {
+          return this.processor.prepareString(tprot.readBinary());
+        }
+
+      default:
+        throw unsupportedFieldTypeException(fieldType);
+    }
+  }
+
+  private Object deserializeStruct(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+
+    if (data.fields.size() == 0) {
+      return this.fullDeserialize(tprot, data);
+    }
+
+    Object instance = this.processor.createNewStruct(data);
+    tprot.readStructBegin();
+    while (true) {
+      int tfieldData = tprot.readFieldBeginData();
+      byte tfieldType = TFieldData.getType(tfieldData);
+      if (tfieldType == TType.STOP) {
+        break;
+      }
+
+      Integer id = (int) TFieldData.getId(tfieldData);
+      ThriftMetadata.ThriftObject field = (ThriftMetadata.ThriftObject) data.fields.get(id);
+
+      if (field != null) {
+        this.deserializeStructField(tprot, instance, field.fieldId, field);
+      } else {
+        tprot.skip(tfieldType);
+      }
+      tprot.readFieldEnd();
+    }
+    tprot.readStructEnd();
+
+    return this.processor.prepareStruct(instance);
+  }
+
+  private void deserializeStructField(
+      PartialThriftProtocol tprot,
+      Object instance,
+      TFieldIdEnum fieldId,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    byte fieldType = data.data.valueMetaData.type;
+    Object value;
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.processor.setBool(instance, fieldId, tprot.readBool());
+        break;
+
+      case TType.BYTE:
+        this.processor.setByte(instance, fieldId, tprot.readByte());
+        break;
+
+      case TType.I16:
+        this.processor.setInt16(instance, fieldId, tprot.readI16());
+        break;
+
+      case TType.I32:
+        this.processor.setInt32(instance, fieldId, tprot.readI32());
+        break;
+
+      case TType.I64:
+        this.processor.setInt64(instance, fieldId, tprot.readI64());
+        break;
+
+      case TType.DOUBLE:
+        this.processor.setDouble(instance, fieldId, tprot.readDouble());
+        break;
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          this.processor.setBinary(instance, fieldId, tprot.readBinary());
+        } else {
+          this.processor.setString(instance, fieldId, tprot.readBinary());
+        }
+        break;
+
+      case TType.STRUCT:
+        value = this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+        this.processor.setStructField(instance, fieldId, value);
+        break;
+
+      case TType.LIST:
+        value = this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+        this.processor.setListField(instance, fieldId, value);
+        break;
+
+      case TType.MAP:
+        value = this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+        this.processor.setMapField(instance, fieldId, value);
+        break;
+
+      case TType.SET:
+        value = this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+        this.processor.setSetField(instance, fieldId, value);
+        break;
+
+      case TType.ENUM:
+        value = this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+        this.processor.setEnumField(instance, fieldId, value);
+        break;
+
+      default:
+        throw new RuntimeException("Unsupported field type: " + fieldId.toString());
+    }
+  }
+
+  private Object deserializeList(PartialThriftProtocol tprot, ThriftMetadata.ThriftList data)
+        throws TException {
+
+    TList tlist = tprot.readListBegin();
+    Object instance = this.processor.createNewList(tlist.size);
+    for (int i = 0; i < tlist.size; i++) {
+      Object value = this.deserialize(tprot, data.elementData);
+      this.processor.setListElement(instance, i, value);
+    }
+    tprot.readListEnd();
+    return this.processor.prepareList(instance);
+  }
+
+  private Object deserializeMap(PartialThriftProtocol tprot, ThriftMetadata.ThriftMap data)
+      throws TException {
+    TMap tmap = tprot.readMapBegin();
+    Object instance = this.processor.createNewMap(tmap.size);
+    for (int i = 0; i < tmap.size; i++) {
+      Object key = this.deserialize(tprot, data.keyData);
+      Object val = this.deserialize(tprot, data.valueData);
+      this.processor.setMapElement(instance, i, key, val);
+    }
+    tprot.readMapEnd();
+    return this.processor.prepareMap(instance);
+  }
+
+  private Object deserializeSet(PartialThriftProtocol tprot, ThriftMetadata.ThriftSet data)
+      throws TException {
+    TSet tset = tprot.readSetBegin();
+    Object instance = this.processor.createNewSet(tset.size);
+    for (int i = 0; i < tset.size; i++) {
+      Object eltValue = this.deserialize(tprot, data.elementData);
+      this.processor.setSetElement(instance, i, eltValue);
+    }
+    tprot.readSetEnd();
+    return this.processor.prepareSet(instance);
+  }
+
+  private Object deserializeEnum(PartialThriftProtocol tprot, ThriftMetadata.ThriftEnum data)
+      throws TException {
+    int ordinal = tprot.readI32();
+    Class<? extends TEnum> enumClass = ((EnumMetaData) data.data.valueMetaData).enumClass;
+    return this.processor.prepareEnum(enumClass, ordinal);
+  }
+
+  private TBase fullDeserialize(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+    Validate.checkState(
+        data.fields.size() == 0, "Cannot fully deserialize when some fields specified");

Review comment:
       did you? the error message looks still the same as before?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-918752362


   @bhalchandrap for skip: yes please make `TProtocolUtil.skip` choose the most efficient way to skip a field in a TProtocol, instead of adding a new API for "efficient skip" and keep `TProtocolUtil.skip` as the "slow skip".
   
   For partial deserialization, I don't know which makes more sense in Java API. If you can totally use the existing API to implement it that would be awesome (I imagine you at least need to add new constructors to `TDeserializer` to take the field maps you want to partially deserialize). If you can't, I think it's fine to add new APIs for it as long as the new API design makes sense, but the new API should be more similar to the existing `TDeserializer` API, unless we have a good reason to make it quite different (e.g. the current TDeserializer API is bad/inefficient/etc. because of X Y Z).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r723768281



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -353,4 +418,265 @@ private TField locateField(byte[] bytes, TFieldIdEnum fieldIdPathFirst, TFieldId
   public void fromString(TBase base, String data) throws TException {
     deserialize(base, data.getBytes());
   }
+
+  // ----------------------------------------------------------------------
+  // Methods related to partial deserialization.
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object partialDeserializeObject(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    ensurePartialDeserializationMode();

Review comment:
       sorry that was an oversight. The method should have been called partialDeserializeObject() like its other overload. fixed. That should address both #1 and #2 points above. The name clarifies two intents:
   1. it performs partial deserialization
   2. it returns an Object

##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -61,6 +78,54 @@ public TDeserializer(TProtocolFactory protocolFactory) throws TTransportExceptio
     protocol_ = protocolFactory.getProtocol(trans_);
   }
 
+  /**
+   * Construct a new TDeserializer that supports partial deserialization
+   * that outputs instances of type controlled by the given {@code processor}.
+   *
+   * @param thriftClass a TBase derived class.
+   * @param fieldNames list of fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocolFactory the Factory to create a protocol.
+   */
+  public TDeserializer(

Review comment:
       Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind,  if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization. 
   
   If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.

##########
File path: lib/java/src/org/apache/thrift/protocol/TProtocol.java
##########
@@ -180,4 +181,145 @@ public void reset() {}
   public Class<? extends IScheme> getScheme() {
     return StandardScheme.class;
   }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  public int readFieldBeginData() throws TException {
+    // Derived classes should provide a more efficient version of this
+    // method if allowed by the encoding used by that protocol.
+    TField tfield = this.readFieldBegin();
+    return TFieldData.encode(tfield.type, tfield.id);
+  }
+
+  public void skip(byte fieldType) throws TException {
+    this.skip(fieldType, Integer.MAX_VALUE);
+  }
+
+  public void skip(byte fieldType, int maxDepth) throws TException {
+    if (maxDepth <= 0) {
+      throw new TException("Maximum skip depth exceeded");
+    }
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.skipBool();
+        break;
+
+      case TType.BYTE:
+        this.skipByte();
+        break;
+
+      case TType.I16:
+        this.skipI16();
+        break;
+
+      case TType.I32:
+        this.skipI32();
+        break;
+
+      case TType.I64:
+        this.skipI64();
+        break;
+
+      case TType.DOUBLE:
+        this.skipDouble();
+        break;
+
+      case TType.STRING:
+        this.skipBinary();
+        break;
+
+      case TType.STRUCT:
+        this.readStructBegin();
+        while (true) {
+          int tfieldData = this.readFieldBeginData();
+          byte tfieldType = TFieldData.getType(tfieldData);
+          if (tfieldType == TType.STOP) {
+            break;
+          }
+          this.skip(tfieldType, maxDepth - 1);
+          this.readFieldEnd();
+        }
+        this.readStructEnd();
+        break;
+
+      case TType.MAP:
+        TMap map = this.readMapBegin();
+        for (int i = 0; i < map.size; i++) {
+          this.skip(map.keyType, maxDepth - 1);
+          this.skip(map.valueType, maxDepth - 1);
+        }
+        this.readMapEnd();
+        break;
+
+      case TType.SET:
+        TSet set = this.readSetBegin();
+        for (int i = 0; i < set.size; i++) {
+          this.skip(set.elemType, maxDepth - 1);
+        }
+        this.readSetEnd();
+        break;
+
+      case TType.LIST:
+        TList list = this.readListBegin();
+        for (int i = 0; i < list.size; i++) {
+          this.skip(list.elemType, maxDepth - 1);
+        }
+        this.readListEnd();
+        break;
+
+      default:
+        throw new TProtocolException(
+            TProtocolException.INVALID_DATA, "Unrecognized type " + fieldType);
+    }
+  }
+
+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }

Review comment:
       I had other aspect in mind. That is, have the derived protocols override when they do not do what the base does. If I call read() for each skip() then each of the derived protocols would want to override all of the methods.
   
   Happy to do what you suggest. Can you please confirm?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r706958817



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {

Review comment:
       Hi @fishy, 
   Happy to make changes if I get clarity on the concerns you have and on the type of changes you are looking for. 
   
   Based on your comments I believe you prefer adding functionality to existing interfaces instead of adding new ones (please confirm since this is my guess). If so, how does the following work for you?
   
   1. move common skip*() methods from PartialThriftProtocol.java to TProtocol.java.
   2. move encoding specific skip*() methods from PartialThrift*Protocol.java to T*Protocol (for example, from PartialThriftBinaryProtocol.java to TBinaryProtocol)
   3. get rid of PartialThriftDeserializer.java and add its methods to TDeserializer
   
   Would something like the above address your concerns?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r730206049



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -82,12 +147,16 @@ public void deserialize(TBase base, byte[] bytes) throws TException {
    * @throws TException if an error is encountered during deserialization.
    */
   public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException {

Review comment:
       we probably need to update the javadoc of this function accordingly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r723769384



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -61,6 +78,54 @@ public TDeserializer(TProtocolFactory protocolFactory) throws TTransportExceptio
     protocol_ = protocolFactory.getProtocol(trans_);
   }
 
+  /**
+   * Construct a new TDeserializer that supports partial deserialization
+   * that outputs instances of type controlled by the given {@code processor}.
+   *
+   * @param thriftClass a TBase derived class.
+   * @param fieldNames list of fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocolFactory the Factory to create a protocol.
+   */
+  public TDeserializer(

Review comment:
       Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind,  if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization. 
   
   If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703058445



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.

Review comment:
       Yes, it is very useful in certain cases (see earlier comment). ThriftStructProcessor enables direct deserialization into TBase.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r706959223



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java
+
+This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.

Review comment:
       got it. basically you are recommending moving skip*() methods directly into T*Protocol. can you please confirm?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-913838677


   addressed review comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-965920962


   @Jens-G can you please let me know if I can do anything from my side to make progress on this pull request?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r725557980



##########
File path: lib/java/src/org/apache/thrift/protocol/TProtocol.java
##########
@@ -180,4 +181,145 @@ public void reset() {}
   public Class<? extends IScheme> getScheme() {
     return StandardScheme.class;
   }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  public int readFieldBeginData() throws TException {
+    // Derived classes should provide a more efficient version of this
+    // method if allowed by the encoding used by that protocol.
+    TField tfield = this.readFieldBegin();
+    return TFieldData.encode(tfield.type, tfield.id);
+  }
+
+  public void skip(byte fieldType) throws TException {
+    this.skip(fieldType, Integer.MAX_VALUE);
+  }
+
+  public void skip(byte fieldType, int maxDepth) throws TException {
+    if (maxDepth <= 0) {
+      throw new TException("Maximum skip depth exceeded");
+    }
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.skipBool();
+        break;
+
+      case TType.BYTE:
+        this.skipByte();
+        break;
+
+      case TType.I16:
+        this.skipI16();
+        break;
+
+      case TType.I32:
+        this.skipI32();
+        break;
+
+      case TType.I64:
+        this.skipI64();
+        break;
+
+      case TType.DOUBLE:
+        this.skipDouble();
+        break;
+
+      case TType.STRING:
+        this.skipBinary();
+        break;
+
+      case TType.STRUCT:
+        this.readStructBegin();
+        while (true) {
+          int tfieldData = this.readFieldBeginData();
+          byte tfieldType = TFieldData.getType(tfieldData);
+          if (tfieldType == TType.STOP) {
+            break;
+          }
+          this.skip(tfieldType, maxDepth - 1);
+          this.readFieldEnd();
+        }
+        this.readStructEnd();
+        break;
+
+      case TType.MAP:
+        TMap map = this.readMapBegin();
+        for (int i = 0; i < map.size; i++) {
+          this.skip(map.keyType, maxDepth - 1);
+          this.skip(map.valueType, maxDepth - 1);
+        }
+        this.readMapEnd();
+        break;
+
+      case TType.SET:
+        TSet set = this.readSetBegin();
+        for (int i = 0; i < set.size; i++) {
+          this.skip(set.elemType, maxDepth - 1);
+        }
+        this.readSetEnd();
+        break;
+
+      case TType.LIST:
+        TList list = this.readListBegin();
+        for (int i = 0; i < list.size; i++) {
+          this.skip(list.elemType, maxDepth - 1);
+        }
+        this.readListEnd();
+        break;
+
+      default:
+        throw new TProtocolException(
+            TProtocolException.INVALID_DATA, "Unrecognized type " + fieldType);
+    }
+  }
+
+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }

Review comment:
       That is fine, but then you need to fix all the `TProtocol` implementations that use different bytes here, which includes `TJSONProtocol` and `TSimpleJSONProtocol`, and you did not "fix" them.
   
   `TProtocol` is not some thing "either binary or compact". there are a lot more of them, and the fixed numbers are only correct for `TBInaryProtocol`. So it really should be the other way around.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] dulong commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
dulong commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r754779898



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java

Review comment:
       If these source files have been removed,the document should be updated。




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-908826428


   Updated README.md file to add clarification on map field path syntax.
   
   I have not done fuzzy testing. However, the code includes in depth testing. In addition, the code is in active use in production jobs at Pinterest. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703057882



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    this.protocol.reset(bytes, offset, length);
+    return this.deserializeStruct(this.protocol, this.metadata);
+  }
+
+  private Object deserialize(
+      PartialThriftProtocol tprot,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    Object value;
+    byte fieldType = data.data.valueMetaData.type;
+    switch (fieldType) {
+      case TType.STRUCT:
+        return this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+
+      case TType.LIST:
+        return this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+
+      case TType.MAP:
+        return this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+
+      case TType.SET:
+        return this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+
+      case TType.ENUM:
+        return this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+
+      case TType.BOOL:
+        return tprot.readBool();
+
+      case TType.BYTE:
+        return tprot.readByte();
+
+      case TType.I16:
+        return tprot.readI16();
+
+      case TType.I32:
+        return tprot.readI32();
+
+      case TType.I64:
+        return tprot.readI64();
+
+      case TType.DOUBLE:
+        return tprot.readDouble();
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          return this.processor.prepareBinary(tprot.readBinary());
+        } else {
+          return this.processor.prepareString(tprot.readBinary());
+        }
+
+      default:
+        throw unsupportedFieldTypeException(fieldType);
+    }
+  }
+
+  private Object deserializeStruct(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+
+    if (data.fields.size() == 0) {
+      return this.fullDeserialize(tprot, data);
+    }
+
+    Object instance = this.processor.createNewStruct(data);
+    tprot.readStructBegin();
+    while (true) {
+      int tfieldData = tprot.readFieldBeginData();
+      byte tfieldType = TFieldData.getType(tfieldData);
+      if (tfieldType == TType.STOP) {
+        break;
+      }
+
+      Integer id = (int) TFieldData.getId(tfieldData);
+      ThriftMetadata.ThriftObject field = (ThriftMetadata.ThriftObject) data.fields.get(id);
+
+      if (field != null) {
+        this.deserializeStructField(tprot, instance, field.fieldId, field);
+      } else {
+        tprot.skip(tfieldType);
+      }
+      tprot.readFieldEnd();
+    }
+    tprot.readStructEnd();
+
+    return this.processor.prepareStruct(instance);
+  }
+
+  private void deserializeStructField(
+      PartialThriftProtocol tprot,
+      Object instance,
+      TFieldIdEnum fieldId,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    byte fieldType = data.data.valueMetaData.type;
+    Object value;
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.processor.setBool(instance, fieldId, tprot.readBool());
+        break;
+
+      case TType.BYTE:
+        this.processor.setByte(instance, fieldId, tprot.readByte());
+        break;
+
+      case TType.I16:
+        this.processor.setInt16(instance, fieldId, tprot.readI16());
+        break;
+
+      case TType.I32:
+        this.processor.setInt32(instance, fieldId, tprot.readI32());
+        break;
+
+      case TType.I64:
+        this.processor.setInt64(instance, fieldId, tprot.readI64());
+        break;
+
+      case TType.DOUBLE:
+        this.processor.setDouble(instance, fieldId, tprot.readDouble());
+        break;
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          this.processor.setBinary(instance, fieldId, tprot.readBinary());
+        } else {
+          this.processor.setString(instance, fieldId, tprot.readBinary());
+        }
+        break;
+
+      case TType.STRUCT:
+        value = this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+        this.processor.setStructField(instance, fieldId, value);
+        break;
+
+      case TType.LIST:
+        value = this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+        this.processor.setListField(instance, fieldId, value);
+        break;
+
+      case TType.MAP:
+        value = this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+        this.processor.setMapField(instance, fieldId, value);
+        break;
+
+      case TType.SET:
+        value = this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+        this.processor.setSetField(instance, fieldId, value);
+        break;
+
+      case TType.ENUM:
+        value = this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+        this.processor.setEnumField(instance, fieldId, value);
+        break;
+
+      default:
+        throw new RuntimeException("Unsupported field type: " + fieldId.toString());
+    }
+  }
+
+  private Object deserializeList(PartialThriftProtocol tprot, ThriftMetadata.ThriftList data)
+        throws TException {
+
+    TList tlist = tprot.readListBegin();
+    Object instance = this.processor.createNewList(tlist.size);
+    for (int i = 0; i < tlist.size; i++) {
+      Object value = this.deserialize(tprot, data.elementData);
+      this.processor.setListElement(instance, i, value);
+    }
+    tprot.readListEnd();
+    return this.processor.prepareList(instance);
+  }
+
+  private Object deserializeMap(PartialThriftProtocol tprot, ThriftMetadata.ThriftMap data)
+      throws TException {
+    TMap tmap = tprot.readMapBegin();
+    Object instance = this.processor.createNewMap(tmap.size);
+    for (int i = 0; i < tmap.size; i++) {
+      Object key = this.deserialize(tprot, data.keyData);
+      Object val = this.deserialize(tprot, data.valueData);
+      this.processor.setMapElement(instance, i, key, val);
+    }
+    tprot.readMapEnd();
+    return this.processor.prepareMap(instance);
+  }
+
+  private Object deserializeSet(PartialThriftProtocol tprot, ThriftMetadata.ThriftSet data)
+      throws TException {
+    TSet tset = tprot.readSetBegin();
+    Object instance = this.processor.createNewSet(tset.size);
+    for (int i = 0; i < tset.size; i++) {
+      Object eltValue = this.deserialize(tprot, data.elementData);
+      this.processor.setSetElement(instance, i, eltValue);
+    }
+    tprot.readSetEnd();
+    return this.processor.prepareSet(instance);
+  }
+
+  private Object deserializeEnum(PartialThriftProtocol tprot, ThriftMetadata.ThriftEnum data)
+      throws TException {
+    int ordinal = tprot.readI32();
+    Class<? extends TEnum> enumClass = ((EnumMetaData) data.data.valueMetaData).enumClass;
+    return this.processor.prepareEnum(enumClass, ordinal);
+  }
+
+  private TBase fullDeserialize(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+    Validate.checkState(
+        data.fields.size() == 0, "Cannot fully deserialize when some fields specified");

Review comment:
       good catch. fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r720696130



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -353,4 +418,265 @@ private TField locateField(byte[] bytes, TFieldIdEnum fieldIdPathFirst, TFieldId
   public void fromString(TBase base, String data) throws TException {
     deserialize(base, data.getBytes());
   }
+
+  // ----------------------------------------------------------------------
+  // Methods related to partial deserialization.
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object partialDeserializeObject(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    ensurePartialDeserializationMode();

Review comment:
       We already have a `deserialize` function with very similar signature:
   ```
   public void deserialize​(TBase base, byte[] bytes, int offset, int length) throws TException
   ```
   adding this function will make it really confusing to users on what to call. my suggestions are:
   1. Maybe rename this one to `deserializeObject` to emphasize that it returns `Object` instead of writes to `TBase`.
   2. Why does this only support partial mode (it throws a runtime exception when the `TDeserializer` was not consutrcted with partial mode), while there's nothing in its name or even javadoc mentioning that? that sounds like a huge footgun. Is there even a reason this should call `ensurePartialDeserializationMode`?

##########
File path: lib/java/src/org/apache/thrift/protocol/TProtocol.java
##########
@@ -180,4 +181,145 @@ public void reset() {}
   public Class<? extends IScheme> getScheme() {
     return StandardScheme.class;
   }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  public int readFieldBeginData() throws TException {
+    // Derived classes should provide a more efficient version of this
+    // method if allowed by the encoding used by that protocol.
+    TField tfield = this.readFieldBegin();
+    return TFieldData.encode(tfield.type, tfield.id);
+  }
+
+  public void skip(byte fieldType) throws TException {
+    this.skip(fieldType, Integer.MAX_VALUE);
+  }
+
+  public void skip(byte fieldType, int maxDepth) throws TException {
+    if (maxDepth <= 0) {
+      throw new TException("Maximum skip depth exceeded");
+    }
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.skipBool();
+        break;
+
+      case TType.BYTE:
+        this.skipByte();
+        break;
+
+      case TType.I16:
+        this.skipI16();
+        break;
+
+      case TType.I32:
+        this.skipI32();
+        break;
+
+      case TType.I64:
+        this.skipI64();
+        break;
+
+      case TType.DOUBLE:
+        this.skipDouble();
+        break;
+
+      case TType.STRING:
+        this.skipBinary();
+        break;
+
+      case TType.STRUCT:
+        this.readStructBegin();
+        while (true) {
+          int tfieldData = this.readFieldBeginData();
+          byte tfieldType = TFieldData.getType(tfieldData);
+          if (tfieldType == TType.STOP) {
+            break;
+          }
+          this.skip(tfieldType, maxDepth - 1);
+          this.readFieldEnd();
+        }
+        this.readStructEnd();
+        break;
+
+      case TType.MAP:
+        TMap map = this.readMapBegin();
+        for (int i = 0; i < map.size; i++) {
+          this.skip(map.keyType, maxDepth - 1);
+          this.skip(map.valueType, maxDepth - 1);
+        }
+        this.readMapEnd();
+        break;
+
+      case TType.SET:
+        TSet set = this.readSetBegin();
+        for (int i = 0; i < set.size; i++) {
+          this.skip(set.elemType, maxDepth - 1);
+        }
+        this.readSetEnd();
+        break;
+
+      case TType.LIST:
+        TList list = this.readListBegin();
+        for (int i = 0; i < list.size; i++) {
+          this.skip(list.elemType, maxDepth - 1);
+        }
+        this.readListEnd();
+        break;
+
+      default:
+        throw new TProtocolException(
+            TProtocolException.INVALID_DATA, "Unrecognized type " + fieldType);
+    }
+  }
+
+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }

Review comment:
       I don't think the default implementations should use hardcoded number of bytes. the default implementations should just call `read*` and discard the result. this guarantees correctness, and leaves individual protocols the room to override the default implementation to improve performance. Using hardcoded number of bytes cannot guarantee the correctness.

##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -61,6 +78,54 @@ public TDeserializer(TProtocolFactory protocolFactory) throws TTransportExceptio
     protocol_ = protocolFactory.getProtocol(trans_);
   }
 
+  /**
+   * Construct a new TDeserializer that supports partial deserialization
+   * that outputs instances of type controlled by the given {@code processor}.
+   *
+   * @param thriftClass a TBase derived class.
+   * @param fieldNames list of fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocolFactory the Factory to create a protocol.
+   */
+  public TDeserializer(

Review comment:
       (maybe it just works, I'm not familiar with the java codebase and this is not obvious to me)
   
   I don't see you changing any of the existing public functions of `TDeserializer`, for example:
   ```
   public void deserialize​(TBase base, byte[] bytes) throws TException
   public void deserialize​(TBase base, byte[] bytes, int offset, int length) throws TException
   public void deserialize​(TBase base, java.lang.String data, java.lang.String charset) throws TException
   ```
   Since you are adding new constructors to `TDeserializer`, you need to ensure that those existing public functions still work as intended. When someone uses this constructor to create a `TDeserializer`, then calls `deserialize​(TBase base, byte[] bytes)`, they would expect that `deserialize` function does partial deserialization, not the full one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r725557753



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -61,6 +78,54 @@ public TDeserializer(TProtocolFactory protocolFactory) throws TTransportExceptio
     protocol_ = protocolFactory.getProtocol(trans_);
   }
 
+  /**
+   * Construct a new TDeserializer that supports partial deserialization
+   * that outputs instances of type controlled by the given {@code processor}.
+   *
+   * @param thriftClass a TBase derived class.
+   * @param fieldNames list of fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocolFactory the Factory to create a protocol.
+   */
+  public TDeserializer(

Review comment:
       I disagree. Imagine this PR is merged and some developer wrote this code:
   
   ```java
   TDeserializer deserializer = new TDeserializer(thriftClass, fieldNames, processor, protocolFactory);
   deserializer.deserialize(base, buf);
   ```
   
   Do you expect it to do full or partial deserialization? I would say that as `deserializer` here is constructed via a constructor that does partial deserialization, I would expect it to only do partial deserializations.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] Jens-G commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
Jens-G commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-968175273


   > This branch cannot be rebased due to conflicts
   
   If there are no objections and the CI is green again I would merge this in the next days. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] Jens-G commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
Jens-G commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-974548436


   Next time please either squash the PR or at least rebase it onto master every now and then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] Jens-G closed pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
Jens-G closed pull request #2439:
URL: https://github.com/apache/thrift/pull/2439


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-951462534


   Hi @Jens-G, @fishy approved the pull request. I had addressed your earlier comments. Do you have more suggestions? I would like to know what I can do from my side to make progress on this pull request.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r755400974



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java

Review comment:
       Sorry that was an oversight. I will update the document asap and send another PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r723770470



##########
File path: lib/java/src/org/apache/thrift/protocol/TProtocol.java
##########
@@ -180,4 +181,145 @@ public void reset() {}
   public Class<? extends IScheme> getScheme() {
     return StandardScheme.class;
   }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  public int readFieldBeginData() throws TException {
+    // Derived classes should provide a more efficient version of this
+    // method if allowed by the encoding used by that protocol.
+    TField tfield = this.readFieldBegin();
+    return TFieldData.encode(tfield.type, tfield.id);
+  }
+
+  public void skip(byte fieldType) throws TException {
+    this.skip(fieldType, Integer.MAX_VALUE);
+  }
+
+  public void skip(byte fieldType, int maxDepth) throws TException {
+    if (maxDepth <= 0) {
+      throw new TException("Maximum skip depth exceeded");
+    }
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.skipBool();
+        break;
+
+      case TType.BYTE:
+        this.skipByte();
+        break;
+
+      case TType.I16:
+        this.skipI16();
+        break;
+
+      case TType.I32:
+        this.skipI32();
+        break;
+
+      case TType.I64:
+        this.skipI64();
+        break;
+
+      case TType.DOUBLE:
+        this.skipDouble();
+        break;
+
+      case TType.STRING:
+        this.skipBinary();
+        break;
+
+      case TType.STRUCT:
+        this.readStructBegin();
+        while (true) {
+          int tfieldData = this.readFieldBeginData();
+          byte tfieldType = TFieldData.getType(tfieldData);
+          if (tfieldType == TType.STOP) {
+            break;
+          }
+          this.skip(tfieldType, maxDepth - 1);
+          this.readFieldEnd();
+        }
+        this.readStructEnd();
+        break;
+
+      case TType.MAP:
+        TMap map = this.readMapBegin();
+        for (int i = 0; i < map.size; i++) {
+          this.skip(map.keyType, maxDepth - 1);
+          this.skip(map.valueType, maxDepth - 1);
+        }
+        this.readMapEnd();
+        break;
+
+      case TType.SET:
+        TSet set = this.readSetBegin();
+        for (int i = 0; i < set.size; i++) {
+          this.skip(set.elemType, maxDepth - 1);
+        }
+        this.readSetEnd();
+        break;
+
+      case TType.LIST:
+        TList list = this.readListBegin();
+        for (int i = 0; i < list.size; i++) {
+          this.skip(list.elemType, maxDepth - 1);
+        }
+        this.readListEnd();
+        break;
+
+      default:
+        throw new TProtocolException(
+            TProtocolException.INVALID_DATA, "Unrecognized type " + fieldType);
+    }
+  }
+
+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }

Review comment:
       I had other aspect in mind. That is, have the derived protocols override when they do not do what the base does. If I call read() for each skip() then each of the derived protocols would want to override all of the methods.
   
   Happy to do what you suggest. Can you please confirm?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703057685



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {

Review comment:
       It returns Object on purpose. See the class header comment as well as the README.md file for more info.
   This arrangement allows non-TBase values to be directly instantiated from serialized Thrift objects. One example of its usefulness is Spark engine reading SequenceFile of Thrift objects. In that case, this arrangement allows SparkThriftRowProcessor (not a part of this change) to directly deserialize a serialized Thrift blob into Spark's InternalRow representation. That enables significant performance boost.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-908826428


   Updated README.md file to add clarification on map field path syntax.
   
   I have not done fuzzy testing. However, the code includes in depth testing. In addition, the code is in active use in production jobs at Pinterest. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-917779310


   @fishy Thanks for reviewing! Can you please provide clarification on the questions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-940598125


   @fishy thanks for the careful code review and the suggestions. I applied your suggestions. Can you please re-review? I will add tests corresponding to the new code once the overall structure looks ok to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] fishy commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
fishy commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r702348413



##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    this.protocol.reset(bytes, offset, length);
+    return this.deserializeStruct(this.protocol, this.metadata);
+  }
+
+  private Object deserialize(
+      PartialThriftProtocol tprot,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    Object value;
+    byte fieldType = data.data.valueMetaData.type;
+    switch (fieldType) {
+      case TType.STRUCT:
+        return this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+
+      case TType.LIST:
+        return this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+
+      case TType.MAP:
+        return this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+
+      case TType.SET:
+        return this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+
+      case TType.ENUM:
+        return this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+
+      case TType.BOOL:
+        return tprot.readBool();
+
+      case TType.BYTE:
+        return tprot.readByte();
+
+      case TType.I16:
+        return tprot.readI16();
+
+      case TType.I32:
+        return tprot.readI32();
+
+      case TType.I64:
+        return tprot.readI64();
+
+      case TType.DOUBLE:
+        return tprot.readDouble();
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          return this.processor.prepareBinary(tprot.readBinary());
+        } else {
+          return this.processor.prepareString(tprot.readBinary());
+        }
+
+      default:
+        throw unsupportedFieldTypeException(fieldType);
+    }
+  }
+
+  private Object deserializeStruct(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+
+    if (data.fields.size() == 0) {
+      return this.fullDeserialize(tprot, data);
+    }
+
+    Object instance = this.processor.createNewStruct(data);
+    tprot.readStructBegin();
+    while (true) {
+      int tfieldData = tprot.readFieldBeginData();
+      byte tfieldType = TFieldData.getType(tfieldData);
+      if (tfieldType == TType.STOP) {
+        break;
+      }
+
+      Integer id = (int) TFieldData.getId(tfieldData);
+      ThriftMetadata.ThriftObject field = (ThriftMetadata.ThriftObject) data.fields.get(id);
+
+      if (field != null) {
+        this.deserializeStructField(tprot, instance, field.fieldId, field);
+      } else {
+        tprot.skip(tfieldType);
+      }
+      tprot.readFieldEnd();
+    }
+    tprot.readStructEnd();
+
+    return this.processor.prepareStruct(instance);
+  }
+
+  private void deserializeStructField(
+      PartialThriftProtocol tprot,
+      Object instance,
+      TFieldIdEnum fieldId,
+      ThriftMetadata.ThriftObject data) throws TException {
+
+    byte fieldType = data.data.valueMetaData.type;
+    Object value;
+
+    switch (fieldType) {
+      case TType.BOOL:
+        this.processor.setBool(instance, fieldId, tprot.readBool());
+        break;
+
+      case TType.BYTE:
+        this.processor.setByte(instance, fieldId, tprot.readByte());
+        break;
+
+      case TType.I16:
+        this.processor.setInt16(instance, fieldId, tprot.readI16());
+        break;
+
+      case TType.I32:
+        this.processor.setInt32(instance, fieldId, tprot.readI32());
+        break;
+
+      case TType.I64:
+        this.processor.setInt64(instance, fieldId, tprot.readI64());
+        break;
+
+      case TType.DOUBLE:
+        this.processor.setDouble(instance, fieldId, tprot.readDouble());
+        break;
+
+      case TType.STRING:
+        if (((ThriftMetadata.ThriftPrimitive) data).isBinary()) {
+          this.processor.setBinary(instance, fieldId, tprot.readBinary());
+        } else {
+          this.processor.setString(instance, fieldId, tprot.readBinary());
+        }
+        break;
+
+      case TType.STRUCT:
+        value = this.deserializeStruct(tprot, (ThriftMetadata.ThriftStruct) data);
+        this.processor.setStructField(instance, fieldId, value);
+        break;
+
+      case TType.LIST:
+        value = this.deserializeList(tprot, (ThriftMetadata.ThriftList) data);
+        this.processor.setListField(instance, fieldId, value);
+        break;
+
+      case TType.MAP:
+        value = this.deserializeMap(tprot, (ThriftMetadata.ThriftMap) data);
+        this.processor.setMapField(instance, fieldId, value);
+        break;
+
+      case TType.SET:
+        value = this.deserializeSet(tprot, (ThriftMetadata.ThriftSet) data);
+        this.processor.setSetField(instance, fieldId, value);
+        break;
+
+      case TType.ENUM:
+        value = this.deserializeEnum(tprot, (ThriftMetadata.ThriftEnum) data);
+        this.processor.setEnumField(instance, fieldId, value);
+        break;
+
+      default:
+        throw new RuntimeException("Unsupported field type: " + fieldId.toString());
+    }
+  }
+
+  private Object deserializeList(PartialThriftProtocol tprot, ThriftMetadata.ThriftList data)
+        throws TException {
+
+    TList tlist = tprot.readListBegin();
+    Object instance = this.processor.createNewList(tlist.size);
+    for (int i = 0; i < tlist.size; i++) {
+      Object value = this.deserialize(tprot, data.elementData);
+      this.processor.setListElement(instance, i, value);
+    }
+    tprot.readListEnd();
+    return this.processor.prepareList(instance);
+  }
+
+  private Object deserializeMap(PartialThriftProtocol tprot, ThriftMetadata.ThriftMap data)
+      throws TException {
+    TMap tmap = tprot.readMapBegin();
+    Object instance = this.processor.createNewMap(tmap.size);
+    for (int i = 0; i < tmap.size; i++) {
+      Object key = this.deserialize(tprot, data.keyData);
+      Object val = this.deserialize(tprot, data.valueData);
+      this.processor.setMapElement(instance, i, key, val);
+    }
+    tprot.readMapEnd();
+    return this.processor.prepareMap(instance);
+  }
+
+  private Object deserializeSet(PartialThriftProtocol tprot, ThriftMetadata.ThriftSet data)
+      throws TException {
+    TSet tset = tprot.readSetBegin();
+    Object instance = this.processor.createNewSet(tset.size);
+    for (int i = 0; i < tset.size; i++) {
+      Object eltValue = this.deserialize(tprot, data.elementData);
+      this.processor.setSetElement(instance, i, eltValue);
+    }
+    tprot.readSetEnd();
+    return this.processor.prepareSet(instance);
+  }
+
+  private Object deserializeEnum(PartialThriftProtocol tprot, ThriftMetadata.ThriftEnum data)
+      throws TException {
+    int ordinal = tprot.readI32();
+    Class<? extends TEnum> enumClass = ((EnumMetaData) data.data.valueMetaData).enumClass;
+    return this.processor.prepareEnum(enumClass, ordinal);
+  }
+
+  private TBase fullDeserialize(PartialThriftProtocol tprot, ThriftMetadata.ThriftStruct data)
+      throws TException {
+    Validate.checkState(
+        data.fields.size() == 0, "Cannot fully deserialize when some fields specified");

Review comment:
       "...some fields *are* specified"

##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
##########
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.partial.Validate;
+
+import org.apache.thrift.TBase;
+import org.apache.thrift.TEnum;
+import org.apache.thrift.TException;
+import org.apache.thrift.TFieldIdEnum;
+import org.apache.thrift.meta_data.EnumMetaData;
+import org.apache.thrift.meta_data.StructMetaData;
+import org.apache.thrift.protocol.TList;
+import org.apache.thrift.protocol.TMap;
+import org.apache.thrift.protocol.TSet;
+import org.apache.thrift.protocol.TType;
+
+/**
+ * Supports partial deserialization of serialized Thrift data.
+ *
+ * The end result of deserialization can be a Thrift object (? extends TBase)
+ * or it could be a different construct. It is achieved by using a
+ * ThriftFieldValueProcessor instance which processes each field value
+ * as it gets deserialized.
+ */
+public class PartialThriftDeserializer<T extends TBase> {
+
+  // Metadata that describes fields to deserialize.
+  private final ThriftMetadata.ThriftStruct metadata;
+
+  // Processor that handles deserialized field values.
+  private final ThriftFieldValueProcessor processor;
+
+  // Partial thrift protocol to use for deserialization.
+  private final PartialThriftProtocol protocol;
+
+  /**
+   * Constructs an instance of PartialThriftDeserializer.
+   *
+   * @param metadata the Metadata that describes fields to deserialize.
+   * @param processor the Processor that handles deserialized field values.
+   * @param protocol the Partial thrift protocol to use for deserialization.
+   */
+  public PartialThriftDeserializer(
+      ThriftMetadata.ThriftStruct metadata,
+      ThriftFieldValueProcessor processor,
+      PartialThriftProtocol protocol) {
+
+    Validate.checkNotNull(metadata, "metadata");
+    Validate.checkNotNull(processor, "processor");
+    Validate.checkNotNull(protocol, "protocol");
+
+    this.metadata = metadata;
+    this.processor = processor;
+    this.protocol = protocol;
+  }
+
+  /**
+   * Gets the Thrift metadata used by this instance.
+   */
+  public ThriftMetadata.ThriftStruct getMetadata() {
+    return this.metadata;
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {

Review comment:
       should this return `TBase`, or take `TBase` as an arg and return `void` instead? that's how the current `TDeserializer` interface is defined.

##########
File path: lib/java/src/org/apache/thrift/partial/PartialThriftCompactProtocol.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.thrift.partial;
+
+import org.apache.thrift.TException;
+import org.apache.thrift.protocol.TCompactProtocol;
+import org.apache.thrift.protocol.TField;
+import org.apache.thrift.protocol.TProtocol;
+
+import java.io.Serializable;
+
+/**
+ * Enables partial deserialization of compact-encoded thrift objects.
+ *
+ * This class is meant to be a helper class for {@link PartialThriftDeserializer}.
+ * It cannot be used separately on its own.
+ */
+public class PartialThriftCompactProtocol extends PartialThriftProtocol implements Serializable {
+
+  public PartialThriftCompactProtocol() {
+  }
+
+  @Override
+  protected TProtocol createProtocol() {
+    return new TCompactProtocol(transport);
+  }
+
+  // -----------------------------------------------------------------
+  // Additional methods to improve performance.
+
+  @Override
+  public int readFieldBeginData() throws TException {
+    // Having to call readFieldBegin() to compute TFieldData really results in lower
+    // performance. However, readFieldBegin() accesses some private vars that this method
+    // does not have access to. We could make it more performant when contributing to
+    // origianl source code.

Review comment:
       you are contributing to the original source code here, so maybe this comment should be updated accordingly?

##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.

Review comment:
       I see this is where you explained the API difference from `TDeserialize`.
   
   Is there a real use case of deserializing a non-`TBase`, though? My main concern is that to map this API to other languages, `Object` would be `interface{}` in go, and we really want to avoid `interface{}` in go as much as possible.

##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java
+
+This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.

Review comment:
       While this is true that the existing `skip` implementations all need to read the field wholly then discard it, they don't necessarily have to be this way. I would prefer to replace the existing `TProtocolUtil.skip` with the more efficient implementation, than creating a new, separated API for them.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-931797936


   @fishy thanks for the earlier suggestions. I applied all of them. Can you please re-review?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r723768281



##########
File path: lib/java/src/org/apache/thrift/TDeserializer.java
##########
@@ -353,4 +418,265 @@ private TField locateField(byte[] bytes, TFieldIdEnum fieldIdPathFirst, TFieldId
   public void fromString(TBase base, String data) throws TException {
     deserialize(base, data.getBytes());
   }
+
+  // ----------------------------------------------------------------------
+  // Methods related to partial deserialization.
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object partialDeserializeObject(byte[] bytes) throws TException {
+    return this.deserialize(bytes, 0, bytes.length);
+  }
+
+  /**
+   * Deserializes the given serialized blob.
+   *
+   * @param bytes the serialized blob.
+   * @param offset the blob is read starting at this offset.
+   * @param length the size of blob read (in number of bytes).
+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    ensurePartialDeserializationMode();

Review comment:
       sorry that was an oversight. The method should have been called partialDeserializeObject() like its other overload. fixed. That should address both #1 and #2 points above. The name clarifies two intents:
   1. it performs partial deserialization
   2. it returns an Object




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on a change in pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on a change in pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#discussion_r703060386



##########
File path: lib/java/src/org/apache/thrift/partial/README.md
##########
@@ -0,0 +1,112 @@
+# Partial Thrift Deserialization
+
+## Overview
+This document describes how partial deserialization of Thrift works. There are two main goals of this documentation:
+1. Make it easier to understand the current Java implementation in this folder.
+1. Be useful in implementing partial deserialization support in additional languages.
+
+This document is divided into two high level areas. The first part explains important concepts relevant to partial deserialization. The second part describes components involved in the Java implementation in this folder.
+
+Moreover, this blog provides some performance numbers and addtional information: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
+
+## Basic Concepts
+
+### Motivation
+
+The main motivation behind implementing this feature is to improve performance when we need to access only a subset of fields in any Thrift object. This situation arises often when big data is stored in Thrift encoded format (for example, SequenceFile with serialized Thrift values). Many data processing jobs may access this data. However, not every job needs to access every field of each object. In such cases, if we have prior knowledge of the fields needed for a given job, we can deserialize only that subset of fields and avoid the cost deserializing the rest of the fields. There are two benefits of this approach: we save cpu cycles by not deserializing unnecessary field and we end up reducing gc pressure. Both of the savings quickly add up when processing billions of instances in a data processing job.
+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
+
+### Defining the subset of fields to deserialize
+
+The subset of fields to deserialize is defined using a list of fully qualified field names. For example, consider the Thrift `struct` definition below:
+
+```Thrift
+struct SmallStruct {
+   1: optional string stringValue;
+   2: optional i16 i16Value;
+}
+
+struct TestStruct {
+   1: optional i16 i16Field;
+   2: optional list<SmallStruct> structList;
+   3: optional set<SmallStruct> structSet;
+   4: optional map<string, SmallStruct> structMap;
+   5: optional SmallStruct structField;
+}
+```
+
+For the Thrift `struct`, each of the following line shows a fully qualified field definition. Partial deserialization uses a non-empty set of such field definitions to identify the subset of fields to deserialize.
+
+```
+- i16Field
+- structList.stringValue
+- structSet.i16Value
+- structMap.stringValue
+- structField.i16Value
+```
+
+Note that the syntax of denoting paths involving map fields do not support a way to define sub-fields of the key type.
+
+For example, the field path `structMap.stringValue` shown above has leaf segment `stringValue` which is a field in map values.
+
+## Components
+
+The process of partial deserialization involves the following major components. We have listed names of the Java file(s) implementing each component for easier mapping to the source code.
+
+### Thrift Metadata
+
+Source files:
+- ThriftField.java
+- ThriftMetadata.java
+
+We saw in the previous section how we can identify the subset of fields to deserialize. As the first step, we need to compile the collection of field definitions into an efficient data structure that we can traverse at runtime. This step is achieved using `ThriftField` and `ThriftMetadata` classes. For example,
+
+```Java
+// First, create a collection of fully qualified field names.
+List<String> fieldNames = Arrays.asList("i16Field", "structField.i16Value");
+
+// Convert the flat collection into an n-ary tree of fields.
+List<ThriftField> fields = ThriftField.fromNames(fieldNames);
+
+// Compile the tree of fields into internally used metadata.
+ThriftMetadata.ThriftStruct metadata =
+    ThriftMetadata.ThriftStruct.fromFields(TestStruct.class, fields);
+```
+
+At this point, we have an efficient internal representation of the fields that need to get deserialized.
+
+### Partial Thrift Protocol
+
+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java
+
+This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.

Review comment:
       That would have been great. However, TProtocolUtil is protocol agnostic and the specifics of certain types of skip differ between compact and binary protocol. That is why it needs to happen at protocol level.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [thrift] bhalchandrap commented on pull request #2439: THRIFT-5443: add support for partial Thrift deserialization

Posted by GitBox <gi...@apache.org>.
bhalchandrap commented on pull request #2439:
URL: https://github.com/apache/thrift/pull/2439#issuecomment-974562162


   Certainly. This was the first time I contributed to OSS. It was a good
   learning experience for me.
   
   On Fri, Nov 19, 2021 at 4:01 PM Jens Geyer ***@***.***> wrote:
   
   > Next time please either squash the PR or at least rebase it onto master
   > every now and then.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/thrift/pull/2439#issuecomment-974548436>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/ATA7ZBX7PVQ4PZXUGM4RCEDUM3QNVANCNFSM5CILR3NQ>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   >
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@thrift.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org