Posted to dev@thrift.apache.org by "Bryan Duxbury (JIRA)" <ji...@apache.org> on 2009/04/09 02:02:12 UTC

[jira] Created: (THRIFT-446) Partial deserialization

Partial deserialization
-----------------------

                 Key: THRIFT-446
                 URL: https://issues.apache.org/jira/browse/THRIFT-446
             Project: Thrift
          Issue Type: New Feature
            Reporter: Bryan Duxbury
            Priority: Minor


There are some use cases where you might have a fair amount of serialized data coming to you, but you're only interested in specific fields from that data. The way it works now, you are stuck paying the price to deserialize that extra data no matter what. 

In the simplest approach, it would be nice if you could specify some sort of mask to use to suppress the deserialization of a given set of fields. A slightly more complex approach would be to not just skip the deserialization, but to actually hang on to the bytes that you didn't deserialize, so that when it came time to re-serialize, you could just rewrite the original data. 

Obviously this would imply some interesting interactions with the validation system and probably nontrivial changes elsewhere (isset, protocol interface, etc). However, it could yield a big performance benefit in specific applications.
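
To make the two approaches above concrete, here is a minimal Python sketch. It is not Thrift code: the wire format is a toy one loosely modeled on the binary protocol (an assumption), and encode, read_masked and write_back are hypothetical names. read_masked decodes only the requested field ids and, with keep_raw=True, holds on to the undecoded bytes of every other field so that write_back can emit them again untouched.

```python
import struct

# Toy wire format (an assumption for illustration, not a Thrift API):
# each field is <type: 1 byte><id: 2 bytes> followed by its value,
# and a struct ends with a single STOP byte.
STOP, I32, STRING = 0, 8, 11


def encode(fields):
    """Encode {field_id: int | str} into the toy format."""
    out = bytearray()
    for fid, value in fields.items():
        if isinstance(value, int):
            out += struct.pack(">bhi", I32, fid, value)
        else:
            data = value.encode("utf-8")
            out += struct.pack(">bhi", STRING, fid, len(data)) + data
    out.append(STOP)
    return bytes(out)


def value_size(buf, pos, ftype):
    """Return the encoded size of the value that starts at pos."""
    if ftype == I32:
        return 4
    if ftype == STRING:
        (length,) = struct.unpack_from(">i", buf, pos)
        return 4 + length
    raise ValueError(f"unsupported type {ftype}")


def read_masked(buf, wanted, keep_raw=False):
    """Decode only the field ids in `wanted`.

    Returns (decoded, raw). With keep_raw=True, `raw` maps each skipped
    field id to its undecoded bytes, header included.
    """
    decoded, raw, pos = {}, {}, 0
    while buf[pos] != STOP:
        ftype, fid = struct.unpack_from(">bh", buf, pos)
        start, pos = pos, pos + 3
        size = value_size(buf, pos, ftype)
        if fid in wanted:
            if ftype == I32:
                (decoded[fid],) = struct.unpack_from(">i", buf, pos)
            else:
                decoded[fid] = buf[pos + 4:pos + size].decode("utf-8")
        elif keep_raw:
            raw[fid] = buf[start:pos + size]
        pos += size
    return decoded, raw


def write_back(decoded, raw):
    """Re-serialize decoded fields, then append the saved bytes verbatim."""
    return encode(decoded)[:-1] + b"".join(raw.values()) + bytes([STOP])
```

Even in this toy version the skipped string still costs a length read, and the saved bytes are only meaningful if they are written back in the same encoding they were read in.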

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (THRIFT-446) Partial deserialization

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury updated THRIFT-446:
---------------------------------

    Attachment: thrift-446-v5.patch

Here's your patch with some modifications. There was a bug in the while loop that was incorrectly reading Struct Begin for each field in a struct, which was causing problems when using the compact protocol. However, now the JSON protocol is encountering an issue.

I've enhanced the test so that all the protocols are tried with every test case.



[jira] Assigned: (THRIFT-446) Partial deserialization

Posted by "Mohammad Shahangian (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohammad Shahangian reassigned THRIFT-446:
------------------------------------------

    Assignee: Mohammad Shahangian



[jira] Updated: (THRIFT-446) Partial deserialization

Posted by "Mohammad Shahangian (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohammad Shahangian updated THRIFT-446:
---------------------------------------

    Attachment: Thrift-446-v4.patch

Worked with Bryan to implement deserialization of a specific structure within a Thrift object. This makes deserialization more efficient by skipping expensive deserializations (e.g. strings) and by terminating inspection of fields once the object of interest is found. It follows the same pattern as the read function, iterating through the fields, but skips all fields except those included in the FieldIdPath. Fields that are in the FieldIdPath are "recursively" stepped into until the path is exhausted, at which point read is called on that field.

The method was added to TDeserializer. Tests are in PartialDeserializeTest.
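
The path-following idea can be sketched in Python as below. This is not the code from the patch: the toy wire format is an assumption loosely modeled on the binary protocol, and encode, skip, read_value and partial_deserialize are hypothetical helpers. The loop skips every field that is not the next id on the path, steps into nested structs along the path, and fully decodes only the field at the end of the path.

```python
import struct

# Toy wire format (assumed for illustration): <type: 1 byte><id: 2 bytes>
# then the value; a struct is a run of fields ended by one STOP byte.
STOP, I32, STRING, STRUCT = 0, 8, 11, 12


def encode(fields):
    """Encode {field_id: int | str | dict} into the toy format."""
    out = bytearray()
    for fid, value in fields.items():
        if isinstance(value, dict):
            out += struct.pack(">bh", STRUCT, fid) + encode(value)
        elif isinstance(value, int):
            out += struct.pack(">bhi", I32, fid, value)
        else:
            data = value.encode("utf-8")
            out += struct.pack(">bhi", STRING, fid, len(data)) + data
    out.append(STOP)
    return bytes(out)


def skip(buf, pos, ftype):
    """Return the position just past the value of type ftype at pos."""
    if ftype == I32:
        return pos + 4
    if ftype == STRING:
        (length,) = struct.unpack_from(">i", buf, pos)
        return pos + 4 + length
    if ftype == STRUCT:
        while buf[pos] != STOP:
            (inner,) = struct.unpack_from(">b", buf, pos)
            pos = skip(buf, pos + 3, inner)
        return pos + 1
    raise ValueError(f"unsupported type {ftype}")


def read_value(buf, pos, ftype):
    """Fully decode the value at pos; return (value, next_pos)."""
    if ftype == I32:
        return struct.unpack_from(">i", buf, pos)[0], pos + 4
    if ftype == STRING:
        end = skip(buf, pos, STRING)
        return buf[pos + 4:end].decode("utf-8"), end
    if ftype != STRUCT:
        raise ValueError(f"unsupported type {ftype}")
    fields = {}
    while buf[pos] != STOP:
        inner, fid = struct.unpack_from(">bh", buf, pos)
        fields[fid], pos = read_value(buf, pos + 3, inner)
    return fields, pos + 1


def partial_deserialize(buf, path):
    """Follow a list of field ids into nested structs; decode only the last."""
    pos = 0
    for depth, wanted in enumerate(path):
        while True:
            if buf[pos] == STOP:
                return None  # this struct has no field with the wanted id
            ftype, fid = struct.unpack_from(">bh", buf, pos)
            pos += 3
            if fid == wanted:
                break
            pos = skip(buf, pos, ftype)  # not on the path: skip, don't decode
        if depth == len(path) - 1:
            return read_value(buf, pos, ftype)[0]
        if ftype != STRUCT:
            return None  # cannot step into a non-struct field
        # pos now points at the first field of the nested struct
    return None
```

For example, with data = encode({1: "x" * 1000, 2: {1: 5, 2: {3: "target", 4: 9}}, 3: 7}), calling partial_deserialize(data, [2, 2, 3]) steps over the large string in field 1 without decoding it and never looks at anything after the target.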



[jira] Updated: (THRIFT-446) Partial deserialization

Posted by "Mohammad Shahangian (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohammad Shahangian updated THRIFT-446:
---------------------------------------

    Attachment: Thrift-446-v6.patch

This patch handles the bugs mentioned by Bryan. In some edge cases, readStructBegin was being called both in the partialDeserialize method and in the read method. To deal with this, we moved readStructBegin out of the loop for the initial outer struct and now call it each time the path index is incremented. Additionally, we had to make sure not to call readStructBegin on the last step of the path, because it will be called by the read function.
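
As a rough illustration of why an extra struct-begin call matters, here is a toy Python model. It assumes a protocol that stores each field id as a delta from the previous id in the same struct and resets that base on struct begin, which is my understanding of how the compact protocol behaves. DeltaReader and the two helper functions are hypothetical and are not taken from the patch.

```python
class DeltaReader:
    """Toy reader: each field id is a delta from the previous id in the
    same struct (an assumption modeled on the compact protocol)."""

    def __init__(self):
        self.last_id = 0
        self.saved = []

    def read_struct_begin(self):
        # Entering a struct saves the outer id and restarts deltas at zero.
        self.saved.append(self.last_id)
        self.last_id = 0

    def read_struct_end(self):
        # Leaving a struct restores the id base of the enclosing struct.
        self.last_id = self.saved.pop()

    def read_field_id(self, delta):
        self.last_id += delta
        return self.last_id


def ids_begin_once(deltas):
    """Correct: one struct-begin per struct, before the field loop."""
    reader = DeltaReader()
    reader.read_struct_begin()
    ids = [reader.read_field_id(d) for d in deltas]
    reader.read_struct_end()
    return ids


def ids_begin_per_field(deltas):
    """Buggy: struct-begin inside the loop resets the delta base each time."""
    reader = DeltaReader()
    ids = []
    for d in deltas:
        reader.read_struct_begin()
        ids.append(reader.read_field_id(d))
    return ids
```

With deltas [1, 1, 3], the first function recovers ids 1, 2 and 5, while the second reports 1, 1 and 3. A protocol that keeps no state in struct begin would hide the mistake, which may be why it only showed up with some protocols.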



[jira] Assigned: (THRIFT-446) Partial deserialization

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Duxbury reassigned THRIFT-446:
------------------------------------

    Assignee:     (was: Mohammad Shahangian)



[jira] Commented: (THRIFT-446) Partial deserialization

Posted by "David Reiss (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/THRIFT-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760763#action_12760763 ] 

David Reiss commented on THRIFT-446:
------------------------------------

Seems sensible.  I can think of a few complications.  First, you might need the mask to be recursive, since you might want to filter sub-objects.  Second, you would have to be sure that you used the same protocol for writing that you did for reading if you want to just forward the serialized data on.  I don't think that requires a new protocol interface, though; something like proto.getTransport().write(saved_data) should work.  isset becomes a tri-state: set, unset, set-but-serialized.  Third, skipping over nontrivial data requires a nontrivial amount of CPU.  Less than full deserialization, obviously, but nontrivial.
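
One possible shape for the tri-state isset, sketched in Python. FieldState and LazyField are hypothetical names, the transport is just a file-like object, and the protocol is represented by a plain string, so this is only an illustration of the idea and not a proposed Thrift API. A field in the set-but-serialized state writes its saved bytes straight to the transport, and refuses to do so if the output protocol differs from the one the bytes were read with.

```python
import io
from enum import Enum


class FieldState(Enum):
    UNSET = 0
    SET = 1
    SET_BUT_SERIALIZED = 2


class LazyField:
    """Holds either a decoded value or the raw bytes it was skipped as."""

    def __init__(self):
        self.state = FieldState.UNSET
        self.value = None
        self.raw = None
        self.raw_protocol = None

    def set_value(self, value):
        self.state, self.value, self.raw = FieldState.SET, value, None

    def set_raw(self, raw, protocol):
        # Remember which protocol produced the bytes; they are only
        # valid when written back with that same protocol.
        self.state = FieldState.SET_BUT_SERIALIZED
        self.raw, self.raw_protocol = raw, protocol

    def write(self, transport, protocol, encode):
        if self.state is FieldState.UNSET:
            return
        if self.state is FieldState.SET:
            transport.write(encode(self.value))
            return
        if protocol != self.raw_protocol:
            raise ValueError("saved bytes were read with a different protocol")
        transport.write(self.raw)  # forward the saved bytes untouched


# Usage: a skipped field is forwarded as-is on the same protocol.
out = io.BytesIO()
field = LazyField()
field.set_raw(b"\x0b\x00\x01", "binary")
field.write(out, "binary", str.encode)
```

The mismatch case here raises an error; an alternative would be to fall back to decoding the saved bytes and re-encoding them, at the cost of the work the mask was meant to avoid.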
