You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@thrift.apache.org by Bhalchandra Pandit <kp...@pinterest.com.INVALID> on 2021/09/03 17:31:58 UTC

Re: partial deserialization of Thrift

I would like to know if there are any changes that anyone would like to
suggest for the pull request <https://github.com/apache/thrift/pull/2439> that
adds support for partial Thrift deserialization. Is there something I can
do from my side to make progress on this request?

On Sun, Aug 29, 2021 at 6:50 PM Bhalchandra Pandit <kp...@pinterest.com>
wrote:

> Hi Duru and Yuxuan,
> I was encouraged by the initial interest both of you showed in this
> feature. Would you be able to help Jens review the pull request?
>
> That will help me open source sooner a number of other useful features
> built on top of partial deserialization.
>
> For example, we can efficiently support columnar access through Hive or
> Spark when the underlying file stores data as serialized Thrift objects.
> Such access can be more efficient compared to columnar formats such as
> Parquet.
>
> Thanks for your help.
>
> Kumar
>
>
>
> On Wed, Aug 25, 2021 at 8:43 AM Bhalchandra Pandit <kp...@pinterest.com>
> wrote:
>
>> Hi Jens,
>> Thanks for reviewing. For the others who may want to review, I have added
>> README.md file that provides technical details on the overall design and
>> includes description of the major implementation components.
>>
>> Kumar
>>
>> On Wed, Aug 25, 2021 at 12:05 AM Jens Geyer <je...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> I added myself already as a reviewer, but it certainly would help if
>>> some
>>> other (especially Java) people could also review.
>>>
>>> Stay tuned, will do my part soon.
>>>
>>> Have fun,
>>> JensG
>>>
>>> -----Ursprüngliche Nachricht-----
>>> From: Bhalchandra Pandit
>>> Sent: Tuesday, August 24, 2021 4:25 AM
>>> To: dev@thrift.apache.org ; Jens Geyer
>>> Subject: Re: partial deserialization of Thrift
>>>
>>> I would like to know if there are any changes that anyone would like to
>>> suggest for the pull request <https://github.com/apache/thrift/pull/2439
>>> >
>>> that adds support for partial Thrift deserialization.
>>>
>>> On Mon, Aug 16, 2021 at 12:37 PM Bhalchandra Pandit <
>>> kpandit@pinterest.com>
>>> wrote:
>>>
>>> > I created a pull request for this change:
>>> > https://github.com/apache/thrift/pull/2439
>>> >
>>> > I have also added a README.md file that provides technical details of
>>> > partial Thrift deserialization so that it is easier to understand the
>>> Java
>>> > implementation. It will be also useful if this support is to be added
>>> for
>>> > other languages.
>>> >
>>> > Please let me know if you have any questions or suggestions for
>>> changes to
>>> > the code.
>>> >
>>> > On Wed, Aug 4, 2021 at 12:10 AM Jens Geyer <je...@apache.org> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> thanks!
>>> >>
>>> >> What I can say from a more general/formal viewpoint, the package
>>> >> namespaces
>>> >> etc must be adjusted (e.g. package com.pinterest.commons.thrift) and
>>> all
>>> >> files you plan to contribute need to have the appropriate ASF license
>>> >> header.
>>> >>
>>> >> What would be really nice, especially when we look forward to
>>> integrate
>>> >> it
>>> >> into other languages, adding some technical
>>> documentation/specification
>>> >> to
>>> >> make it easier to implement and have some common, documented view on
>>> the
>>> >> feature.
>>> >>
>>> >> Last not least having appropriate test coverage would be great.
>>> >>
>>> >> Anything else beyond that ... whatever makes sense.
>>> >>
>>> >> Have fun,
>>> >> JensG
>>> >>
>>> >>
>>> >> -----Ursprüngliche Nachricht-----
>>> >> From: Bhalchandra Pandit
>>> >> Sent: Wednesday, August 4, 2021 3:33 AM
>>> >> To: dev@thrift.apache.org
>>> >> Subject: Re: partial deserialization of Thrift
>>> >>
>>> >> I have added basic set of files at the location below. Currently they
>>> are
>>> >> unchanged from their Pinterest version. I would like to get some
>>> guidance
>>> >> on the type of changes I should make. All of them have many tests
>>> though
>>> >> they are not included in this initial skeleton for now.
>>> >>
>>> >>
>>> >>
>>> https://github.com/bhalchandrap/thrift/tree/sample/lib/java/src/org/apache/thrift/partial
>>> >>
>>> >> On Wed, Jul 28, 2021 at 4:59 PM Bhalchandra Pandit <
>>> kpandit@pinterest.com
>>> >> >
>>> >> wrote:
>>> >>
>>> >> > I opened the following jira epic for this contribution:
>>> >> > https://issues.apache.org/jira/browse/THRIFT-5443
>>> >> >
>>> >> > On Mon, Jul 12, 2021 at 11:58 PM Duru Can Celasun <
>>> dcelasun@apache.org>
>>> >> > wrote:
>>> >> >
>>> >> >> On Tue, 13 Jul 2021, at 01:07, Bhalchandra Pandit wrote:
>>> >> >> > Thanks for showing interest and for raising valid questions.
>>> >> >> >
>>> >> >> > I do not currently have a forked branch to show how it is done.
>>> >> >> However, I
>>> >> >> > can certainly make it happen. It will take me a couple of weeks
>>> or
>>> >> >> > so
>>> >> >> to do
>>> >> >> > it. I will need to get myself familiarized with the contribution
>>> >> >> process.
>>> >> >>
>>> >> >> The short version is that you need to open a ticket on Jira [1] and
>>> >> then
>>> >> >> send a PR against the master branch on Github [2] with the title
>>> >> >> "THRIFT-NNNN: Your title". Everything else can be figured out
>>> later.
>>> >> >>
>>> >> >> [1] https://issues.apache.org/jira/projects/THRIFT/issues
>>> >> >>
>>> >> >> [2] https://github.com/apache/thrift/
>>> >> >>
>>> >> >> >
>>> >> >> > To answer other questions:
>>> >> >> >
>>> >> >> >    1. The current implementation is in Java. However, it can
>>> also be
>>> >> >> easily
>>> >> >> >    ported to other languages. Happy to help in that direction as
>>> >> well.
>>> >> >> >    2. The solution does not modify the definition of any
>>> structure
>>> >> (eg,
>>> >> >> >    Payload => SlimPayload). The output instance has the same type
>>> >> >> regardless
>>> >> >> >    of whether we use full deserialization or partial
>>> >> >> > deserialization.
>>> >> >> In that
>>> >> >> >    sense, it is 100% compatible in both directions (partial <=>
>>> >> full).
>>> >> >> The
>>> >> >> >    solution enables selectively deserializing any arbitrary
>>> nested
>>> >> >> subset.
>>> >> >> >
>>> >> >> > Kumar
>>> >> >> >
>>> >> >> > On Mon, Jul 12, 2021 at 4:10 PM Duru Can Celasun <
>>> >> dcelasun@apache.org>
>>> >> >> > wrote:
>>> >> >> >
>>> >> >> > > This is definitely an interesting problem at scale and it'd be
>>> >> great
>>> >> >> to
>>> >> >> > > have a solution upstream.
>>> >> >> > >
>>> >> >> > > I second Yuxuan's questions. From the blog post it seems you
>>> have
>>> >> an
>>> >> >> > > implementation for Java, but it would be great to have at
>>> least
>>> >> >> > > one
>>> >> >> more.
>>> >> >> > >
>>> >> >> > > On Tue, 13 Jul 2021, at 00:00, Yuxuan Wang wrote:
>>> >> >> > > >  Hi Bhalchandra,
>>> >> >> > > >
>>> >> >> > > > Do you have any open source code (e.g. forked thrift) to
>>> show
>>> >> >> > > > how
>>> >> >> you
>>> >> >> > > > implemented it? The blog post stated what's the problem but
>>> >> really
>>> >> >> lack
>>> >> >> > > > information on the solution side:
>>> >> >> > > >
>>> >> >> > > > 1. How did you solve the problem?
>>> >> >> > > > 2. In which language(s) did you implement the solution?
>>> >> >> > > >
>>> >> >> > > > Without that information it's really not much we can do here.
>>> >> >> > > >
>>> >> >> > > > Also just based on the problem you described, a simple
>>> solution
>>> >> >> would be
>>> >> >> > > > just to duplicate the struct definition and remove the
>>> fields
>>> >> >> > > > you
>>> >> >> don't
>>> >> >> > > > need, for example:
>>> >> >> > > >
>>> >> >> > > > // Original struct
>>> >> >> > > > struct Payload {
>>> >> >> > > >   1: optional Type1 field1,
>>> >> >> > > >   2: optional Type2 field2,
>>> >> >> > > >   3: optional Type3 field3,
>>> >> >> > > >   ...
>>> >> >> > > > }
>>> >> >> > > >
>>> >> >> > > > // Slim struct
>>> >> >> > > > struct SlimPayload {
>>> >> >> > > >   1: optional Type1 field1,
>>> >> >> > > >   // field2 removed because we don't care about it in this
>>> use
>>> >> case
>>> >> >> > > >   3: optional SlimType3 field3,
>>> >> >> > > >   ...
>>> >> >> > > > }
>>> >> >> > > >
>>> >> >> > > > But of course it's hard to keep SlimPayload in sync with the
>>> >> >> original
>>> >> >> > > > Payload so I can see there are some values to have some
>>> helpers
>>> >> to
>>> >> >> help
>>> >> >> > > > that, but as long as you don't make breaking changes into
>>> >> >> > > > Payload
>>> >> >> "keep
>>> >> >> > > > them in sync" is a false problem.
>>> >> >> > > >
>>> >> >> > > > On Mon, Jul 12, 2021 at 12:56 PM Bhalchandra Pandit
>>> >> >> > > > <kp...@pinterest.com.invalid> wrote:
>>> >> >> > > >
>>> >> >> > > > > Hi All,
>>> >> >> > > > > I work for Pinterest. I developed a technique for partial
>>> >> >> > > deserialization
>>> >> >> > > > > of Thrift that has been very useful in significantly
>>> improving
>>> >> >> > > efficiency
>>> >> >> > > > > of the data processing at Pinterest. I would like to
>>> >> >> > > > > contribute
>>> >> >> that
>>> >> >> > > > > feature to Apache Thrift. More details on this technique
>>> are
>>> >> >> available
>>> >> >> > > in
>>> >> >> > > > > this blog I recently wrote:
>>> >> >> > > > >
>>> >> >> > > > >
>>> >> >> > >
>>> >> >>
>>> >>
>>> https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
>>> >> >> > > > >
>>> >> >> > > > > I would like to know if any of you are interested in
>>> helping
>>> >> with
>>> >> >> > > > > contributing this work to the main branch.
>>> >> >> > > > >
>>> >> >> > > > > Kumar
>>> >> >> > > > >
>>> >> >> > > >
>>> >> >> > >
>>> >> >> >
>>> >> >>
>>> >> >
>>> >>
>>> >>
>>>
>>>

Re: partial deserialization of Thrift

Posted by Jens Geyer <je...@hotmail.com>.
Hi,

I intentionally left the PR open because – as I wrote earlier – I did not run any tests on it, primarily because Java is not my “bread and butter” language so although my expertise is not exactly null, it is certainly limited. That being said, I am the first to give another +1 on it as soon as someone else from the Java professionals is happy with it.

So if there is anyone having any opinion about the patch he/she wants to share – this is the time.

Have fun,
JensG






From: Bhalchandra Pandit 
Sent: Friday, September 3, 2021 7:31 PM
To: dev@thrift.apache.org ; Jens Geyer ; Duru Can Celasun ; yuxuan.wang@reddit.com 
Subject: Re: partial deserialization of Thrift

I would like to know if there are any changes that anyone would like to suggest for the pull request that adds support for partial Thrift deserialization. Is there something I can do from my side to make progress on this request?


On Sun, Aug 29, 2021 at 6:50 PM Bhalchandra Pandit <kp...@pinterest.com> wrote:

  Hi Duru and Yuxuan,
  I was encouraged by the initial interest both of you showed in this feature. Would you be able to help Jens review the pull request?

  That will help me open source sooner a number of other useful features built on top of partial deserialization.

  For example, we can efficiently support columnar access through Hive or Spark when the underlying file stores data as serialized Thrift objects. Such access can be more efficient compared to columnar formats such as Parquet.

  Thanks for your help.

  Kumar



  On Wed, Aug 25, 2021 at 8:43 AM Bhalchandra Pandit <kp...@pinterest.com> wrote:

    Hi Jens,
    Thanks for reviewing. For the others who may want to review, I have added README.md file that provides technical details on the overall design and includes description of the major implementation components.

    Kumar

    On Wed, Aug 25, 2021 at 12:05 AM Jens Geyer <je...@apache.org> wrote:

      Hi,

      I added myself already as a reviewer, but it certainly would help if some 
      other (especially Java) people could also review.

      Stay tuned, will do my part soon.

      Have fun,
      JensG

      -----Ursprüngliche Nachricht----- 
      From: Bhalchandra Pandit
      Sent: Tuesday, August 24, 2021 4:25 AM
      To: dev@thrift.apache.org ; Jens Geyer
      Subject: Re: partial deserialization of Thrift

      I would like to know if there are any changes that anyone would like to
      suggest for the pull request <https://github.com/apache/thrift/pull/2439>
      that adds support for partial Thrift deserialization.

      On Mon, Aug 16, 2021 at 12:37 PM Bhalchandra Pandit <kp...@pinterest.com>
      wrote:

      > I created a pull request for this change:
      > https://github.com/apache/thrift/pull/2439
      >
      > I have also added a README.md file that provides technical details of
      > partial Thrift deserialization so that it is easier to understand the Java
      > implementation. It will be also useful if this support is to be added for
      > other languages.
      >
      > Please let me know if you have any questions or suggestions for changes to
      > the code.
      >
      > On Wed, Aug 4, 2021 at 12:10 AM Jens Geyer <je...@apache.org> wrote:
      >
      >> Hi,
      >>
      >> thanks!
      >>
      >> What I can say from a more general/formal viewpoint, the package
      >> namespaces
      >> etc must be adjusted (e.g. package com.pinterest.commons.thrift) and all
      >> files you plan to contribute need to have the appropriate ASF license
      >> header.
      >>
      >> What would be really nice, especially when we look forward to integrate
      >> it
      >> into other languages, adding some technical documentation/specification
      >> to
      >> make it easier to implement and have some common, documented view on the
      >> feature.
      >>
      >> Last not least having appropriate test coverage would be great.
      >>
      >> Anything else beyond that ... whatever makes sense.
      >>
      >> Have fun,
      >> JensG
      >>
      >>
      >> -----Ursprüngliche Nachricht-----
      >> From: Bhalchandra Pandit
      >> Sent: Wednesday, August 4, 2021 3:33 AM
      >> To: dev@thrift.apache.org
      >> Subject: Re: partial deserialization of Thrift
      >>
      >> I have added basic set of files at the location below. Currently they are
      >> unchanged from their Pinterest version. I would like to get some guidance
      >> on the type of changes I should make. All of them have many tests though
      >> they are not included in this initial skeleton for now.
      >>
      >>
      >> https://github.com/bhalchandrap/thrift/tree/sample/lib/java/src/org/apache/thrift/partial
      >>
      >> On Wed, Jul 28, 2021 at 4:59 PM Bhalchandra Pandit <kpandit@pinterest.com
      >> >
      >> wrote:
      >>
      >> > I opened the following jira epic for this contribution:
      >> > https://issues.apache.org/jira/browse/THRIFT-5443
      >> >
      >> > On Mon, Jul 12, 2021 at 11:58 PM Duru Can Celasun <dc...@apache.org>
      >> > wrote:
      >> >
      >> >> On Tue, 13 Jul 2021, at 01:07, Bhalchandra Pandit wrote:
      >> >> > Thanks for showing interest and for raising valid questions.
      >> >> >
      >> >> > I do not currently have a forked branch to show how it is done.
      >> >> However, I
      >> >> > can certainly make it happen. It will take me a couple of weeks or 
      >> >> > so
      >> >> to do
      >> >> > it. I will need to get myself familiarized with the contribution
      >> >> process.
      >> >>
      >> >> The short version is that you need to open a ticket on Jira [1] and
      >> then
      >> >> send a PR against the master branch on Github [2] with the title
      >> >> "THRIFT-NNNN: Your title". Everything else can be figured out later.
      >> >>
      >> >> [1] https://issues.apache.org/jira/projects/THRIFT/issues
      >> >>
      >> >> [2] https://github.com/apache/thrift/
      >> >>
      >> >> >
      >> >> > To answer other questions:
      >> >> >
      >> >> >    1. The current implementation is in Java. However, it can also be
      >> >> easily
      >> >> >    ported to other languages. Happy to help in that direction as
      >> well.
      >> >> >    2. The solution does not modify the definition of any structure
      >> (eg,
      >> >> >    Payload => SlimPayload). The output instance has the same type
      >> >> regardless
      >> >> >    of whether we use full deserialization or partial 
      >> >> > deserialization.
      >> >> In that
      >> >> >    sense, it is 100% compatible in both directions (partial <=>
      >> full).
      >> >> The
      >> >> >    solution enables selectively deserializing any arbitrary nested
      >> >> subset.
      >> >> >
      >> >> > Kumar
      >> >> >
      >> >> > On Mon, Jul 12, 2021 at 4:10 PM Duru Can Celasun <
      >> dcelasun@apache.org>
      >> >> > wrote:
      >> >> >
      >> >> > > This is definitely an interesting problem at scale and it'd be
      >> great
      >> >> to
      >> >> > > have a solution upstream.
      >> >> > >
      >> >> > > I second Yuxuan's questions. From the blog post it seems you have
      >> an
      >> >> > > implementation for Java, but it would be great to have at least 
      >> >> > > one
      >> >> more.
      >> >> > >
      >> >> > > On Tue, 13 Jul 2021, at 00:00, Yuxuan Wang wrote:
      >> >> > > >  Hi Bhalchandra,
      >> >> > > >
      >> >> > > > Do you have any open source code (e.g. forked thrift) to show 
      >> >> > > > how
      >> >> you
      >> >> > > > implemented it? The blog post stated what's the problem but
      >> really
      >> >> lack
      >> >> > > > information on the solution side:
      >> >> > > >
      >> >> > > > 1. How did you solve the problem?
      >> >> > > > 2. In which language(s) did you implement the solution?
      >> >> > > >
      >> >> > > > Without that information it's really not much we can do here.
      >> >> > > >
      >> >> > > > Also just based on the problem you described, a simple solution
      >> >> would be
      >> >> > > > just to duplicate the struct definition and remove the fields 
      >> >> > > > you
      >> >> don't
      >> >> > > > need, for example:
      >> >> > > >
      >> >> > > > // Original struct
      >> >> > > > struct Payload {
      >> >> > > >   1: optional Type1 field1,
      >> >> > > >   2: optional Type2 field2,
      >> >> > > >   3: optional Type3 field3,
      >> >> > > >   ...
      >> >> > > > }
      >> >> > > >
      >> >> > > > // Slim struct
      >> >> > > > struct SlimPayload {
      >> >> > > >   1: optional Type1 field1,
      >> >> > > >   // field2 removed because we don't care about it in this use
      >> case
      >> >> > > >   3: optional SlimType3 field3,
      >> >> > > >   ...
      >> >> > > > }
      >> >> > > >
      >> >> > > > But of course it's hard to keep SlimPayload in sync with the
      >> >> original
      >> >> > > > Payload so I can see there are some values to have some helpers
      >> to
      >> >> help
      >> >> > > > that, but as long as you don't make breaking changes into 
      >> >> > > > Payload
      >> >> "keep
      >> >> > > > them in sync" is a false problem.
      >> >> > > >
      >> >> > > > On Mon, Jul 12, 2021 at 12:56 PM Bhalchandra Pandit
      >> >> > > > <kp...@pinterest.com.invalid> wrote:
      >> >> > > >
      >> >> > > > > Hi All,
      >> >> > > > > I work for Pinterest. I developed a technique for partial
      >> >> > > deserialization
      >> >> > > > > of Thrift that has been very useful in significantly improving
      >> >> > > efficiency
      >> >> > > > > of the data processing at Pinterest. I would like to 
      >> >> > > > > contribute
      >> >> that
      >> >> > > > > feature to Apache Thrift. More details on this technique are
      >> >> available
      >> >> > > in
      >> >> > > > > this blog I recently wrote:
      >> >> > > > >
      >> >> > > > >
      >> >> > >
      >> >>
      >> https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
      >> >> > > > >
      >> >> > > > > I would like to know if any of you are interested in helping
      >> with
      >> >> > > > > contributing this work to the main branch.
      >> >> > > > >
      >> >> > > > > Kumar
      >> >> > > > >
      >> >> > > >
      >> >> > >
      >> >> >
      >> >>
      >> >
      >>
      >>