Posted to dev@avro.apache.org by Enrico Agnoli <en...@workday.com.INVALID> on 2020/07/07 15:49:23 UTC

Re: Proposal: Add afterSerialization and beforeDeserialization hooks to allow tenanted encryption

Hi all,
Did anyone manage to take a first look at the patch to see whether it is worth working on?
Thanks, Enrico

On 6/22/20, 10:18 AM, "Enrico Agnoli" <en...@workday.com.INVALID> wrote:

    Created https://issues.apache.org/jira/browse/AVRO-2868 and attached a patch to move the discussion forward and hopefully get some suggestions for improvements.
    
    Yes, the nice thing about this approach is that the object itself controls this finalization, allowing custom logic driven by the data itself (in our case, encryption with the key of the tenant that generated that data).
    Approaches that rely on Flink/Beam serialization sit higher in the serialization stack. They require the job writer to be aware of the requirements and are consequently more error-prone. Moreover, they solve encryption only on the job side, without covering data producers and consumers outside the streaming job.
    
    On 6/19/20, 9:49 AM, "Ryan Skraba" <ry...@skraba.com> wrote:
    
        I would like to take a look at your code, a branch would be welcome.  Thanks!
        
        This is an interesting use case because it isn't using Avro
        serialization for file or messaging persistence ... well, not
        strictly!  It's an interesting approach to have the generated record
        _itself_ responsible for ensuring that it is nicely encrypted as it
        passes over a network link or is spilled to disk temporarily, or saved
        in a streaming checkpoint, etc!
        
        I've only ever used Flink in the context of Beam, where you could
        write a custom coder ("EncryptedAvroCoder") for objects in a
        PCollection and override all serialization for that distributed
        collection.  Does something similar exist in pure Flink?
        
        All my best, Ryan
        
        On Thu, Jun 18, 2020 at 2:18 PM Enrico Agnoli
        <en...@workday.com.invalid> wrote:
        >
        > Hi Ryan,
        > Thanks for getting back to me.
        >
        > Yes, the change is for the Java library. As you mention, in other languages it doesn't seem easy to build this as a library that can delegate the way we do in the JVM. It is, however, feasible to deserialize the data in another language, given access to the same encryption libraries, as the structure of the serialized object is known to the developer.
        >
        > I modified the GenericDatumWriter/Reader as there I found the main entry methods:
        > ```
        >   public D read(D reuse, Decoder in) throws IOException
        > ```
        > And
        > ```
        >   public void write(D datum, Encoder out) throws IOException
        > ```
        >
        > I also have a generalized template, used for all our "tenanted" schemas, that extends an abstract class and delegates beforeDeserialization and afterSerialization to it, so as to centralize the code.
        >
        > About customEncode, I didn't go down that route. To tell you the truth, I didn't find much documentation.
        > I did, however, try a couple of other extension points, one of which was logical types. As you can see in the signature
        > ```
        >     public ByteBuffer fromBytes(ByteBuffer value, Schema schema, LogicalType type)
        > ```
        > there we don't have access to the original object where we would have the tenant information needed to retrieve the right token to use to encrypt the data.
        >
        > Would it make sense for me to open a branch to show some code?
        >
        > Best,
        > -Enrico
        >
        > On 6/17/20, 4:39 PM, "Ryan Skraba" <ry...@skraba.com> wrote:
        >
        >     Hi!  I was interested enough to watch the entire video from Flink Forward.
        >
        >     I do think this is a good proposal, and adding hooks to "customize"
        >     the serialized bytes is a pretty neat idea.  The developer can benefit
        >     from learning or using Avro-generated classes and the SDK, and still
        >     using standard serialization underneath the customized logic.
        >
        >     At first glance, this would stay in the Java SDK, right?  I mean, once
        >     you've customized your Avro specific record with its own
        >     serialization layer, there's little hope (without extensive work) for
        >     a different language to expect to be able to read it.  In other words,
        >     you'd never be able to write it to an Avro file and never expect it to
        >     be readable via another programming language or using a generic
        >     model... which is kind of the point!
        >
        >     Is there any use to having these changes in the
        >     GenericDatumWriter/Reader as opposed to the
        >     SpecificDatumWriter/Reader?  Would there ever be an instance where a
        >     generic model of data would delegate serialization?
        >
        >     Do you think that the necessary changes you've made to the specific
        >     data templates could be generalized?  I believe I've already come
        >     across a situation where we've customized the "extends
        >     MySpecificRecordBase" part of the templates -- it could be a
        >     configuration option.  I'm not sure whether passing along the record
        >     context (tenant id) to nested elements is generalizable, but I haven't
        >     thought very hard about it yet.
        >
        >     Have you looked into the `customEncode` parts of generated specific
        >     records?  This or something similar might be a more flexible technique
        >     than the SerializeFinalizationDelegate interface methods.
        >
        >     Thanks for sharing!  Ryan
        >
        >     On Tue, Jun 16, 2020 at 3:02 PM Enrico Agnoli
        >     <en...@workday.com.invalid> wrote:
        >     >
        >     > Hi,
        >     >
        >     > I would like to propose a change to AVRO that allows services to integrate some logic after serialization and before deserialization.
        >     > We use AVRO to support data serialization in our streaming infrastructure, and we decided to extend it so that we can encrypt the data using information available directly on the data itself: its owner.
        >     > The change-set is pretty small and I would like to hear from you if it makes sense to contribute it back to the project.
        >     >
        >     > == The problem is:
        >     > Multi-tenant applications need to encrypt data (with the keys of the owner/tenant that generated that piece of data) every time it is serialized, to avoid commingling different tenants' data. To do so transparently to the application, the ideal place to implement the encryption is in the serialization library (AVRO).
        >     >
        >     > == Proposal:
        >     > We modified the AVRO code to add afterSerialization and beforeDeserialization hooks that can use object-defined values (the tenant/owner of that data) to implement encryption.
        >     > In the code we propose to submit we implemented a new interface: `SerializeFinalizationDelegate.java`
        >     > ```
        >     > public interface SerializeFinalizationDelegate {
        >     >   void afterSerialization(ByteArrayOutputStream serializedData, Encoder finalEncoder);
        >     >   Decoder beforeDeserialization(Decoder dataToDecode);
        >     > }
        >     > ```
        >     > It needs to be implemented by any AVRO-serializable class that wants to define post-serialization or pre-deserialization logic.
        >     > `GenericDatumWriter` and `GenericDatumReader` are modified to delegate to the object implementation of the methods above.
        >     >
        >     > More info can be found at https://www.slideshare.net/FlinkForward/multi-tenanted-streams-workday-enrico-agnoli-leire-fernandez-de-retana-roitegui-workday-185815223 from slide 21
        >     >
        >     >
        >     > What do you think about this proposal? I wanted to first start a discussion, but if it helps I can create a patch or a branch to show the change.
        >     >
        >     > Hope to hear from you,
        >     > -Enrico
        >
        >
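
For readers following the thread, the hook idea from the original proposal can be illustrated with a minimal, self-contained sketch. It is not the code from the patch: it assumes a simplified stand-in interface (FinalizationHooks) that works on raw byte[] instead of Avro's Encoder/Decoder, a hypothetical in-memory key store (TENANT_KEYS), and a one-byte XOR as a placeholder for real encryption. It also assumes the record passed as "reuse" on the read side already knows its tenant, which the thread does not specify. All names below are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class TenantedHooksSketch {

  /** Simplified stand-in for the proposed SerializeFinalizationDelegate:
   *  raw byte[] replaces Avro's Encoder/Decoder (an assumption for brevity). */
  interface FinalizationHooks {
    byte[] afterSerialization(byte[] serialized);
    byte[] beforeDeserialization(byte[] stored);
  }

  /** Hypothetical per-tenant key store; a real system would hold proper keys. */
  static final Map<String, Byte> TENANT_KEYS =
      Map.of("tenant-a", (byte) 0x5A, "tenant-b", (byte) 0x3C);

  /** Placeholder "cipher": XOR with a single byte. NOT real encryption. */
  static byte[] xor(byte[] data, byte key) {
    byte[] out = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
      out[i] = (byte) (data[i] ^ key);
    }
    return out;
  }

  /** A record that knows its owner and uses the owner's key in both hooks. */
  static final class TenantedRecord implements FinalizationHooks {
    final String tenantId;
    String payload;

    TenantedRecord(String tenantId, String payload) {
      this.tenantId = tenantId;
      this.payload = payload;
    }

    @Override
    public byte[] afterSerialization(byte[] serialized) {
      return xor(serialized, TENANT_KEYS.get(tenantId));
    }

    @Override
    public byte[] beforeDeserialization(byte[] stored) {
      return xor(stored, TENANT_KEYS.get(tenantId));
    }
  }

  /** Writer side: serialize as usual, then hand the bytes to the record's hook. */
  static byte[] write(TenantedRecord datum) {
    byte[] plain = datum.payload.getBytes(StandardCharsets.UTF_8);
    return datum.afterSerialization(plain);
  }

  /** Reader side: let the reused record undo the hook, then deserialize as usual. */
  static TenantedRecord read(TenantedRecord reuse, byte[] stored) {
    byte[] plain = reuse.beforeDeserialization(stored);
    reuse.payload = new String(plain, StandardCharsets.UTF_8);
    return reuse;
  }

  public static void main(String[] args) {
    TenantedRecord original = new TenantedRecord("tenant-a", "hello");
    byte[] stored = write(original);
    TenantedRecord restored = read(new TenantedRecord("tenant-a", null), stored);
    System.out.println(restored.payload);
  }
}
```

The point of the sketch is the placement of the logic: the writer and reader stay generic and only delegate, while the record itself decides how its serialized bytes are finalized based on its own tenant field. In the proposal as described in the thread, the delegation happens inside GenericDatumWriter and GenericDatumReader and the hooks receive Avro Encoder/Decoder objects instead of byte arrays.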