Posted to issues@avro.apache.org by "Martin Tzvetanov Grigorov (Jira)" <ji...@apache.org> on 2023/05/04 18:30:00 UTC

[jira] [Resolved] (AVRO-3714) Zero-copy (de)serialization - (de)serialization rewrite takeover?

     [ https://issues.apache.org/jira/browse/AVRO-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martin Tzvetanov Grigorov resolved AVRO-3714.
---------------------------------------------
    Resolution: Won't Fix

Closing as "Won't do".

Good luck!

> Zero-copy (de)serialization - (de)serialization rewrite takeover?
> -----------------------------------------------------------------
>
>                 Key: AVRO-3714
>                 URL: https://issues.apache.org/jira/browse/AVRO-3714
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: rust
>            Reporter: Ten
>            Priority: Major
>
> Soo... I ended up taking up [this invitation|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649163&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649163]. :)
> While using this library I was severely CPU-bound by deserialization performance (something I basically never hit with Rust), so I had a go at fixing [https://github.com/flavray/avro-rs/issues/195].
> I initially started writing in the same repository, but as I went through the existing code I noticed many opportunities for improvement and wanted to try a significantly different design. Because my code was completely separate and independent from the existing code, I ended up splitting it out into a separate repository, and somewhat accidentally ended up with a full-featured deserialization library. (For context, I've been a professional Rust developer for years and I've regularly worked with Serde's internals.)
> This deserialization code achieves >10x performance gains, seems simpler to use while still being as flexible as necessary, and passes all the relevant tests I could find (besides [AVRO-3240|https://github.com/apache/avro/pull/1379#issuecomment-1412608332], intentionally, for the reason explained there).
> It uses this apache-avro library as a dependency for initial schema parsing.
> It could probably be extended to serialization reasonably easily using the same pattern, fixing [the currently pending serialization issue|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103].
>  
> There are a couple of major design differences:
>  * {{Value}} is removed. Deserialization is a one-step process, which is fully serde-integrated, and leverages its zero-copy features. The output struct can now borrow from the source slice.
>  ** Having an intermediate {{Value}} representation appears to be unnecessary in Rust, as the two use-cases for {{Value}} would seem to be:
>  *** Somewhat-known structure of data but still some amount of dynamic processing -> You can deserialize to somewhat-dynamic Rust types, e.g. {{HashMap}}, {{Vec}}...
>  *** Transcoding to a different serialization format (e.g. JSON) with basically zero structural information -> This can still be achieved in a much more performant and idiomatic manner using [serde_transcode|https://crates.io/crates/serde-transcode].
>  ** The {{Value}} representation hurts performance compared to deserializing right away to the correct struct (especially when said representation involves as many allocations as this one does).
>  * Reader schema concept is removed. It appeared to be unnecessary in Rust, as it is a fully statically typed language, and the [deserialization hints|https://serde.rs/impl-deserializer.html] provided by the struct through the Serde framework combined with the writer schema information give all that is necessary to construct the correct types directly, without the need for a separate schema.
>  ** I expect that any code that currently uses a reader schema could work out of the box with this new deserializer without the need to specify a reader schema at all.
>  ** If you need to convert Avro byte streams from one schema to another, this could likely be achieved simply by plugging the deserializer into the serializer through [serde_transcode|https://crates.io/crates/serde-transcode]. Such a serializer would ([unlike the current one|https://issues.apache.org/jira/browse/AVRO-3631?focusedCommentId=17649103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17649103]) combine the types provided by the original struct (or in this case, the deserializer) with the schema variant to remap the values correctly, while remaining allocation-free.
>  * Schema representation is reworked to be a pre-computed self-referential graph structure.
>  ** This is what allows for maximum performance when traversing it during (de)serialization operations.
> It supports any schema/deserialization target type combination I could think of (besides [some todo|https://github.com/Ten0/serde_avro_fast/blob/10e6ad00fd5b0770a60ca09b4487aa00e8868313/src/de/deserializer/mod.rs#L282] left in the code), including advanced union usage with (or without) enums, as well as proper Option support. I would encourage you to test any exotic use-case you have in mind and see if that works.
> It is extensively documented (hosted on [docs.rs|https://docs.rs/serde_avro_fast/latest/serde_avro_fast/], so you can browse it easily, while full source code is [here|https://github.com/Ten0/serde_avro_fast]).
> So now my questions mainly are:
>  * Does it look like this should/could be taken over by apache-avro, replacing the implementation originally written by flavray in avro-rs?
>  * Or should I release it as a separate crate?
>  * Can you think of common use-cases that would be prevented by the design choice of completely removing the Avro {{Value}} and reader schema concepts from a Rust (de)serialization library?
>  * How is [the per-language-releases project|https://lists.apache.org/thread/2rfnszd4dk36jxynpj382b1717gbyv1y] going? ^^ (Wouldn't like it to take months to get a new feature out if I were to add one ;) )
> Thanks,
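
To make the zero-copy point in the quoted description concrete, here is a minimal sketch using only the Rust standard library. It is not the code of the library discussed above and does not use Serde. The {{User}} type and the {{read_long}}, {{read_str}} and {{decode_user}} helpers are hypothetical names introduced for illustration. It assumes a writer schema that is a record with a {{long}} field followed by a {{string}} field, and decodes the Avro binary encoding (zig-zag varint longs, length-prefixed UTF-8 strings) so that the output struct borrows its string directly from the source slice.

```rust
/// Hypothetical record type for illustration: `name` borrows from the input.
#[derive(Debug, PartialEq)]
struct User<'a> {
    id: i64,
    name: &'a str,
}

/// Reads one Avro `long` (variable-length, zig-zag encoded) and advances the slice.
fn read_long(input: &mut &[u8]) -> Option<i64> {
    let mut raw: u64 = 0;
    let mut shift = 0u32;
    loop {
        // Copy the inner slice out so the remainder keeps the input's lifetime.
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        if shift >= 64 {
            return None; // too many continuation bytes for a 64-bit value
        }
        raw |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Undo the zig-zag mapping: 0, 1, 2, 3, ... -> 0, -1, 1, -2, ...
    Some(((raw >> 1) as i64) ^ -((raw & 1) as i64))
}

/// Reads one Avro `string` as a `&str` pointing into the input (no copy).
fn read_str<'a>(input: &mut &'a [u8]) -> Option<&'a str> {
    let len = read_long(input)?;
    let data: &'a [u8] = *input;
    if len < 0 || len as u64 > data.len() as u64 {
        return None;
    }
    let (bytes, rest) = data.split_at(len as usize);
    *input = rest;
    std::str::from_utf8(bytes).ok()
}

/// Decodes the assumed record (long `id`, then string `name`) in one step.
fn decode_user<'a>(mut input: &'a [u8]) -> Option<User<'a>> {
    let id = read_long(&mut input)?;
    let name = read_str(&mut input)?;
    Some(User { id, name })
}

fn main() {
    // Avro binary for id = 42, name = "Ten": zig-zag(42) = 0x54,
    // then the string length zig-zag(3) = 0x06, then the UTF-8 bytes.
    let bytes = [0x54, 0x06, b'T', b'e', b'n'];
    let user = decode_user(&bytes).expect("valid input");
    assert_eq!(user, User { id: 42, name: "Ten" });
    // `name` points into `bytes`: no String was allocated.
    assert_eq!(user.name.as_ptr(), bytes[2..].as_ptr());
    println!("{:?}", user);
}
```

There is no intermediate value tree in this sketch: the bytes are decoded straight into the target struct, and the lifetime {{'a}} ties the struct to the input buffer. In a Serde-based deserializer the same borrowing is expressed through the {{'de}} lifetime.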



--
This message was sent by Atlassian Jira
(v8.20.10#820010)