Posted to jira@arrow.apache.org by "Adam Reeve (Jira)" <ji...@apache.org> on 2022/09/07 22:05:00 UTC

[jira] [Commented] (ARROW-16921) [C#] Add decompression support for Record Batches

    [ https://issues.apache.org/jira/browse/ARROW-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601527#comment-17601527 ] 

Adam Reeve commented on ARROW-16921:
------------------------------------

Hi, we're interested in this feature at G-Research and believe compression support is important for use at scale, so we're keen to help out where we can. I agree it would be nice if compression support could be added more automatically, without users needing to implement decoders themselves. An alternative approach that builds on the Type.GetType idea would be to provide a wrapper package for each compression format used in the IPC format (currently only Zstd and LZ4, I believe), with each package providing an implementation of an IDecoder interface defined in the main dotnet Arrow library. So instead of getting the LZ4Stream type with Type.GetType, for example, we could do something like this to work with the IDecoder interface without needing further reflection:
{code:java}
// Look up the decoder type by its assembly-qualified name; the second argument
// (throwOnError: false) makes Type.GetType return null rather than throw when
// the optional Apache.Arrow.Compression.Lz4 package isn't referenced.
var lz4DecoderType = Type.GetType("Apache.Arrow.Compression.Lz4.Lz4Decoder, Apache.Arrow.Compression.Lz4", false);
if (lz4DecoderType != null)
{
    if (Activator.CreateInstance(lz4DecoderType) is IDecoder decoder)
    {
        // use decoder
    }
    else
    {
        throw new InvalidOperationException("Failed to cast Lz4Decoder to IDecoder");
    }
}
{code}
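For illustration, the IDecoder interface referred to above might look roughly like the sketch below. To be clear, this is purely hypothetical: the interface name, namespace and member are my own suggestion, not anything that exists in the Arrow C# library today.
{code:java}
using System.IO;

namespace Apache.Arrow.Compression
{
    // Hypothetical interface the main Apache.Arrow package would define.
    // Wrapper packages (e.g. a hypothetical Apache.Arrow.Compression.Lz4)
    // would each ship a concrete implementation for one compression codec.
    public interface IDecoder
    {
        // Wraps a stream of compressed IPC buffer data and returns a
        // stream that yields the decompressed bytes.
        Stream CreateDecompressionStream(Stream compressed);
    }
}
{code}
The IPC reader could then hold an IDecoder per compression codec found in the message metadata, and the reflection lookup shown above would only run once per codec rather than per buffer.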
Having to maintain these extra wrapper packages would be a bit more work from an operational point of view though. Just adding the compression dependencies directly to the Arrow package would be a lot more straightforward, and given only two compression formats are currently used, is this really a problem?

On a more minor point, would Decompressor be a more precise term than Decoder? At least in the Parquet world, which I'm a bit more familiar with, encodings are a separate concept from compression formats.

> [C#] Add decompression support for Record Batches
> -------------------------------------------------
>
>                 Key: ARROW-16921
>                 URL: https://issues.apache.org/jira/browse/ARROW-16921
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C#
>            Reporter: Rishabh Rana
>            Assignee: Rishabh Rana
>            Priority: Major
>
> The C# implementation does not support reading record batches written by other Arrow implementations when compression is specified in the IPC write options.
> e.g. reading a batch written like this from pyarrow will fail in C#:
> pyarrow.ipc.RecordBatchStreamWriter(sink, schema, options=pyarrow.ipc.IpcWriteOptions(compression="lz4"))
>  
> This issue is to add decompression support (LZ4 and ZSTD) to the C# implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)