Posted to issues@arrow.apache.org by "adamreeve (via GitHub)" <gi...@apache.org> on 2024/03/13 23:25:33 UTC

[I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

adamreeve opened a new issue, #40517:
URL: https://github.com/apache/arrow/issues/40517

   ### Describe the enhancement requested
   
   Currently the C# `ArrowStreamWriter` always writes all data in a buffer. This behaviour differs from the Python/C++ implementation, which writes only the relevant slice of each buffer when an array has a nonzero offset or a size in bytes smaller than the buffer length. This can be observed by comparing the sizes of IPC files written for a whole RecordBatch with the sizes of files written for slices of the data.
   
   Python:
   ```python
   import pyarrow as pa
   import numpy as np
   
   
   num_rows = 400
   rows_per_batch = 100
   
   ints = pa.array(np.arange(0, num_rows, 1, dtype=np.int32))
   floats = pa.array(np.arange(0, num_rows / 10.0, 0.1, dtype=np.float32))
   
   all_data = pa.RecordBatch.from_arrays([ints, floats], names=["a", "b"])
   
   # Serialize the full batch to an in-memory IPC stream
   sink = pa.BufferOutputStream()
   with pa.ipc.new_stream(sink, all_data.schema) as writer:
       writer.write_batch(all_data)
   buf = sink.getvalue()
   print(f"Size of serialized full batch = {buf.size}")
   
   # Serialize each slice of rows_per_batch rows separately
   for offset in range(0, num_rows, rows_per_batch):
       slice = all_data.slice(offset, rows_per_batch)
       sink = pa.BufferOutputStream()
       with pa.ipc.new_stream(sink, slice.schema) as writer:
           writer.write_batch(slice)
       buf = sink.getvalue()
       print(f"Size of serialized slice at offset {offset} = {buf.size}")
   ```
   
   This outputs:
   ```
   Size of serialized full batch = 3576
   Size of serialized slice at offset 0 = 1176
   Size of serialized slice at offset 100 = 1176
   Size of serialized slice at offset 200 = 1176
   Size of serialized slice at offset 300 = 1176
   ```
   
   After accounting for the metadata overhead, each serialized slice is roughly a quarter of the size of the full batch.
   
   Doing the same in C#:
   ```C#
   using System;
   using System.IO;
   using System.Linq;
   using Apache.Arrow;
   using Apache.Arrow.Ipc;

   const int numRows = 400;
   const int rowsPerBatch = 100;
   
   var allData = new RecordBatch.Builder()
       .Append("a", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(0, numRows))))
       .Append("b", false, col => col.Float(array => array.AppendRange(Enumerable.Range(0, numRows).Select(i => 0.1f * i))))
       .Build();
   
   // Write the full batch to an in-memory stream and report its size
   {
       using var ms = new MemoryStream();
       using var writer = new ArrowFileWriter(ms, allData.Schema, false, new IpcOptions());
       await writer.WriteStartAsync();
       await writer.WriteRecordBatchAsync(allData);
       await writer.WriteEndAsync();
   
       Console.WriteLine($"Size of serialized full batch = {ms.Length}");
   }
   
   // Write each slice of rowsPerBatch rows to its own stream and compare sizes
   for (var offset = 0; offset < allData.Length; offset += rowsPerBatch)
   {
       var arraySlices = allData.Arrays
           .Select(arr => ArrowArrayFactory.Slice(arr, offset, rowsPerBatch))
           .ToArray();
       var slice = new RecordBatch(allData.Schema, arraySlices, arraySlices[0].Length);
   
       using var ms = new MemoryStream();
       using var writer = new ArrowFileWriter(ms, slice.Schema, false, new IpcOptions());
       await writer.WriteStartAsync();
       await writer.WriteRecordBatchAsync(slice);
       await writer.WriteEndAsync();
   
       Console.WriteLine($"Size of serialized slice at offset {offset} = {ms.Length}");
   }
   ```
   
   This outputs:
   ```
   Size of serialized full batch = 3802
   Size of serialized slice at offset 0 = 3802
   Size of serialized slice at offset 100 = 3802
   Size of serialized slice at offset 200 = 3802
   Size of serialized slice at offset 300 = 3802
   ```
   
   Writing a slice of the data results in the same file size as writing the full data, but we'd like to be able to break IPC data into smaller slices in order to send it over a transport that has a message size limit.
   
   From a quick look at the C++ implementation, one complication is dealing with null bitmaps, which may need to be copied so that the start of the slice is aligned with a byte boundary.
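   
   For illustration, a rough sketch of that re-alignment (not the C++ or C# implementation; the names here are just illustrative) could copy bits [offset, offset + length) of a validity bitmap into a new buffer so that the slice starts at bit 0:
   
   ```C#
   using System;

   static class BitmapSliceExample
   {
       // Copy bits [offset, offset + length) of an Arrow validity bitmap
       // (least-significant-bit ordering) into a fresh buffer whose first bit
       // corresponds to the first element of the slice.
       public static byte[] CopyBitmapSlice(ReadOnlySpan<byte> source, int offset, int length)
       {
           var result = new byte[(length + 7) / 8];
           for (var i = 0; i < length; i++)
           {
               var sourceBit = offset + i;
               if ((source[sourceBit / 8] & (1 << (sourceBit % 8))) != 0)
               {
                   result[i / 8] |= (byte)(1 << (i % 8));
               }
           }
           return result;
       }
   }
   ```
   
   A real implementation could copy whole bytes when the offset happens to be a multiple of 8; the bit-by-bit version above just shows why a copy is needed at all.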
   
   ### Component(s)
   
   C#



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-1996323178

   Your example works if I add a quick hack e.g.
   
   ```
   +            if (nullCount == RecalculateNullCount)
   +            {
   +                NullCount = CalculateNullCount();
   +            }
   ```
   and
   
   ```
   +        public int CalculateNullCount()
   +        {
   +            switch (DataType)
   +            {
   +                case FixedWidthType fixedWidthType:
   +                    ArrowBuffer nullBuffer = Buffers[0];
   +                    return nullBuffer.IsEmpty ? 0 : BitUtility.CountBits(nullBuffer.Span);
   +                default:
   +                    // TODO:
   +                    throw new NotImplementedException();
   +            }
   +        }
   ```
   
   I can finish implementing this over the coming weekend, or someone else could take it on.



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-2048407839

   take



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-1996308518

   There's code which suggests a null count of -1 was intended to mean "RecalculateNullCount", but apparently there's nothing that actually does the recalculation.
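   
   For illustration, a minimal sketch of that sentinel pattern (not the actual Apache.Arrow API; `GetNullCount` and `CountSetBits` are stand-ins), where -1 triggers a lazy recalculation from the validity bitmap:
   
   ```C#
   using System;

   static class NullCountExample
   {
       private const int RecalculateNullCount = -1;

       // If the stored value is the -1 sentinel, recompute the null count from
       // the validity bitmap: a set bit means "valid", so the null count is the
       // logical length minus the number of set bits.
       public static int GetNullCount(int storedNullCount, ReadOnlySpan<byte> validityBitmap, int length)
       {
           if (storedNullCount != RecalculateNullCount)
           {
               return storedNullCount;
           }
           if (validityBitmap.IsEmpty)
           {
               // No validity buffer means there are no nulls.
               return 0;
           }
           return length - CountSetBits(validityBitmap, length);
       }

       private static int CountSetBits(ReadOnlySpan<byte> bitmap, int length)
       {
           var count = 0;
           for (var i = 0; i < length; i++)
           {
               if ((bitmap[i / 8] & (1 << (i % 8))) != 0)
               {
                   count++;
               }
           }
           return count;
       }
   }
   ```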



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-1996171032

   I initially thought that the problem was just that the files were bigger than they needed to be, and that the array offsets would be correctly round-tripped. But on further investigation, it looks like the IPC files written from sliced arrays are invalid and can't be read back by either the C# or the Python library.
   
   Reading these sliced batches back from C# fails with:
   ```
   System.IO.InvalidDataException
   Null count length must be >= 0
      at Apache.Arrow.Ipc.ArrowReaderImplementation.LoadField(MetadataVersion version, RecordBatchEnumerator& recordBatchEnumerator, Field field, FieldNode& fieldNode, ByteBuffer bodyData, IBufferCreator bufferCreator)
   ```
   
   and Python displays the arrays as invalid:
   ```
   <Invalid array: Buffer #0 too small in array of type int32 and length 100: expected at least 13 byte(s), got 0>
   ```



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher closed issue #40517: [C#] IPC stream writer should write slices of buffers when writing sliced arrays
URL: https://github.com/apache/arrow/issues/40517



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-2056839624

   Issue resolved by pull request 41197
   https://github.com/apache/arrow/pull/41197



Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "adamreeve (via GitHub)" <gi...@apache.org>.
adamreeve commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-1996330269

   When you say it works, I guess you mean that the data can be round-tripped and read correctly (at least from .NET), but the files for a slice are still as large as writing the unsliced data?
   
   I should be able to work on fixing this eventually, but I have other more urgent work at the moment so am not sure when I will get around to this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C#] IPC stream writer should write slices of buffers when writing sliced arrays [arrow]

Posted by "CurtHagenlocher (via GitHub)" <gi...@apache.org>.
CurtHagenlocher commented on issue #40517:
URL: https://github.com/apache/arrow/issues/40517#issuecomment-1997510164

   Yeah, it was late and I only had time for a quick look. This avoids the exception but ends up producing the wrong results. The entire ArrowBuffer is being serialized instead of just the part needed for the slice, and while the length is recorded correctly, there's (of course) no offset being serialized, so the data read back starts at the beginning of the buffer for each slice -- meaning it's the wrong data.
   
   The root of this problem is that ArrowStreamWriter.ArrowRecordBatchFlatBufferBuilder is serializing entire buffers instead of just the parts of the buffer associated with the slice. I suspect this will not be a trivial fix.
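   
   As a rough sketch of the kind of calculation involved (illustrative only, not the actual ArrowStreamWriter code): for a fixed-width array, the writer only needs the byte range of the value buffer that covers the sliced elements, something like:
   
   ```C#
   static class BufferSliceExample
   {
       // Byte range of a fixed-width value buffer covering elements
       // [arrayOffset, arrayOffset + arrayLength) of a sliced array.
       public static (int Start, int Length) GetFixedWidthSliceRange(int arrayOffset, int arrayLength, int bitWidth)
       {
           var byteWidth = bitWidth / 8;
           return (Start: arrayOffset * byteWidth, Length: arrayLength * byteWidth);
       }
   }
   ```
   
   For the Int32 column in the example above (offset 100, length 100), that would be 400 bytes starting at byte 400 rather than the full 1600-byte buffer; validity bitmaps would still need the copy-and-realign treatment mentioned earlier when the offset isn't a multiple of 8.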

