You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Platob (via GitHub)" <gi...@apache.org> on 2023/04/24 09:44:07 UTC

[GitHub] [arrow] Platob opened a new pull request, #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Platob opened a new pull request, #35299:
URL: https://github.com/apache/arrow/pull/35299

   Make more native Builders with raw bit / byte access memory
   
   Buffer
   Data
   Array


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186646427


##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   During its lifetime, the builder will allocate only 1 time 8 bits in bit overhead
   
   On Ever bit(s) append, it will check if the 8 bits are written (like the old bitmap builder checking if it is % 8)
   
   If it is full (8 bits are written) i will write a new byte in the Memory Buffer and reset all values in the BitOverhead to false setting his length to 0 an dread to recieve other bits
   
   ```csharp
   public class BenchmarkBits
   {
       public static long ElapsedTicks(int repetition, Action action)
       {
           Stopwatch stopwatch = Stopwatch.StartNew();
   
           for (int i = 0; i < repetition; i++)
               action();
   
           stopwatch.Stop();
   
           return stopwatch.ElapsedTicks / repetition;
       }
   
       private readonly ITestOutputHelper output;
   
       private static readonly bool[] trueBits = Enumerable.Range(0, 100).Select(_ => true).ToArray();
       private static readonly bool[] falseBits = Enumerable.Range(0, 100).Select(_ => false).ToArray();
   
       public BenchmarkBits(ITestOutputHelper output)
       {
           this.output = output;
       }
   
       [Fact]
       public void Bench()
       {
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeOld)}");
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeNew)}");
       }
   
       private static void MakeNew()
       {
           var builder = new BufferBuilder(64);
   
           builder.AppendBit(true).AppendBit(true)
               .AppendBits(trueBits)
               .AppendBits(falseBits)
               .Build();
       }
   
       private static void MakeOld()
       {
           var builder = new ArrowBuffer.BitmapBuilder(64);
   
           builder.Append(true).Append(true)
               .AppendRange(trueBits)
               .AppendRange(falseBits)
               .Build();
       }
   }
   ````
   
   On 1 million iterations the new implementation is in mean 168 ticks
   On 1 million iterations the old implementation is in mean 312 ticks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186649052


##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   Same for sequential append it looks better, thus it scales good with larger value range
   
   Old was in mean 1183 Ticks vs 938 Ticks for the new
   
   ```csharp
   [Fact]
   public void Bench()
   {
       output.WriteLine($"Elapsed {ElapsedTicks(100000, MakeOld)}");
       output.WriteLine($"Elapsed {ElapsedTicks(100000, MakeNew)}");
   }
   
   private static void MakeNew()
   {
       var builder = new BufferBuilder(64);
   
       for (int i = 0; i < 1000; i++)
           builder.AppendBit(true);
   
       builder.Build();
   }
   
   private static void MakeOld()
   {
       var builder = new ArrowBuffer.BitmapBuilder(64);
   
       for (int i = 0; i < 1000; i++)
           builder.Append(true);
   
       builder.Build();
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob closed pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob closed pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types
URL: https://github.com/apache/arrow/pull/35299


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186374522


##########
csharp/src/Apache.Arrow/Arrays/StringArray.cs:
##########
@@ -23,8 +23,6 @@ namespace Apache.Arrow
 {
     public class StringArray: BinaryArray
     {
-        public static readonly Encoding DefaultEncoding = Encoding.UTF8;

Review Comment:
   Is there anyway you can mark this with deprecated and run it through a deprecation cycle?  Given this is a public member it's possible that others are relying on this.



##########
csharp/src/Apache.Arrow/Types/StringType.cs:
##########
@@ -14,14 +14,18 @@
 // limitations under the License.
 
 
+using System.Text;
+
 namespace Apache.Arrow.Types
 {
     public sealed class StringType : ArrowType
     {
+        public static readonly Encoding DefaultEncoding = Encoding.UTF8;
         public static StringType Default = new StringType();
 
         public override ArrowTypeId TypeId => ArrowTypeId.String;
         public override string Name => "utf8";
+        public Encoding Encoding => DefaultEncoding;

Review Comment:
   The arrow spec mandates UTF8 for strings:
   
   https://github.com/apache/arrow/blob/main/format/Schema.fbs#L156-L158
   
   I'm not sure I see the value in adding an `Encoding` property.  If a user is converting from some other representation to Arrow then they should either convert to UTF8 or create an extension type (with a binary array as the storage type)



##########
csharp/src/Apache.Arrow/ArrowBuffer.cs:
##########
@@ -57,11 +57,7 @@ public ReadOnlySpan<byte> Span
         }
 
         public ArrowBuffer Clone(MemoryAllocator allocator = default)
-        {
-            return new Builder<byte>(Span.Length)
-                .Append(Span)
-                .Build(allocator);
-        }
+            => new Builder.BufferBuilder(Span.Length).AppendBytes(Span).Build(allocator);

Review Comment:
   When stylistic changes are mixed with content changes it makes it harder to review.  Ideally we could adopt some kind of auto-formatter and configure the CI to enforce the same formatting style.  In the meantime I would discourage you from making style changes if possible (unless as part of a style-only PR)



##########
csharp/src/Apache.Arrow/Builder/IArrayBuilder.cs:
##########
@@ -0,0 +1,52 @@
+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+namespace Apache.Arrow.Builder
+{
+    public interface IArrayBuilder
+    {
+        IArrowType DataType { get; }
+        int Length { get; }
+        int NullCount { get; }
+        int Offset { get; }
+
+        IValueBufferBuilder[] Buffers { get; }
+        IArrayBuilder[] Children { get; }
+        IArrayBuilder Dictionary { get; } // Only used for dictionary type
+
+        IArrayBuilder AppendNull();
+        IArrayBuilder AppendNulls(int count);
+
+        IArrayBuilder AppendValues(ArrayData data);
+        IArrayBuilder AppendValues(IArrowArray data);
+
+        /// <summary>
+        /// Clear all contents appended so far.
+        /// </summary>
+        IArrayBuilder Clear();
+
+        /// <summary>
+        /// Reserve a given number of items' additional capacity.
+        /// </summary>
+        /// <param name="capacity">Number of new values.</param>
+        IArrayBuilder Reserve(int capacity);
+
+        /// <summary>
+        /// Resize the buffer to a given size.
+        /// </summary>
+        /// <remarks>
+        /// Note that if the required capacity is larger than the current length of the populated buffer so far,
+        /// the buffer's contents in the new, expanded region are undefined.

Review Comment:
   If the new region is undefined then how exactly is this different than `Reserve`? (I've been working a lot in C++ lately where "resize" means "create default elements to fill the new space" and "reserve" means "allocate new space but leave it undefined").



##########
csharp/src/Apache.Arrow/Builder/ArrayBuilder.cs:
##########
@@ -0,0 +1,353 @@
+using System;
+using System.Linq;
+using System.Threading.Tasks;
+using Apache.Arrow.Arrays;
+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+namespace Apache.Arrow.Builder
+{
+    public abstract class ArrayBuilder : IArrayBuilder
+    {
+        public IArrowType DataType { get; }
+
+        public int Length { get; internal set; }
+
+        public int NullCount { get; internal set; }

Review Comment:
   Why `internal` and not `protected`?



##########
csharp/src/Apache.Arrow/Builder/ArrayBuilder.cs:
##########
@@ -0,0 +1,353 @@
+using System;
+using System.Linq;
+using System.Threading.Tasks;
+using Apache.Arrow.Arrays;
+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+namespace Apache.Arrow.Builder
+{
+    public abstract class ArrayBuilder : IArrayBuilder
+    {
+        public IArrowType DataType { get; }
+
+        public int Length { get; internal set; }
+
+        public int NullCount { get; internal set; }
+
+        public int Offset { get; }
+
+        public IValueBufferBuilder[] Buffers { get; }
+        public IValueBufferBuilder ValidityBuffer => Buffers[0];
+
+        public IArrayBuilder[] Children { get; }
+
+        public IArrayBuilder Dictionary { get; }
+
+        public ArrayBuilder(

Review Comment:
   ```suggestion
           protected ArrayBuilder(
   ```
   
   Minor nit: I suppose it really doesn't matter since this is an abstract class.



##########
csharp/src/Apache.Arrow/Field.Builder.cs:
##########
@@ -25,7 +28,7 @@ public class Builder
         {
             private Dictionary<string, string> _metadata;
             private string _name;
-            private IArrowType _type;
+            internal IArrowType _type;

Review Comment:
   What's the motivation for this change?



##########
csharp/src/Apache.Arrow/BitUtility.cs:
##########
@@ -200,5 +208,75 @@ internal static int ReadInt32(ReadOnlyMemory<byte> value)
 
             return Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(value.Span));
         }
+
+        // Bytes

Review Comment:
   I'm not sure this comment is useful.  On the other hand, it would help to have a comment here explaining what this method is doing (e.g. bits must be size 8 or less or the excess will be ignored)



##########
csharp/src/Apache.Arrow/Types/Decimal256Type.cs:
##########
@@ -17,6 +17,9 @@ namespace Apache.Arrow.Types
 {
     public sealed class Decimal256Type: FixedSizeBinaryType
     {
+        // max int decimal 79228162514264337593543950335M : precision = 29
+        // max scaled decimal 1.2345678901234567890123456789M : scale = 28

Review Comment:
   Can we clean these up?



##########
csharp/src/Apache.Arrow/BitUtility.cs:
##########
@@ -200,5 +208,75 @@ internal static int ReadInt32(ReadOnlyMemory<byte> value)
 
             return Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(value.Span));
         }
+
+        // Bytes
+        public static byte ToByte(ref byte data, ReadOnlySpan<bool> bits)
+        {
+            for (int i = 0; i < Math.Min(8, bits.Length); i++)
+            {
+                SetBit(ref data, i, bits[i]);
+            }
+            return data;
+        }
+
+        public static void ToBytes(Span<byte> bytes, ReadOnlySpan<bool> bits)
+        {
+            for (int i = 0; i < bytes.Length; i++)
+            {
+                ToByte(ref bytes[i], bits.Slice(i * 8));
+            }
+        }
+
+        // Bits
+        public static void ToBits(Span<bool> bools, byte value)
+        {
+            for (int i = 0; i < 8; i++)
+            {
+                bools[i] = GetBit(value, i);
+            }
+        }
+
+        public static bool[] ToBits(byte value)
+        {
+            bool[] boolArray = new bool[8]; // initialize bool array with correct length
+
+            for (int i = 0; i < 8; i++)
+            {
+                boolArray[i] = GetBit(value, i);
+            }
+
+            return boolArray;
+        }
+
+        public static void ToBits(Span<bool> bits, ReadOnlySpan<byte> bytes)
+        {
+            for (int i = 0; i < bytes.Length; i++)
+            {
+                ToBits(bits.Slice(i * 8, 8), bytes[i]);
+            }
+        }
+
+        public static Span<bool> ToBits(ReadOnlySpan<byte> bytes)
+        {
+            Span<bool> bits = new bool[bytes.Length * 8].AsSpan();
+
+            ToBits(bits, bytes);
+
+            return bits;
+        }
+
+        internal static int NextPowerOfTwo(int n)
+        {
+            // 16 for int32

Review Comment:
   I don't know what this comment means



##########
csharp/src/Apache.Arrow/BitUtility.cs:
##########
@@ -200,5 +208,75 @@ internal static int ReadInt32(ReadOnlyMemory<byte> value)
 
             return Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(value.Span));
         }
+
+        // Bytes
+        public static byte ToByte(ref byte data, ReadOnlySpan<bool> bits)

Review Comment:
   It's a little confusing that `data` is both a ref parameter and a return value here.
   
   Also, given the name `ToByte` I would expect this method converted `bits` to a byte.  However, that isn't exactly what it is doing because it depends on the original value of `data`.  Maybe `SetBits` would be more appropriate?



##########
csharp/src/Apache.Arrow/BitUtility.cs:
##########
@@ -200,5 +208,75 @@ internal static int ReadInt32(ReadOnlyMemory<byte> value)
 
             return Unsafe.ReadUnaligned<int>(ref MemoryMarshal.GetReference(value.Span));
         }
+
+        // Bytes
+        public static byte ToByte(ref byte data, ReadOnlySpan<bool> bits)
+        {
+            for (int i = 0; i < Math.Min(8, bits.Length); i++)
+            {
+                SetBit(ref data, i, bits[i]);
+            }
+            return data;
+        }
+
+        public static void ToBytes(Span<byte> bytes, ReadOnlySpan<bool> bits)
+        {
+            for (int i = 0; i < bytes.Length; i++)
+            {
+                ToByte(ref bytes[i], bits.Slice(i * 8));
+            }
+        }
+
+        // Bits
+        public static void ToBits(Span<bool> bools, byte value)
+        {
+            for (int i = 0; i < 8; i++)
+            {
+                bools[i] = GetBit(value, i);
+            }
+        }
+
+        public static bool[] ToBits(byte value)
+        {
+            bool[] boolArray = new bool[8]; // initialize bool array with correct length
+
+            for (int i = 0; i < 8; i++)
+            {
+                boolArray[i] = GetBit(value, i);
+            }
+
+            return boolArray;
+        }
+
+        public static void ToBits(Span<bool> bits, ReadOnlySpan<byte> bytes)
+        {
+            for (int i = 0; i < bytes.Length; i++)
+            {
+                ToBits(bits.Slice(i * 8, 8), bytes[i]);
+            }
+        }
+
+        public static Span<bool> ToBits(ReadOnlySpan<byte> bytes)
+        {
+            Span<bool> bits = new bool[bytes.Length * 8].AsSpan();
+
+            ToBits(bits, bytes);
+
+            return bits;
+        }
+
+        internal static int NextPowerOfTwo(int n)

Review Comment:
   This needs a comment explaining what it is because it is not obvious looking at the code.



##########
csharp/src/Apache.Arrow/Builder/IArrayBuilder.cs:
##########
@@ -0,0 +1,52 @@
+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+namespace Apache.Arrow.Builder
+{
+    public interface IArrayBuilder
+    {
+        IArrowType DataType { get; }
+        int Length { get; }
+        int NullCount { get; }
+        int Offset { get; }
+
+        IValueBufferBuilder[] Buffers { get; }
+        IArrayBuilder[] Children { get; }
+        IArrayBuilder Dictionary { get; } // Only used for dictionary type
+
+        IArrayBuilder AppendNull();
+        IArrayBuilder AppendNulls(int count);
+
+        IArrayBuilder AppendValues(ArrayData data);
+        IArrayBuilder AppendValues(IArrowArray data);
+
+        /// <summary>
+        /// Clear all contents appended so far.
+        /// </summary>
+        IArrayBuilder Clear();
+
+        /// <summary>
+        /// Reserve a given number of items' additional capacity.
+        /// </summary>
+        /// <param name="capacity">Number of new values.</param>
+        IArrayBuilder Reserve(int capacity);
+
+        /// <summary>
+        /// Resize the buffer to a given size.
+        /// </summary>
+        /// <remarks>
+        /// Note that if the required capacity is larger than the current length of the populated buffer so far,
+        /// the buffer's contents in the new, expanded region are undefined.
+        /// </remarks>
+        /// <remarks>
+        /// Note that if the required capacity is smaller than the current length of the populated buffer so far,
+        /// the buffer will be truncated and items at the end of the buffer will be lost.
+        /// </remarks>
+        /// <param name="capacity">Number of values.</param>
+        IArrayBuilder Resize(int capacity);
+
+        ArrayData FinishInternal(MemoryAllocator allocator = default);

Review Comment:
   Does this need to be part of the interface?



##########
csharp/src/Apache.Arrow/Builder/IArrayBuilder.cs:
##########
@@ -0,0 +1,52 @@
+using Apache.Arrow.Memory;
+using Apache.Arrow.Types;
+
+namespace Apache.Arrow.Builder
+{
+    public interface IArrayBuilder

Review Comment:
   Should this replace `IArrowArrayBuilder`?  Should it extend it?  What is the relationship between the two?



##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }
+
+        /// <summary>
+        /// Creates an instance of the <see cref="BufferBuilder"/> class.
+        /// </summary>
+        /// <param name="valueBitSize">Number of bits of one value item.</param>
+        /// <param name="capacity">Number of items of initial capacity to reserve.</param>
+        public BufferBuilder(int capacity = DefaultCapacity)
+        {
+            Memory = new byte[capacity];
+            BitOverhead = new BitBuffer();
+
+            ByteLength = 0;
+        }
+
+        private void CommitBitBuffer(bool force = false)
+        {
+            if (BitOverhead.IsFull || force)
+            {
+                EnsureAdditionalBytes(1);
+                BitOverhead.ToByte(ref Memory.Span[ByteLength]);
+                BitOverhead.Reset();
+                ByteLength++;
+            }
+        }
+
+        public IBufferBuilder AppendBit(bool bit)
+        {
+            BitOverhead.Append(bit);
+            CommitBitBuffer();
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(ReadOnlySpan<bool> bits)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                int available = BitOverhead.AvailableLength;
+
+                if (bits.Length > available)
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits.Slice(0, available));
+
+                    // Commit to memory and reset
+                    CommitBitBuffer();
+
+                    bits = bits.Slice(available);
+                }
+                else
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits);
+
+                    bits = ReadOnlySpan<bool>.Empty;
+                }
+            }
+
+            if (bits.Length > 0)
+            {
+                int byteEnd = bits.Length / 8;
+                int bitEnd = byteEnd * 8;
+
+                if (byteEnd > 0)
+                {
+                    // Ensure byte length
+                    EnsureAdditionalBytes(byteEnd);
+
+                    // Raw Span copy to memory
+                    BitUtility.ToBytes(Memory.Span.Slice(ByteLength, byteEnd), bits.Slice(0, bitEnd));
+
+                    ByteLength += byteEnd;
+
+                    bits = bits.Slice(bitEnd);
+                }
+                
+                if (bits.Length > 0)
+                {
+                    // Fill byte buffer with last unfilled
+                    BitOverhead.Fill(bits);
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(bool value, int count)
+        {
+            Span<bool> span = new bool[count];
+
+            // default bool are already false
+            if (value)
+                for (int i = 0; i < count; i++)
+                    span[i] = value;
+
+            return AppendStructs(span);
+        }
+
+        public IBufferBuilder AppendByte(byte byteValue)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                // Fill current bit buffer
+                int available = BitOverhead.AvailableLength;
+
+                // Convert byte to bit array
+                Span<bool> bits = BitUtility.ToBits(byteValue).AsSpan();
+
+                // Fill byte buffer
+                BitOverhead.Fill(bits.Slice(0, available));
+
+                // Commit to memory and reset
+                CommitBitBuffer();
+
+                // Fill new bit buffer
+                BitOverhead.Fill(bits.Slice(available));
+            }
+            else
+            {
+                EnsureAdditionalBytes(1);
+                // Raw add to memory
+                Memory.Span[ByteLength] = byteValue;
+                ByteLength++;
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBytes(ReadOnlySpan<byte> bytes)
+        {
+            if (BitOverhead.Length == 0)
+            {
+                EnsureAdditionalBytes(bytes.Length);
+                // Raw Span copy to memory
+                bytes.CopyTo(Memory.Span.Slice(ByteLength, bytes.Length));
+                ByteLength += bytes.Length;
+            }
+            else
+            {
+                // Convert Bytes to Bits streamed in batchsize = 64
+                int offset = 0;
+                while (offset < bytes.Length)
+                {
+                    int remainingBytes = bytes.Length - offset;
+                    int bufferLength = Math.Min(128, remainingBytes);
+
+                    // Bits span
+                    Span<bool> buffer = new bool[bufferLength * 8];
+
+                    // Fill bits
+                    BitUtility.ToBits(buffer, bytes.Slice(offset, bufferLength));
+
+                    // Append batch bits
+                    AppendBits(buffer);
+                    offset += bufferLength;
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendStruct(bool value) => AppendBit(value);
+        public IBufferBuilder AppendStruct(byte value) => AppendByte(value);
+        public IBufferBuilder AppendStruct<T>(T value) where T : struct
+        {
+#if NETCOREAPP3_1_OR_GREATER
+            return AppendStructs(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
+#else
+            return AppendStructs<T>(new T[] { value }.AsSpan());
+#endif
+        }
+
+        public IBufferBuilder AppendStructs(ReadOnlySpan<bool> values) => AppendBits(values);
+        public IBufferBuilder AppendStructs(ReadOnlySpan<byte> values) => AppendBytes(values);

Review Comment:
   `Struct` is a confusing term.  How about `AppendValues`?  Or `AppendNative` or `AppendDotNet`?



##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }
+
+        /// <summary>
+        /// Creates an instance of the <see cref="BufferBuilder"/> class.
+        /// </summary>
+        /// <param name="valueBitSize">Number of bits of one value item.</param>
+        /// <param name="capacity">Number of items of initial capacity to reserve.</param>
+        public BufferBuilder(int capacity = DefaultCapacity)
+        {
+            Memory = new byte[capacity];
+            BitOverhead = new BitBuffer();
+
+            ByteLength = 0;
+        }
+
+        private void CommitBitBuffer(bool force = false)
+        {
+            if (BitOverhead.IsFull || force)
+            {
+                EnsureAdditionalBytes(1);
+                BitOverhead.ToByte(ref Memory.Span[ByteLength]);
+                BitOverhead.Reset();
+                ByteLength++;
+            }
+        }
+
+        public IBufferBuilder AppendBit(bool bit)
+        {
+            BitOverhead.Append(bit);
+            CommitBitBuffer();
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(ReadOnlySpan<bool> bits)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                int available = BitOverhead.AvailableLength;
+
+                if (bits.Length > available)
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits.Slice(0, available));
+
+                    // Commit to memory and reset
+                    CommitBitBuffer();
+
+                    bits = bits.Slice(available);
+                }
+                else
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits);
+
+                    bits = ReadOnlySpan<bool>.Empty;
+                }
+            }
+
+            if (bits.Length > 0)
+            {
+                int byteEnd = bits.Length / 8;
+                int bitEnd = byteEnd * 8;
+
+                if (byteEnd > 0)
+                {
+                    // Ensure byte length
+                    EnsureAdditionalBytes(byteEnd);
+
+                    // Raw Span copy to memory
+                    BitUtility.ToBytes(Memory.Span.Slice(ByteLength, byteEnd), bits.Slice(0, bitEnd));
+
+                    ByteLength += byteEnd;
+
+                    bits = bits.Slice(bitEnd);
+                }
+                
+                if (bits.Length > 0)
+                {
+                    // Fill byte buffer with last unfilled
+                    BitOverhead.Fill(bits);
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(bool value, int count)
+        {
+            Span<bool> span = new bool[count];
+
+            // default bool are already false
+            if (value)
+                for (int i = 0; i < count; i++)
+                    span[i] = value;
+
+            return AppendStructs(span);
+        }
+
+        public IBufferBuilder AppendByte(byte byteValue)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                // Fill current bit buffer
+                int available = BitOverhead.AvailableLength;
+
+                // Convert byte to bit array
+                Span<bool> bits = BitUtility.ToBits(byteValue).AsSpan();
+
+                // Fill byte buffer
+                BitOverhead.Fill(bits.Slice(0, available));
+
+                // Commit to memory and reset
+                CommitBitBuffer();
+
+                // Fill new bit buffer
+                BitOverhead.Fill(bits.Slice(available));
+            }
+            else
+            {
+                EnsureAdditionalBytes(1);
+                // Raw add to memory
+                Memory.Span[ByteLength] = byteValue;
+                ByteLength++;
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBytes(ReadOnlySpan<byte> bytes)
+        {
+            if (BitOverhead.Length == 0)
+            {
+                EnsureAdditionalBytes(bytes.Length);
+                // Raw Span copy to memory
+                bytes.CopyTo(Memory.Span.Slice(ByteLength, bytes.Length));
+                ByteLength += bytes.Length;
+            }
+            else
+            {
+                // Convert Bytes to Bits streamed in batchsize = 64
+                int offset = 0;
+                while (offset < bytes.Length)
+                {
+                    int remainingBytes = bytes.Length - offset;
+                    int bufferLength = Math.Min(128, remainingBytes);
+
+                    // Bits span
+                    Span<bool> buffer = new bool[bufferLength * 8];
+
+                    // Fill bits
+                    BitUtility.ToBits(buffer, bytes.Slice(offset, bufferLength));
+
+                    // Append batch bits
+                    AppendBits(buffer);
+                    offset += bufferLength;
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendStruct(bool value) => AppendBit(value);
+        public IBufferBuilder AppendStruct(byte value) => AppendByte(value);
+        public IBufferBuilder AppendStruct<T>(T value) where T : struct
+        {
+#if NETCOREAPP3_1_OR_GREATER
+            return AppendStructs(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
+#else
+            return AppendStructs<T>(new T[] { value }.AsSpan());
+#endif
+        }
+
+        public IBufferBuilder AppendStructs(ReadOnlySpan<bool> values) => AppendBits(values);
+        public IBufferBuilder AppendStructs(ReadOnlySpan<byte> values) => AppendBytes(values);
+        public IBufferBuilder AppendStructs<T>(ReadOnlySpan<T> values) where T : struct
+        {
+            ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(values);
+            return AppendBytes(bytes);
+        }
+        public IBufferBuilder AppendStructs<T>(T value, int count) where T : struct
+        {
+            Span<T> span = new T[count];
+
+            for (int i = 0; i < count; i++)
+                span[i] = value;
+
+            return AppendStructs<T>(span);
+        }
+
+        internal IBufferBuilder ReserveBytes(int numBytes)
+        {
+            EnsureAdditionalBytes(numBytes);
+            return this;
+        }
+
+        internal IBufferBuilder ResizeBytes(int numBytes)
+        {
+            EnsureBytes(numBytes);
+            ByteLength = numBytes;
+            return this;
+        }
+
+        public IBufferBuilder Clear()
+        {
+            Memory.Span.Fill(default);
+            ByteLength = 0;
+            return this;
+        }
+
+        public ArrowBuffer Build(MemoryAllocator allocator = default) => Build(64, allocator);
+
+        public ArrowBuffer Build(int byteSize, MemoryAllocator allocator = default)
+        {
+            if (BitOverhead.Length > 0)
+                CommitBitBuffer(true);
+
+            int bufferLength = checked((int)BitUtility.RoundUpToMultiplePowerOfTwo(ByteLength, byteSize));
+
+            MemoryAllocator memoryAllocator = allocator ?? MemoryAllocator.Default.Value;
+            IMemoryOwner<byte> memoryOwner = memoryAllocator.Allocate(bufferLength);
+            Memory.Slice(0, ByteLength).CopyTo(memoryOwner.Memory);
+
+            return new ArrowBuffer(memoryOwner);
+        }
+         
+        private void EnsureAdditionalBytes(int numBytes) => EnsureBytes(checked(Memory.Length + numBytes));

Review Comment:
   ```suggestion
           private void EnsureAdditionalBytes(int numBytes) => EnsureBytes(checked(ByteLength + numBytes));
   ```
   
   Otherwise this doesn't play nicely with `Clear`.



##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   This approach of using a BitBuffer for BitOverhead is inefficient because it is constantly allocating small byte spans.  I created a quick benchmark comparing this with a simpler approach for demonstration though its possible I'm missing something or did something incorrect:
   
   https://gist.github.com/westonpace/35b619c4c9e90824d93ecb8306a9e169
   
   ```
   |                Method | NumBits |          Mean |        Error |     StdDev |
   |---------------------- |-------- |--------------:|-------------:|-----------:|
   |           AppendFalse |       1 |      17.82 ns |     0.101 ns |   0.095 ns |
   |            AppendTrue |       1 |      19.65 ns |     0.317 ns |   0.296 ns |
   | AppendFalseNoOverhead |       1 |      10.42 ns |     0.072 ns |   0.067 ns |
   |  AppendTrueNoOverhead |       1 |      10.30 ns |     0.100 ns |   0.089 ns |
   |           AppendFalse |     100 |     257.41 ns |     1.035 ns |   0.968 ns |
   |            AppendTrue |     100 |     293.54 ns |     1.179 ns |   1.102 ns |
   | AppendFalseNoOverhead |     100 |      36.18 ns |     0.284 ns |   0.266 ns |
   |  AppendTrueNoOverhead |     100 |      34.80 ns |     0.094 ns |   0.088 ns |
   |           AppendFalse |   10000 |  22,638.86 ns |    77.641 ns |  68.827 ns |
   |            AppendTrue |   10000 |  25,443.68 ns |    65.197 ns |  60.985 ns |
   | AppendFalseNoOverhead |   10000 |      49.20 ns |     0.159 ns |   0.141 ns |
   |  AppendTrueNoOverhead |   10000 |      49.24 ns |     0.180 ns |   0.168 ns |
   |           AppendFalse |  100000 | 235,708.02 ns |   633.895 ns | 592.946 ns |
   |            AppendTrue |  100000 | 271,133.50 ns | 1,123.712 ns | 996.141 ns |
   | AppendFalseNoOverhead |  100000 |     346.36 ns |     0.456 ns |   0.380 ns |
   |  AppendTrueNoOverhead |  100000 |     351.31 ns |     1.170 ns |   1.094 ns |
   ```



##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }
+
+        /// <summary>
+        /// Creates an instance of the <see cref="BufferBuilder"/> class.
+        /// </summary>
+        /// <param name="valueBitSize">Number of bits of one value item.</param>
+        /// <param name="capacity">Number of items of initial capacity to reserve.</param>
+        public BufferBuilder(int capacity = DefaultCapacity)
+        {
+            Memory = new byte[capacity];
+            BitOverhead = new BitBuffer();
+
+            ByteLength = 0;
+        }
+
+        private void CommitBitBuffer(bool force = false)
+        {
+            if (BitOverhead.IsFull || force)
+            {
+                EnsureAdditionalBytes(1);
+                BitOverhead.ToByte(ref Memory.Span[ByteLength]);
+                BitOverhead.Reset();
+                ByteLength++;
+            }
+        }
+
+        public IBufferBuilder AppendBit(bool bit)
+        {
+            BitOverhead.Append(bit);
+            CommitBitBuffer();
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(ReadOnlySpan<bool> bits)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                int available = BitOverhead.AvailableLength;
+
+                if (bits.Length > available)
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits.Slice(0, available));
+
+                    // Commit to memory and reset
+                    CommitBitBuffer();
+
+                    bits = bits.Slice(available);
+                }
+                else
+                {
+                    // Fill byte buffer
+                    BitOverhead.Fill(bits);
+
+                    bits = ReadOnlySpan<bool>.Empty;
+                }
+            }
+
+            if (bits.Length > 0)
+            {
+                int byteEnd = bits.Length / 8;
+                int bitEnd = byteEnd * 8;
+
+                if (byteEnd > 0)
+                {
+                    // Ensure byte length
+                    EnsureAdditionalBytes(byteEnd);
+
+                    // Raw Span copy to memory
+                    BitUtility.ToBytes(Memory.Span.Slice(ByteLength, byteEnd), bits.Slice(0, bitEnd));
+
+                    ByteLength += byteEnd;
+
+                    bits = bits.Slice(bitEnd);
+                }
+                
+                if (bits.Length > 0)
+                {
+                    // Fill byte buffer with last unfilled
+                    BitOverhead.Fill(bits);
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBits(bool value, int count)
+        {
+            Span<bool> span = new bool[count];
+
+            // default bool are already false
+            if (value)
+                for (int i = 0; i < count; i++)
+                    span[i] = value;
+
+            return AppendStructs(span);
+        }
+
+        public IBufferBuilder AppendByte(byte byteValue)
+        {
+            if (BitOverhead.Length > 0)
+            {
+                // Fill current bit buffer
+                int available = BitOverhead.AvailableLength;
+
+                // Convert byte to bit array
+                Span<bool> bits = BitUtility.ToBits(byteValue).AsSpan();
+
+                // Fill byte buffer
+                BitOverhead.Fill(bits.Slice(0, available));
+
+                // Commit to memory and reset
+                CommitBitBuffer();
+
+                // Fill new bit buffer
+                BitOverhead.Fill(bits.Slice(available));
+            }
+            else
+            {
+                EnsureAdditionalBytes(1);
+                // Raw add to memory
+                Memory.Span[ByteLength] = byteValue;
+                ByteLength++;
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendBytes(ReadOnlySpan<byte> bytes)
+        {
+            if (BitOverhead.Length == 0)
+            {
+                EnsureAdditionalBytes(bytes.Length);
+                // Raw Span copy to memory
+                bytes.CopyTo(Memory.Span.Slice(ByteLength, bytes.Length));
+                ByteLength += bytes.Length;
+            }
+            else
+            {
+                // Convert Bytes to Bits streamed in batchsize = 64
+                int offset = 0;
+                while (offset < bytes.Length)
+                {
+                    int remainingBytes = bytes.Length - offset;
+                    int bufferLength = Math.Min(128, remainingBytes);
+
+                    // Bits span
+                    Span<bool> buffer = new bool[bufferLength * 8];
+
+                    // Fill bits
+                    BitUtility.ToBits(buffer, bytes.Slice(offset, bufferLength));
+
+                    // Append batch bits
+                    AppendBits(buffer);
+                    offset += bufferLength;
+                }
+            }
+
+            return this;
+        }
+
+        public IBufferBuilder AppendStruct(bool value) => AppendBit(value);
+        public IBufferBuilder AppendStruct(byte value) => AppendByte(value);
+        public IBufferBuilder AppendStruct<T>(T value) where T : struct
+        {
+#if NETCOREAPP3_1_OR_GREATER
+            return AppendStructs(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
+#else
+            return AppendStructs<T>(new T[] { value }.AsSpan());
+#endif
+        }
+
+        public IBufferBuilder AppendStructs(ReadOnlySpan<bool> values) => AppendBits(values);
+        public IBufferBuilder AppendStructs(ReadOnlySpan<byte> values) => AppendBytes(values);
+        public IBufferBuilder AppendStructs<T>(ReadOnlySpan<T> values) where T : struct
+        {
+            ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(values);
+            return AppendBytes(bytes);
+        }
+        public IBufferBuilder AppendStructs<T>(T value, int count) where T : struct
+        {
+            Span<T> span = new T[count];
+
+            for (int i = 0; i < count; i++)
+                span[i] = value;
+
+            return AppendStructs<T>(span);
+        }
+
+        internal IBufferBuilder ReserveBytes(int numBytes)
+        {
+            EnsureAdditionalBytes(numBytes);

Review Comment:
   Coming from the C++ implementation I would be surprised to learn that `ReserveBytes` reserved _additional_ bytes and didn't simply ensure bytes.  Maybe name this method `ReserveAdditionalBytes`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186646427


##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   During its lifetime, the builder will allocate only 1 time 8 bits in bit overhead
   
   On every bit(s) append, it will check if the 8 bits are written (like the old bitmap builder checking if it is % 8)
   
   If it is full (8 bits are written) i will write a new byte in the Memory Buffer and reset all values in the BitOverhead to false setting his length to 0 and ready to recieve other bits
   
   ```csharp
   public class BenchmarkBits
   {
       public static long ElapsedTicks(int repetition, Action action)
       {
           Stopwatch stopwatch = Stopwatch.StartNew();
   
           for (int i = 0; i < repetition; i++)
               action();
   
           stopwatch.Stop();
   
           return stopwatch.ElapsedTicks / repetition;
       }
   
       private readonly ITestOutputHelper output;
   
       private static readonly bool[] trueBits = Enumerable.Range(0, 100).Select(_ => true).ToArray();
       private static readonly bool[] falseBits = Enumerable.Range(0, 100).Select(_ => false).ToArray();
   
       public BenchmarkBits(ITestOutputHelper output)
       {
           this.output = output;
       }
   
       [Fact]
       public void Bench()
       {
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeOld)}");
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeNew)}");
       }
   
       private static void MakeNew()
       {
           var builder = new BufferBuilder(64);
   
           builder.AppendBit(true).AppendBit(true)
               .AppendBits(trueBits)
               .AppendBits(falseBits)
               .Build();
       }
   
       private static void MakeOld()
       {
           var builder = new ArrowBuffer.BitmapBuilder(64);
   
           builder.Append(true).Append(true)
               .AppendRange(trueBits)
               .AppendRange(falseBits)
               .Build();
       }
   }
   ````
   
   On 1 million iterations the new implementation is in mean 168 ticks
   On 1 million iterations the old implementation is in mean 312 ticks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186646427


##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   During its lifetime, the builder will allocate only 1 time 8 bits in bit overhead
   
   On Ever bit(s) append, it will check if the 8 bits are written (like the old bitmap builder checking if it is % 8)
   
   If it is full (8 bits are written) i will write a new byte in the Memory Buffer and reset all values in the BitOverhead to false setting his length to 0 an dread to recieve other bits
   
   ```csharp
   public class BenchmarkBits
   {
       public static long ElapsedTicks(int repetition, Action action)
       {
           Stopwatch stopwatch = Stopwatch.StartNew();
   
           for (int i = 0; i < repetition; i++)
               action();
   
           stopwatch.Stop();
   
           return stopwatch.ElapsedTicks / repetition;
       }
   
       private readonly ITestOutputHelper output;
   
       private static readonly bool[] trueBits = Enumerable.Range(0, 100).Select(_ => true).ToArray();
       private static readonly bool[] falseBits = Enumerable.Range(0, 100).Select(_ => false).ToArray();
   
       public BenchmarkBits(ITestOutputHelper output)
       {
           this.output = output;
       }
   
       [Fact]
       public void Bench()
       {
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeOld)}");
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeNew)}");
       }
   
       private static void MakeNew()
       {
           var builder = new BufferBuilder(64);
   
           builder.AppendBit(true).AppendBit(true)
               .AppendBits(trueBits)
               .AppendBits(falseBits)
               .Build();
       }
   
       private static void MakeOld()
       {
           var builder = new ArrowBuffer.BitmapBuilder(64);
   
           builder.Append(true).Append(true)
               .AppendRange(trueBits)
               .AppendRange(falseBits)
               .Build();
       }
   }
   ````
   
   On 1 million iterations the new implementation is in mean 168ticks
   On 1 million iterations the old implementation is in mean 312 ticks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on PR #35299:
URL: https://github.com/apache/arrow/pull/35299#issuecomment-1537094036

   > Also, it seems there is already ArrowBuffer.BitmapBuilder which is a buffer builder that is optimized for bits (and #35342 adds an efficient `AppendRange`). Perhaps you could use that and leave your buffer builder focused on non-bit types? Either way, it might be good to align on one approach to maintain.
   > 
   > CC @asmirnov82
   
   Indeed, but for nested structures it needs specify the builders to check how to merge correctly 2 arrow buffers knowing if they are bit based or byte based
   
   the main reason would be merge 2 array data by just checking their datatypes since their structures should be the same
   
   ```csharp
   public virtual IArrayBuilder AppendValues(ArrayData data)
   {
       // TODO: Make better / recursive fields data type check
       if (data.DataType.TypeId != DataType.TypeId)
           throw new ArgumentException($"Cannot append data type {data.DataType} in builder with data type {DataType}");
   
       NullCount += data.NullCount;
       Length += data.Length;
   
       Reserve(data.Length);
   
       for (int i = 0; i < Buffers.Length; i++)
       {
           IValueBufferBuilder current = Buffers[i];
           ArrowBuffer other = data.Buffers[i];
   
           if (current.ValueBitSize % 8 == 0)
           {
               // Full byte encoded
               current.AppendBytes(other.Span);
           }
           else
           {
               // Safe copy Bytes and remaining bits
               int end = (data.Length * current.ValueBitSize) / 8;
   
               current.AppendBytes(other.Span.Slice(0, end));
   
               Span<bool> bits = BitUtility.BytesToBits(other.Span.Slice(end)).Slice(0, data.Length - end * 8);
               current.AppendBits(bits);
           }
       }
   
       if (Children != null && data.Children != null)
       {
           for (int i = 0; i < Children.Length; i++)
           {
               Children[i].AppendValues(data.Children[i]);
           }
       }
   
       if (Dictionary != null && data.Dictionary != null)
       {
           Dictionary.AppendValues(data.Dictionary);
       }
   
       return this;
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35299:
URL: https://github.com/apache/arrow/pull/35299#issuecomment-1519759761

   * Closes: #35285


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on PR #35299:
URL: https://github.com/apache/arrow/pull/35299#issuecomment-1737769913

   Need to close this pr, too many structural changes to handle large memory allocation (the buffer length are in int32, should use buffer which can handle uint64 length at least for 2GB string elements)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35299:
URL: https://github.com/apache/arrow/pull/35299#issuecomment-1519759863

   :warning: GitHub issue #35285 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35299:
URL: https://github.com/apache/arrow/pull/35299#issuecomment-1536777765

   Also, it seems there is already ArrowBuffer.BitmapBuilder which is a buffer builder that is optimized for bits (and https://github.com/apache/arrow/pull/35342 adds an efficient `AppendRange`).  Perhaps you could use that and leave your buffer builder focused on non-bit types? Either way, it might be good to align on one approach to maintain.
   
   CC @asmirnov82 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Platob commented on a diff in pull request #35299: GH-35285: [C#] Need refacto IArrowArrayBuilder for nested types

Posted by "Platob (via GitHub)" <gi...@apache.org>.
Platob commented on code in PR #35299:
URL: https://github.com/apache/arrow/pull/35299#discussion_r1186646427


##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   During its lifetime, the builder will allocate only 1 time 8 bits in bit overhead
   
   On every bit(s) append, it will check if the 8 bits are written (like the old bitmap builder checking if it is % 8)
   
   If it is full (8 bits are written) i will write a new byte in the Memory Buffer and reset all values in the BitOverhead to false setting his length to 0 and ready to recieve other bits
   
   ```csharp
   public class BenchmarkBits
   {
       public static long ElapsedTicks(int repetition, Action action)
       {
           Stopwatch stopwatch = Stopwatch.StartNew();
   
           for (int i = 0; i < repetition; i++)
               action();
   
           stopwatch.Stop();
   
           return stopwatch.ElapsedTicks / repetition;
       }
   
       private readonly ITestOutputHelper output;
   
       private static readonly bool[] trueBits = Enumerable.Range(0, 100).Select(_ => true).ToArray();
       private static readonly bool[] falseBits = Enumerable.Range(0, 100).Select(_ => false).ToArray();
   
       public BenchmarkBits(ITestOutputHelper output)
       {
           this.output = output;
       }
   
       [Fact]
       public void Bench()
       {
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeOld)}");
           output.WriteLine($"Elapsed {ElapsedTicks(1000000, MakeNew)}");
       }
   
       private static void MakeNew()
       {
           var builder = new BufferBuilder(64);
   
           builder.AppendBit(true).AppendBit(true)
               .AppendBits(trueBits)
               .AppendBits(falseBits)
               .Build();
       }
   
       private static void MakeOld()
       {
           var builder = new ArrowBuffer.BitmapBuilder(64);
   
           builder.Append(true).Append(true)
               .AppendRange(trueBits)
               .AppendRange(falseBits)
               .Build();
       }
   }
   ````
   
   On 1 million iterations the new implementation is in mean 168 ticks
   On 1 million iterations the old implementation is in mean 312 ticks



##########
csharp/src/Apache.Arrow/Builder/BufferBuilder.cs:
##########
@@ -0,0 +1,356 @@
+using System;
+using System.Buffers;
+using System.Collections.Generic;
+using System.Runtime.CompilerServices;
+using System.Runtime.InteropServices;
+using Apache.Arrow.Memory;
+
+namespace Apache.Arrow.Builder
+{
+    public class BufferBuilder : IBufferBuilder
+    {
+        public class BitBuffer
+        {
+            private readonly bool[] _bits;
+
+            public int Length { get; private set; }
+            public int AvailableLength => Capacity - Length;
+
+            public int Capacity;
+
+            public bool IsFull => Length == Capacity;
+            public byte ToByte(ref byte data) => BitUtility.ToByte(ref data, _bits);
+
+            public BitBuffer(int capacity = 8)
+            {
+                Capacity = capacity;
+                _bits = new bool[capacity];
+                Length = 0;
+            }
+
+            public void Append(bool bit) => _bits[Length++] = bit;
+            public void Fill(ReadOnlySpan<bool> bits)
+            {
+                bits.CopyTo(_bits.AsSpan().Slice(Length, bits.Length));
+                Length += bits.Length;
+            }
+
+            public void Reset()
+            {
+                for (int i = 0; i < _bits.Length; i++)
+                {
+                    _bits[i] = false;
+                }
+                Length = 0;
+            }
+        }
+
+        private const int DefaultCapacity = 64;
+        public int ByteLength { get; private set; }
+
+        public Memory<byte> Memory { get; private set; }
+        public BitBuffer BitOverhead { get; }

Review Comment:
   Same for sequential append it looks better, thus it scales good with larger value range
   
   Old was in mean 1183 Ticks vs 938 Ticks for the new
   
   ```csharp
   [Fact]
   public void Bench()
   {
       output.WriteLine($"Elapsed {ElapsedTicks(100000, MakeOld)}");
       output.WriteLine($"Elapsed {ElapsedTicks(100000, MakeNew)}");
   }
   
   private static void MakeNew()
   {
       var builder = new BufferBuilder(64);
   
       for (int i = 0; i < 1000; i++)
           builder.AppendBit(true);
   
       builder.Build();
   }
   
   private static void MakeOld()
   {
       var builder = new ArrowBuffer.BitmapBuilder(64);
   
       for (int i = 0; i < 1000; i++)
           builder.Append(true);
   
       builder.Build();
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org