You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/04/25 15:25:05 UTC

[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support

eerhardt commented on a change in pull request #7032:
URL: https://github.com/apache/arrow/pull/7032#discussion_r415072524



##########
File path: csharp/src/Apache.Arrow/Apache.Arrow.csproj
##########
@@ -4,7 +4,7 @@
     <TargetFrameworks>netstandard1.3;netcoreapp2.1</TargetFrameworks>
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
     <DefineConstants>$(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T</DefineConstants>
-    
+

Review comment:
       (nit) Can you revert this change? That way it doesn't pollute version history.

##########
File path: csharp/Directory.Build.props
##########
@@ -21,6 +21,12 @@
     <AssemblyOriginatorKeyFile>$(CSharpDir)ApacheArrow.snk</AssemblyOriginatorKeyFile>
   </PropertyGroup>
 
+  <ItemGroup>
+    <AssemblyAttribute Include="System.Runtime.CompilerServices.InternalsVisibleTo">

Review comment:
       I don't see where this is necessary. I've removed it and it still builds and tests pass on my machine.
   
   Can this be removed?
   
   In general, in the http://github.com/dotnet/runtime tests, we try to avoid using `InternalsVisibleTo` for testing, and instead write the tests in the same way a consumer of the library would use it.

##########
File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs
##########
@@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length)
             length = Math.Min(Length - offset, length);
             offset += Offset;
 
-            return new ArrayData(DataType, length, -1, offset, Buffers, Children);
+            return new ArrayData(DataType, length, NullCount, offset, Buffers, Children);

Review comment:
       This isn't correct because we are slicing a subset of the `ArrayData`.
   
   If the total `ArrayData` has values:
   
   `0`, `null`, `2`, `3`, `4`
   
   `NullCount` for the outer `ArrayData` will be `1`.
   
   But if we are slicing from `offset = 2` and `length = 3` and creating a new `ArrayData` from the above, the new `ArrayData` won't have any nulls in it. But with this change, it will have `NullCount = 1` from its parent.
   
   `NullCount = -1` in the original code means "we don't know yet, it needs to be computed". That was done so we don't have to count the nulls in the sliced values until it is necessary. Since there are cases where it may not be required.

##########
File path: csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
##########
@@ -4,6 +4,8 @@
   <PropertyGroup>
     <TargetFramework>netcoreapp2.1</TargetFramework>
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
+    <SignAssembly>true</SignAssembly>

Review comment:
       If we can remove the `InternalsVisibleTo` attribute, we can revert this change as well.

##########
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##########
@@ -99,43 +105,63 @@ public abstract class PrimitiveArrayBuilder<T, TArray, TBuilder> : IArrowArrayBu
     {
         protected TBuilder Instance => this as TBuilder;
         protected ArrowBuffer.Builder<T> ValueBuffer { get; }
+        protected BooleanArray.Builder ValidityBuffer { get; }
 
         public int Length => ValueBuffer.Length;
 
+        protected int NullCount { get; set; }
+
         // TODO: Implement support for null values (null bitmaps)
 
         internal PrimitiveArrayBuilder()
         {
             ValueBuffer = new ArrowBuffer.Builder<T>();
+            ValidityBuffer = new BooleanArray.Builder();
         }
 
         public TBuilder Resize(int length)
         {
             ValueBuffer.Resize(length);
+            ValidityBuffer.Resize(length + 1);
             return Instance;
         }
 
         public TBuilder Reserve(int capacity)
         {
             ValueBuffer.Reserve(capacity);
+            ValidityBuffer.Reserve(capacity + 1);
             return Instance;
         }
 
         public TBuilder Append(T value)
         {
             ValueBuffer.Append(value);
+            ValidityBuffer.Append(true);
             return Instance;
         }
 
         public TBuilder Append(ReadOnlySpan<T> span)
         {
             ValueBuffer.Append(span);
+            ValidityBuffer.Append(span != Span<T>.Empty);

Review comment:
       This needs to be similar to `AppendRange` below and append a `true` for each element in the `span`.

##########
File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs
##########
@@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j)
         public TArray Build(MemoryAllocator allocator = default)
         {
             return Build(
-                ValueBuffer.Build(allocator), ArrowBuffer.Empty,
-                ValueBuffer.Length, 0, 0);
+                ValueBuffer.Build(allocator), ValidityBuffer.Build(allocator).ValueBuffer,

Review comment:
       How do you feel about making the `ValidityBuffer` lazy? That is, if no `null` values were ever appended to the builder, then the `ValidityBuffer` stays empty. And only upon the first `null` value being append do we allocate a `ValidityBuffer` and append all `true`s for all the values already appended before the first `null`.
   
   The advantage is that when a buffer doesn't contain any nulls, we don't need to allocate memory here, and it will be smaller memory size when transferring the Arrow bytes across a network.

##########
File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs
##########
@@ -81,7 +110,9 @@ List<string> ConvertStringArrayToList(StringArray array)
         public void NestedListArrayBuilder()
         {
             var childListType = new ListType(Int64Type.Default);
-            var parentListBuilder = new ListArray.Builder(childListType);
+            // Because internals are now visible here, the wrong

Review comment:
       If we can remove `InternalsVisibleTo`, this can be reverted.

##########
File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs
##########
@@ -24,25 +24,38 @@ public class BooleanArray: Array
     {
         public class Builder : IArrowArrayBuilder<bool, BooleanArray, Builder>
         {
-            private ArrowBuffer.Builder<byte> ValueBuffer { get; }

Review comment:
       I don't see why `ValueBuffer` and `ValidityBuffer` need to be `internal` here. Can they remain `private`?

##########
File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs
##########
@@ -149,8 +180,8 @@ public void ProducesExpectedArray()
 
             Assert.IsAssignableFrom<TArray>(array);
             Assert.NotNull(array);
-            Assert.Equal(3, array.Length);
-            Assert.Equal(0, array.NullCount);
+            Assert.Equal(expectedLength, array.Length);
+            Assert.Equal(expectedNullCount, array.NullCount);

Review comment:
       It might be good to have at least a test (or maybe more) verify that the "Validity" bit map is correct.

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
             {
                 ValueOffsets.Append(Offset);
 
-                var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-                    new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+                var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+                    new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
                 return Build(data);
             }
 
-            public TBuilder Append(byte value)
+            public TBuilder AppendNull()
+            {
+                ValueOffsets.Append(Offset);
+                Offset++;
+                NullCount++;
+                return Instance;
+            }
+
+            public TBuilder Append(T value)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(value);
                 Offset++;
+                ValidityBuffer.Append(true);
                 return Instance;
             }
 
-            public TBuilder Append(ReadOnlySpan<byte> span)
+            public TBuilder Append(ReadOnlySpan<T> span)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(span);
-                Offset += span.Length;
+                ValidityBuffer.Append(true);
+                Offset += span == Span<T>.Empty
+                                            ? 1
+                                            : span.Length;
                 return Instance;
             }
 
-            public TBuilder AppendRange(IEnumerable<byte[]> values)
+            public TBuilder AppendRange(IEnumerable<T[]> values)
             {
                 foreach (var arr in values)
                 {
+                    if (arr != null && arr.Length > 0)

Review comment:
       Is this check backwards? Shouldn't it be:
   
   ```C#
   if (arr == null)
   { 
       AppendNull();
       continue; 
   }
   ```

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
             {
                 ValueOffsets.Append(Offset);
 
-                var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-                    new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+                var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+                    new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
                 return Build(data);
             }
 
-            public TBuilder Append(byte value)
+            public TBuilder AppendNull()
+            {
+                ValueOffsets.Append(Offset);
+                Offset++;
+                NullCount++;
+                return Instance;
+            }
+
+            public TBuilder Append(T value)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(value);
                 Offset++;
+                ValidityBuffer.Append(true);
                 return Instance;
             }
 
-            public TBuilder Append(ReadOnlySpan<byte> span)
+            public TBuilder Append(ReadOnlySpan<T> span)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(span);
-                Offset += span.Length;
+                ValidityBuffer.Append(true);
+                Offset += span == Span<T>.Empty
+                                            ? 1

Review comment:
       Check out https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout for an explanation of how this works. This is the key part:
   
   For example, the position and length of slot j is computed as:
   
   ```
   slot_position = offsets[j]
   slot_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)
   ```
   
   If you add to `Offset` here, that will make this element appear to have `length == 1`, which it shouldn't.

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
             {
                 ValueOffsets.Append(Offset);
 
-                var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-                    new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+                var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+                    new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
                 return Build(data);
             }
 
-            public TBuilder Append(byte value)
+            public TBuilder AppendNull()
+            {
+                ValueOffsets.Append(Offset);
+                Offset++;

Review comment:
       Does `Offset` really need to be incremented here? We didn't append any values to `ValueBuffer`.

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -48,21 +48,28 @@ public BinaryArray(ArrowTypeId typeId, ArrayData data)
             data.EnsureBufferCount(3);
         }
 
-        public abstract class BuilderBase<TArray, TBuilder> : IArrowArrayBuilder<byte, TArray, TBuilder>
+        public abstract class BuilderBase<T, TArray, TBuilder> : IArrowArrayBuilder<T, TArray, TBuilder>

Review comment:
       Why did we need to add the `T` here? This is the `BinaryArray` class, which implies the `T` is `byte`. And looking at all the usages (above and in `StringArray`) we are only filling the `T` with `byte`.

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -48,21 +48,28 @@ public BinaryArray(ArrowTypeId typeId, ArrayData data)
             data.EnsureBufferCount(3);
         }
 
-        public abstract class BuilderBase<TArray, TBuilder> : IArrowArrayBuilder<byte, TArray, TBuilder>
+        public abstract class BuilderBase<T, TArray, TBuilder> : IArrowArrayBuilder<T, TArray, TBuilder>
+            where T: struct
             where TArray : IArrowArray
-            where TBuilder : class, IArrowArrayBuilder<byte, TArray, TBuilder>
+            where TBuilder : class, IArrowArrayBuilder<T, TArray, TBuilder>
         {
+
             protected IArrowType DataType { get; }
             protected TBuilder Instance => this as TBuilder;
             protected ArrowBuffer.Builder<int> ValueOffsets { get; }
-            protected ArrowBuffer.Builder<byte> ValueBuffer { get; }
+            protected ArrowBuffer.Builder<T> ValueBuffer { get; }
+            protected BooleanArray.Builder ValidityBuffer { get; }
+
             protected int Offset { get; set; }
+            protected int ValidityOffset { get; set; }

Review comment:
       `ValidityOffset` doesn't appear to be used.

##########
File path: csharp/src/Apache.Arrow/Arrays/BinaryArray.cs
##########
@@ -73,61 +80,88 @@ public TArray Build(MemoryAllocator allocator = default)
             {
                 ValueOffsets.Append(Offset);
 
-                var data = new ArrayData(DataType, ValueOffsets.Length - 1, 0, 0,
-                    new[] { ArrowBuffer.Empty, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
+                var data = new ArrayData(DataType, ValueOffsets.Length - 1, NullCount, 0,
+                    new[] { ValidityBuffer.Build(allocator).ValueBuffer, ValueOffsets.Build(allocator), ValueBuffer.Build(allocator) });
 
                 return Build(data);
             }
 
-            public TBuilder Append(byte value)
+            public TBuilder AppendNull()
+            {
+                ValueOffsets.Append(Offset);
+                Offset++;
+                NullCount++;
+                return Instance;
+            }
+
+            public TBuilder Append(T value)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(value);
                 Offset++;
+                ValidityBuffer.Append(true);
                 return Instance;
             }
 
-            public TBuilder Append(ReadOnlySpan<byte> span)
+            public TBuilder Append(ReadOnlySpan<T> span)
             {
                 ValueOffsets.Append(Offset);
                 ValueBuffer.Append(span);
-                Offset += span.Length;
+                ValidityBuffer.Append(true);
+                Offset += span == Span<T>.Empty
+                                            ? 1

Review comment:
       I'm not sure this is correct. If I append an empty value to the builder, I would expect the `ValueOffsets` to get the current `Offset`, but since there were no values append, `Offset` stays the same. That way the "Length" of the current element will be `0`.

##########
File path: csharp/src/Apache.Arrow/Arrays/BooleanArray.cs
##########
@@ -153,17 +182,25 @@ private void CheckIndex(int index)
                 new[] { nullBitmapBuffer, valueBuffer }))
         { }
 
-        public BooleanArray(ArrayData data) 
+        public BooleanArray(ArrayData data)
             : base(data)
         {
             data.EnsureDataType(ArrowTypeId.Boolean);
         }
 
         public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor);
 
+        [Obsolete("GetBoolean does not support null values. Use GetValue instead (which this method invokes internally).")]

Review comment:
       This is causing warnings in the build now because the tests are still using this method. Can we update the tests to either:
   
   1. Use the new method or
   2. If they are explicitly testing this method, then put a `#pragma warning disable` to disable the warning in the build.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org