You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/10 12:30:54 UTC

[GitHub] [arrow] pitrou opened a new pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

pitrou opened a new pull request #12599:
URL: https://github.com/apache/arrow/pull/12599


   The original wording could lead to misunderstandings about the presence of a validity bitmap in struct arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824218929



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       sorry, I meant to use the defered pointer for anything. I was thinking in terms of leveraging SIMD over null and non-null slots.
   
   An example (in Rust):
   
   ```rust
   // cargo miri run --example example.rs
   
   // example.rs
   fn main() {
       let mut a = Vec::<u8>::with_capacity(4); // similar to aligned_alloc
   
       let mut dst = a.as_mut_ptr(); // `*mut u8`
   
       // create a [10, ?, 10, ?]
       for i in 0..4 {
           if i % 2 == 0 {
               unsafe { dst.write(10) }
           } else {
               // let's skip this branch to "save" a `.write`
           }
           dst = unsafe { dst.add(1) };
       }
       unsafe { a.set_len(4) }    // this is unsound in Rust: we can't "expose" uninitialized memory
   
       let data = a.as_slice();
   
       // this is UB, we use un-initialized memory; however, we would benefit from being able to perform 
       /// this op, if e.g. we wanted to perform the +1 over all values in lanes
       data[1] + 1;
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824114517



##########
File path: docs/source/format/Columnar.rst
##########
@@ -522,19 +521,24 @@ The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
           |------------|-------------|-------------|-------------|-------------|
           | 1          | 2           | unspecified | 4           | unspecified |
 
-While a struct does not have physical storage for each of its semantic
-slots (i.e. each scalar C-like struct), an entire struct slot can be
-set to null via the validity bitmap. Any of the child field arrays can
-have null values according to their respective independent validity
-bitmaps. This implies that for a particular struct slot the validity
-bitmap for the struct array might indicate a null slot when one or
-more of its child arrays has a non-null value in their corresponding
-slot.  When reading the struct array the parent validity bitmap takes
-priority.  This is illustrated in the example above, the child arrays
-have valid entries for the null struct but are 'hidden' from the
-consumer by the parent array's validity bitmap.  However, when treated
-independently corresponding values of the children array will be
-non-null.
+Struct Validity
+~~~~~~~~~~~~~~~
+
+A struct array has its own validity bitmap in addition to each of its

Review comment:
       ```suggestion
   A struct array has its own validity bitmap that is independent of its
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Finished :arrow_down:0.38% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1064005191


   https://issues.apache.org/jira/browse/ARROW-15846


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824371738



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       This would be a change in specification but a minor one.   I'd point out that i think unless you specify zero initialized you still need to worry about UB because some could choose MAX_INT.    Where adding 1 for the signed values would cause an overflow and UB.    C++ amortized null values by finding chunks of non null sequences 256 values at a time and dispatching to SIMD for these cases




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824218929



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       sorry, I meant to use the defered pointer for anything. I was thinking in terms of leveraging SIMD over null and non-null slots.
   
   An example (in Rust):
   
   ```rust
   // cargo miri run --example example.rs
   
   // example.rs
   fn main() {
       let mut a = Vec::<u8>::with_capacity(4); // similar to aligned_alloc
   
       let mut dst = a.as_mut_ptr(); // `*mut u8`
   
       // create a [10, ?, 10, ?]
       for i in 0..4 {
           if i % 2 == 0 {
               unsafe { dst.write(10) }
           } else {
               // let's skip this branch to "save" a `.write`
           }
           dst = unsafe { dst.add(1) };
       }
       unsafe { a.set_len(4) }    // this is unsound in Rust: we can't "expose" uninitialized memory
   
       let data = a.as_slice();
   
       // this is UB, we use un-initialized memory; however, we would benefit from being able to perform this op, if e.g.
       // we wanted to perform the +1 over all values in lanes
       data[1] + 1;
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Finished :arrow_down:0.38% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824966527



##########
File path: docs/source/format/Columnar.rst
##########
@@ -522,19 +521,24 @@ The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
           |------------|-------------|-------------|-------------|-------------|
           | 1          | 2           | unspecified | 4           | unspecified |
 
-While a struct does not have physical storage for each of its semantic
-slots (i.e. each scalar C-like struct), an entire struct slot can be
-set to null via the validity bitmap. Any of the child field arrays can
-have null values according to their respective independent validity
-bitmaps. This implies that for a particular struct slot the validity
-bitmap for the struct array might indicate a null slot when one or
-more of its child arrays has a non-null value in their corresponding
-slot.  When reading the struct array the parent validity bitmap takes
-priority.  This is illustrated in the example above, the child arrays
-have valid entries for the null struct but are 'hidden' from the
-consumer by the parent array's validity bitmap.  However, when treated
-independently corresponding values of the children array will be
-non-null.
+Struct Validity
+~~~~~~~~~~~~~~~
+
+A struct array has its own validity bitmap in addition to each of its
+child arrays' own validity bitmap. This implies that, for a particular struct
+array slot, the validity bitmap for the struct array might indicate a null
+when one or more of its child arrays has a non-null value in their corresponding
+slot; or conversely, a child array might have a null in its validity bitmap
+while the struct array's validity bitmap shows a non-null value.
+
+Therefore, to know whether a particular child entry is null, one must

Review comment:
       ```suggestion
   Therefore, to know whether a particular child entry is valid, one must
   ```

##########
File path: docs/source/format/Columnar.rst
##########
@@ -522,19 +521,24 @@ The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
           |------------|-------------|-------------|-------------|-------------|
           | 1          | 2           | unspecified | 4           | unspecified |
 
-While a struct does not have physical storage for each of its semantic
-slots (i.e. each scalar C-like struct), an entire struct slot can be
-set to null via the validity bitmap. Any of the child field arrays can
-have null values according to their respective independent validity
-bitmaps. This implies that for a particular struct slot the validity
-bitmap for the struct array might indicate a null slot when one or
-more of its child arrays has a non-null value in their corresponding
-slot.  When reading the struct array the parent validity bitmap takes
-priority.  This is illustrated in the example above, the child arrays
-have valid entries for the null struct but are 'hidden' from the
-consumer by the parent array's validity bitmap.  However, when treated
-independently corresponding values of the children array will be
-non-null.
+Struct Validity
+~~~~~~~~~~~~~~~
+
+A struct array has its own validity bitmap in addition to each of its
+child arrays' own validity bitmap. This implies that, for a particular struct
+array slot, the validity bitmap for the struct array might indicate a null
+when one or more of its child arrays has a non-null value in their corresponding
+slot; or conversely, a child array might have a null in its validity bitmap
+while the struct array's validity bitmap shows a non-null value.

Review comment:
       ```suggestion
   A struct array has its own validity bitmap in addition to each of its
   child arrays' own validity bitmap. The validity bitmap for the struct
   array might indicate a null when one or more of its child arrays has
   a non-null value in their corresponding slot; or conversely, a child
   array might have a null in its validity bitmap while the struct array's
   validity bitmap shows a non-null value.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r825314240



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       The rational of this comment is related to general observations about C, C++ and Rust, nicely described e.g. here: https://www.ralfj.de/blog/2019/07/14/uninit.html
   
   all zero-initialized memory is initialized, but not all initialized memory is zero-initialized. My comment/suggestion to require regions to be initialized so that users can access and operate over null slots without triggering undefined behavior.
   
   However, I do agree with Micah that this is beyond this PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824416986



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       > In Rust overflowing panics in debug and wraps in release (doesn't UB)
   
   OK, there are differences in C++ and Rust's definition of UB so I guess we should be careful with terminology.  I still think this is technically a specification change, and I feel like it has been discussed on the past in the mailing list (but maybe that was zero initialization) and there didn't seem like strong enough consensus to make the change.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12599:
URL: https://github.com/apache/arrow/pull/12599


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Finished :arrow_down:0.38% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824114834



##########
File path: docs/source/format/Columnar.rst
##########
@@ -522,19 +521,24 @@ The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
           |------------|-------------|-------------|-------------|-------------|
           | 1          | 2           | unspecified | 4           | unspecified |
 
-While a struct does not have physical storage for each of its semantic
-slots (i.e. each scalar C-like struct), an entire struct slot can be
-set to null via the validity bitmap. Any of the child field arrays can
-have null values according to their respective independent validity
-bitmaps. This implies that for a particular struct slot the validity
-bitmap for the struct array might indicate a null slot when one or
-more of its child arrays has a non-null value in their corresponding
-slot.  When reading the struct array the parent validity bitmap takes
-priority.  This is illustrated in the example above, the child arrays
-have valid entries for the null struct but are 'hidden' from the
-consumer by the parent array's validity bitmap.  However, when treated
-independently corresponding values of the children array will be
-non-null.
+Struct Validity
+~~~~~~~~~~~~~~~
+
+A struct array has its own validity bitmap in addition to each of its
+child arrays' own validity bitmap. This implies that, for a particular struct

Review comment:
       ```suggestion
   child arrays' validity bitmaps. This implies that, for a particular struct
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824398902



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       In Rust overflowing panics in debug and wraps in release (doesn't UB)
   
   ```rust
   #![allow(arithmetic_overflow)] // the linter does not even compile the below without this ^^
   
   fn main() {
       let a = i32::MAX + 1; // panics debug, wraps in release
   }
   ```
   
   but we usually use either saturating or wrapping as we do not want the query engine to panic if the user performs an operation that would panic in a non-null slot.
   
   fwiw `numpy` uses wrapping: `import numpy; a = numpy.array([2147483647], dtype=numpy.int32); print(a + 1) // [-2147483648]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824398902



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       In Rust overflowing panics (doesn't UB)
   
   ```rust
   #![allow(arithmetic_overflow)] // the linter does not even compile the below without this ^^
   
   fn main() {
       let a = i32::MAX + 1; // this panics
   }
   ```
   
   but we usually use either saturating or wrapping as we do not want the query engine to panic if the user performs an operation that would panic in a non-null slot.
   
   fwiw `numpy` uses wrapping: `import numpy; a = numpy.array([2147483647], dtype=numpy.int32); print(a + 1) // [-2147483648]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824398902



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       In Rust overflowing panics (doesn't UB)
   
   ```rust
   #![allow(arithmetic_overflow)]
   
   fn main() {
       let a = i32::MAX + 1; // this panics
   }
   ```
   
   but we usually use either saturating or wrapping as we do not want the query engine to panic if the user performs an operation that would panic in a non-null slot.
   
   fwiw `numpy` uses wrapping: `import numpy; a = numpy.array([2147483647], dtype=numpy.int32); print(a + 1) // [-2147483648]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#issuecomment-1067011157


   Benchmark runs are scheduled for baseline = 5cb5afc40547b4f75739e31ff8632c71a10d3084 and contender = 56b06bbe90e279dd4000736bd1e7d14d7b0558ca. 56b06bbe90e279dd4000736bd1e7d14d7b0558ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/506f0446281942c0958b68df2ef20e55...4a532973ae45491186c7ff050f531e4e/)
   [Finished :arrow_down:0.13% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/71f9fee4c7d14bb589bacb2e4a95f605...40b0b7b632a541b892f934048c0c05ec/)
   [Scheduled :warning: ursa-i9-9960x is offline.] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/2cffd91b5ca14a20b5df0d0bf5ee7ae9...20fd46e8541d4b83b48f4e73dea2728f/)
   [Finished :arrow_down:0.38% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/64116ccec5c34c50b0443a90f06214ec...56544810c45b42be9ce03228db4d66ab/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824961421



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       @jorgecarleitao Sorry, I'm still not entirely sure what initialized means.  Can you give an example of what it would mean for memory to be "initialized" but not "zero initialized"?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824094653



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       it could be worth mentioning that the value in a null slot should be still initialized, i.e. deferencing a pointer pointing to it should not result in undefined behavior.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824114517



##########
File path: docs/source/format/Columnar.rst
##########
@@ -522,19 +521,24 @@ The layout for ``[{'joe', 1}, {null, 2}, null, {'mark', 4}]`` would be: ::
           |------------|-------------|-------------|-------------|-------------|
           | 1          | 2           | unspecified | 4           | unspecified |
 
-While a struct does not have physical storage for each of its semantic
-slots (i.e. each scalar C-like struct), an entire struct slot can be
-set to null via the validity bitmap. Any of the child field arrays can
-have null values according to their respective independent validity
-bitmaps. This implies that for a particular struct slot the validity
-bitmap for the struct array might indicate a null slot when one or
-more of its child arrays has a non-null value in their corresponding
-slot.  When reading the struct array the parent validity bitmap takes
-priority.  This is illustrated in the example above, the child arrays
-have valid entries for the null struct but are 'hidden' from the
-consumer by the parent array's validity bitmap.  However, when treated
-independently corresponding values of the children array will be
-non-null.
+Struct Validity
+~~~~~~~~~~~~~~~
+
+A struct array has its own validity bitmap in addition to each of its

Review comment:
       ```suggestion
   A struct array has its own validity bitmap that is independent of
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12599: ARROW-15846: [Format] Clarify presence of struct validity bitmap

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12599:
URL: https://github.com/apache/arrow/pull/12599#discussion_r824114223



##########
File path: docs/source/format/Columnar.rst
##########
@@ -208,17 +208,19 @@ right-to-left: ::
               0  0  1  0  1  0  1  1
 
 Arrays having a 0 null count may choose to not allocate the validity
-bitmap. Implementations may choose to always allocate one anyway as a
-matter of convenience, but this should be noted when memory is being
-shared.
+bitmap; how this is represented depends on the implementation (for
+example, a C++ implementation may represent such an "absent" validity
+bitmap using a NULL pointer). Implementations may choose to always allocate
+a validity bitmap anyway as a matter of convenience. Consumers of Arrow
+arrays should be ready to handle those two possibilities.
 
-Nested type arrays except for union types have their own validity bitmap and
-null count regardless of the null count and valid bits of their child arrays.
+Nested type arrays (except for union types as noted above) have their own
+top-level validity bitmap and null count, regardless of the null count and
+valid bits of their child arrays.
 
-Array slots which are null are not required to have a particular
-value; any "masked" memory can have any value and need not be zeroed,
-though implementations frequently choose to zero memory for null
-values.
+Array slots which are null are not required to have a particular value;
+any "masked" memory can have any value and need not be zeroed, though

Review comment:
       can you clarify what you mean by UB here? Dereferencing a pointer should always be defined given other constraints but tools like ASAN might complain if you try to take actions on the memory location. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org