Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/24 16:52:35 UTC

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1444: Add Full UnionArray validation

tustvold commented on a change in pull request #1444:
URL: https://github.com/apache/arrow-rs/pull/1444#discussion_r834509184



##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
                 let child = &self.child_data[0];
                 self.validate_offsets_full::<i64>(child.len + child.offset)?;
             }
-            DataType::Union(_, _) => {
-                // Validate Union Array as part of implementing new Union semantics
-                // See comments in `ArrayData::validate()`
-                // https://github.com/apache/arrow-rs/issues/85
-                //
-                // TODO file follow on ticket for full union validation
+            DataType::Union(_fields, mode) => {
+                match mode {
+                    UnionMode::Sparse => {
+                        // typeids should all be valid
+                        self.validate_offsets_full::<i8>(self.child_data.len())?;

Review comment:
       I don't think this is correct. Despite what the method name suggests, `validate_offsets_full` is designed for validating list offsets, and therefore also checks for monotonicity, which type ids are not required to satisfy.
   
   As an aside, the spec doesn't seem very clear about whether an offset can be repeated...
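
   A minimal sketch of a type id check that only tests range, with no ordering requirement. It works on plain slices rather than `ArrayData`; the helper name `validate_type_ids` is hypothetical, and it assumes each type id is used directly as an index into the child arrays:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that every type id
/// selects an existing child. Assumes a type id is a direct child index.
/// Unlike a list-offset check, no monotonicity is required.
fn validate_type_ids(type_ids: &[i8], num_children: usize) -> Result<(), String> {
    for (i, &type_id) in type_ids.iter().enumerate() {
        // Negative ids and ids past the last child are both invalid.
        if type_id < 0 || type_id as usize >= num_children {
            return Err(format!(
                "type id {} at position {} is outside 0..{}",
                type_id, i, num_children
            ));
        }
    }
    Ok(())
}
```

   With two children, `[1, 0, 1]` passes even though it is not sorted, while `[0, 2]` is rejected.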

##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
                 let child = &self.child_data[0];
                 self.validate_offsets_full::<i64>(child.len + child.offset)?;
             }
-            DataType::Union(_, _) => {
-                // Validate Union Array as part of implementing new Union semantics
-                // See comments in `ArrayData::validate()`
-                // https://github.com/apache/arrow-rs/issues/85
-                //
-                // TODO file follow on ticket for full union validation
+            DataType::Union(_fields, mode) => {

Review comment:
       Unless I'm missing something, we should probably also add buffer length checks to `ArrayData::validate`, as I don't think these are currently present anywhere.
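
   Roughly what such a check might look like, as a sketch over raw byte lengths rather than the real `ArrayData` buffers. The helper name and signature are hypothetical; it assumes `i8` type ids and, for the dense case, `i32` offsets:

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of arrow-rs): checks that the union's
/// buffers are large enough to hold `offset + len` slots.
/// `offsets_bytes` is `Some(..)` for a dense union, `None` for sparse.
fn validate_union_buffer_lens(
    len: usize,
    offset: usize,
    type_ids_bytes: usize,
    offsets_bytes: Option<usize>,
) -> Result<(), String> {
    let slots = len
        .checked_add(offset)
        .ok_or_else(|| "len + offset overflows usize".to_string())?;

    // One i8 type id per slot.
    let need_type_ids = slots * size_of::<i8>();
    if type_ids_bytes < need_type_ids {
        return Err(format!(
            "type ids buffer has {} bytes, need at least {}",
            type_ids_bytes, need_type_ids
        ));
    }

    // Dense unions additionally carry one i32 offset per slot.
    if let Some(actual) = offsets_bytes {
        let need_offsets = slots
            .checked_mul(size_of::<i32>())
            .ok_or_else(|| "offsets buffer size overflows usize".to_string())?;
        if actual < need_offsets {
            return Err(format!(
                "offsets buffer has {} bytes, need at least {}",
                actual, need_offsets
            ));
        }
    }
    Ok(())
}
```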

##########
File path: arrow/src/array/data.rs
##########
@@ -1117,6 +1119,44 @@ impl ArrayData {
         )
     }
 
+    /// Ensures that for each union element, the offset is correct for
+    /// the corresponding child array
+    fn validate_dense_union_full(&self) -> Result<()> {

Review comment:
       I think this should also check that the offsets are monotonic for a given array type, but that could definitely be left as a TODO.
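
   A sketch of the per-child monotonicity check, again on plain slices. The helper name is hypothetical, type ids are assumed to be direct child indices, and repeated offsets are allowed here (non-decreasing), which is an assumption rather than something taken from the spec:

```rust
/// Hypothetical helper (not part of arrow-rs): checks that, for each child,
/// the offsets referring to it never decrease in slot order.
/// Assumption: repeated offsets are accepted.
fn validate_dense_offsets_monotonic(
    type_ids: &[i8],
    offsets: &[i32],
    num_children: usize,
) -> Result<(), String> {
    if type_ids.len() != offsets.len() {
        return Err("type ids and offsets differ in length".to_string());
    }
    // Last offset seen so far for each child, if any.
    let mut last_seen: Vec<Option<i32>> = vec![None; num_children];
    for (i, (&type_id, &offset)) in type_ids.iter().zip(offsets.iter()).enumerate() {
        if type_id < 0 || type_id as usize >= num_children {
            return Err(format!("type id {} at position {} is invalid", type_id, i));
        }
        let child = type_id as usize;
        if let Some(prev) = last_seen[child] {
            if offset < prev {
                return Err(format!(
                    "offset {} at position {} goes backwards for child {} (previous {})",
                    offset, i, child, prev
                ));
            }
        }
        last_seen[child] = Some(offset);
    }
    Ok(())
}
```

   The offsets are tracked per child, so an interleaved sequence such as type ids `[0, 1, 0]` with offsets `[0, 5, 1]` is accepted.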

##########
File path: arrow/src/array/data.rs
##########
@@ -1117,6 +1119,44 @@ impl ArrayData {
         )
     }
 
+    /// Ensures that for each union element, the offset is correct for
+    /// the corresponding child array
+    fn validate_dense_union_full(&self) -> Result<()> {
+        // safety justification is that the size of the buffers was validated in self.validate()

Review comment:
       We could potentially make `validate` also check that all child arrays have the length of the parent in the case of a sparse representation.
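
   A length check of this kind could be sketched as below. Per the Arrow columnar format, the equal-length requirement applies to the sparse layout, while the children of a dense union may have different lengths, so the sketch only enforces equality for sparse. The `Mode` enum and helper are hypothetical stand-ins defined locally, and slicing (a non-zero parent offset) is ignored for simplicity:

```rust
/// Local stand-in for the union mode; not the arrow-rs type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Sparse,
    Dense,
}

/// Hypothetical helper (not part of arrow-rs): for a sparse union, every
/// child must be exactly as long as the parent. Dense children are not
/// constrained here because their lengths are independent of the parent.
fn validate_child_lens(mode: Mode, parent_len: usize, child_lens: &[usize]) -> Result<(), String> {
    if mode == Mode::Dense {
        return Ok(());
    }
    for (i, &child_len) in child_lens.iter().enumerate() {
        if child_len != parent_len {
            return Err(format!(
                "sparse union child {} has length {}, expected {}",
                i, child_len, parent_len
            ));
        }
    }
    Ok(())
}
```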

##########
File path: arrow/src/array/data.rs
##########
@@ -957,12 +955,16 @@ impl ArrayData {
                 let child = &self.child_data[0];
                 self.validate_offsets_full::<i64>(child.len + child.offset)?;
             }
-            DataType::Union(_, _) => {
-                // Validate Union Array as part of implementing new Union semantics
-                // See comments in `ArrayData::validate()`
-                // https://github.com/apache/arrow-rs/issues/85
-                //
-                // TODO file follow on ticket for full union validation
+            DataType::Union(_fields, mode) => {
+                match mode {
+                    UnionMode::Sparse => {
+                        // typeids should all be valid
+                        self.validate_offsets_full::<i8>(self.child_data.len())?;
+                    }
+                    UnionMode::Dense => {
+                        self.validate_dense_union_full()?;

Review comment:
       I was going to suggest that this should validate that the null bitmasks are disjoint, but this may not even be a requirement - the specification says "All “unselected” values are ignored and could be any semantically correct array value."



