You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/03/30 12:19:12 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #3981: Add UnionFields (#3955)

tustvold opened a new pull request, #3981:
URL: https://github.com/apache/arrow-rs/pull/3981

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Part of #3955
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   * Reduces cloning of Field #3956 
   * Adds an abstraction to allow improving ergonomics of interacting with Union datatypes
   * Reduces the size of DataType from 56 bytes to 24 bytes
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #3981: Add UnionFields (#3955)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981#discussion_r1153229128


##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {
+    /// Create a new [`UnionFields`] with no fields
+    pub fn empty() -> Self {
+        Self(Arc::from([]))
+    }
+
+    /// Create a new [`UnionFields`] from a [`Fields`] and array of type_ids
+    pub fn new<F, T>(type_ids: T, fields: F) -> Self

Review Comment:
   Should this be fallible (to prevent repeated type ids, for example)?



##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {
+    /// Create a new [`UnionFields`] with no fields
+    pub fn empty() -> Self {
+        Self(Arc::from([]))
+    }
+
+    /// Create a new [`UnionFields`] from a [`Fields`] and array of type_ids
+    pub fn new<F, T>(type_ids: T, fields: F) -> Self
+    where
+        F: IntoIterator,
+        F::Item: Into<FieldRef>,
+        T: IntoIterator<Item = i8>,
+    {
+        let fields = fields.into_iter().map(Into::into);
+        type_ids.into_iter().zip(fields).collect()
+    }
+
+    /// Return size of this instance in bytes.
+    pub fn size(&self) -> usize {
+        self.iter()
+            .map(|(_, field)| field.size() + std::mem::size_of::<(i8, FieldRef)>())
+            .sum()
+    }
+
+    /// Returns the number of fields in this [`UnionFields`]
+    pub fn len(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Returns `true` if this is empty
+    pub fn is_empty(&self) -> bool {
+        self.0.is_empty()
+    }
+
+    /// Returns an iterator over the fields and type ids in this [`UnionFields`]
+    ///
+    /// Note: the iteration order is not guaranteed
+    pub fn iter(&self) -> impl Iterator<Item = (i8, &FieldRef)> + '_ {
+        self.0.iter().map(|(id, f)| (*id, f))
+    }
+
+    pub(crate) fn try_merge(&mut self, other: &Self) -> Result<(), ArrowError> {

Review Comment:
   Can you please add some docs about what this does?



##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {

Review Comment:
   Can we please add documentation that explains:
   1. what type_ids are (a link to the spec / docs)
   2. has an example of how to construct these things



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3981: Add UnionFields (#3955)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981#discussion_r1153242078


##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {
+    /// Create a new [`UnionFields`] with no fields
+    pub fn empty() -> Self {
+        Self(Arc::from([]))
+    }
+
+    /// Create a new [`UnionFields`] from a [`Fields`] and array of type_ids
+    pub fn new<F, T>(type_ids: T, fields: F) -> Self

Review Comment:
   I honestly wasn't sure about this, the existing code doesn't handle this consistently. I'll file a ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3981: Add UnionFields (#3955)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981#discussion_r1153174556


##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {

Review Comment:
   I purposefully kept the size of this interface down as I don't feel we have very mature support for UnionArrays and therefore I didn't want to paint ourselves into a corner too much w.r.t what this abstraction looked like



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3981: Add UnionFields (#3955)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981#discussion_r1153242078


##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {
+    /// Create a new [`UnionFields`] with no fields
+    pub fn empty() -> Self {
+        Self(Arc::from([]))
+    }
+
+    /// Create a new [`UnionFields`] from a [`Fields`] and array of type_ids
+    pub fn new<F, T>(type_ids: T, fields: F) -> Self

Review Comment:
   I honestly wasn't sure about this, the existing code doesn't handle this consistently. I'll file a ticket
   
   Edit: https://github.com/apache/arrow-rs/issues/3982



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #3981: Add UnionFields (#3955)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3981: Add UnionFields (#3955)

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3981:
URL: https://github.com/apache/arrow-rs/pull/3981#discussion_r1153255935


##########
arrow-schema/src/fields.rs:
##########
@@ -148,3 +150,91 @@ impl<'de> serde::Deserialize<'de> for Fields {
         Ok(Vec::<Field>::deserialize(deserializer)?.into())
     }
 }
+
+/// A cheaply cloneable, owned collection of [`FieldRef`] and their corresponding type ids
+#[derive(Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
+#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
+#[cfg_attr(feature = "serde", serde(transparent))]
+pub struct UnionFields(Arc<[(i8, FieldRef)]>);
+
+impl std::fmt::Debug for UnionFields {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        self.0.as_ref().fmt(f)
+    }
+}
+
+impl UnionFields {
+    /// Create a new [`UnionFields`] with no fields
+    pub fn empty() -> Self {
+        Self(Arc::from([]))
+    }
+
+    /// Create a new [`UnionFields`] from a [`Fields`] and array of type_ids
+    pub fn new<F, T>(type_ids: T, fields: F) -> Self
+    where
+        F: IntoIterator,
+        F::Item: Into<FieldRef>,
+        T: IntoIterator<Item = i8>,
+    {
+        let fields = fields.into_iter().map(Into::into);
+        type_ids.into_iter().zip(fields).collect()
+    }
+
+    /// Return size of this instance in bytes.
+    pub fn size(&self) -> usize {
+        self.iter()
+            .map(|(_, field)| field.size() + std::mem::size_of::<(i8, FieldRef)>())
+            .sum()
+    }
+
+    /// Returns the number of fields in this [`UnionFields`]
+    pub fn len(&self) -> usize {
+        self.0.len()
+    }
+
+    /// Returns `true` if this is empty
+    pub fn is_empty(&self) -> bool {
+        self.0.is_empty()
+    }
+
+    /// Returns an iterator over the fields and type ids in this [`UnionFields`]
+    ///
+    /// Note: the iteration order is not guaranteed
+    pub fn iter(&self) -> impl Iterator<Item = (i8, &FieldRef)> + '_ {
+        self.0.iter().map(|(id, f)| (*id, f))
+    }
+
+    pub(crate) fn try_merge(&mut self, other: &Self) -> Result<(), ArrowError> {

Review Comment:
   Sorry, should have called out in review, just moved some code here. Added docs in https://github.com/apache/arrow-rs/pull/3981/commits/07af4ffe9f599e58c0f937bb3e9b6a2844dbec33



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org