Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/25 23:27:28 UTC

[GitHub] [arrow-rs] wjones127 opened a new pull request, #3944: feat: enable metadata import/export through C data interface

wjones127 opened a new pull request, #3944:
URL: https://github.com/apache/arrow-rs/pull/3944

   # Which issue does this PR close?
   
   Closes #478.
   
   # Rationale for this change
    
   Metadata is used in extension types to communicate additional information. For example, the [geoarrow specification](https://github.com/geoarrow/geoarrow/blob/main/extension-types.md) uses field metadata to communicate extension types and coordinate systems for geospatial arrays. 
   
   # What changes are included in this PR?
   
   Adds metadata round tripping for fields and schemas.
   
   # Are there any user-facing changes?
   
   Fields and schemas brought through the C data interface will now have their metadata preserved.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] wjones127 commented on a diff in pull request #3944: feat: enable metadata import/export through C data interface

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#discussion_r1149951475


##########
arrow-schema/src/ffi.rs:
##########
@@ -152,6 +157,46 @@ impl FFI_ArrowSchema {
         Ok(self)
     }
 
+    pub fn with_metadata(
+        mut self,
+        metadata: Option<&HashMap<String, String>>,
+    ) -> Result<Self, ArrowError> {
+        let new_metadata = if let Some(metadata) = metadata {
+            if !metadata.is_empty() {
+                let mut metadata_serialized: Vec<u8> = Vec::new();
+                metadata_serialized.extend((metadata.len() as i32).to_ne_bytes());
+
+                for (key, value) in metadata.iter() {
+                    let key_len = key.len() as i32;
+                    let value_len = value.len() as i32;

Review Comment:
   I can't imagine this will ever come up as an issue. I'm also inferring, from the fact that there's no conversion from the `TryFrom` associated error to `ArrowError`, that this isn't commonly handled elsewhere in the codebase.





[GitHub] [arrow-rs] tustvold commented on pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#issuecomment-1484225823

   I plan to review this carefully tomorrow




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#discussion_r1149122777


##########
arrow-schema/src/ffi.rs:
##########
@@ -152,6 +157,46 @@ impl FFI_ArrowSchema {
         Ok(self)
     }
 
+    pub fn with_metadata(
+        mut self,
+        metadata: Option<&HashMap<String, String>>,

Review Comment:
   ```suggestion
           metadata: &HashMap<String, String>,
   ```
   
   The `Option` seems a touch redundant. You could also consider parameterising this to be something like
   
   ```
   pub fn with_metadata<I, S>(mut self, metadata: I) -> Result<Self, ArrowError> where I: IntoIterator<Item=(S, S)>, S: AsRef<str>
   ```
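
A minimal sketch of how the parameterised signature suggested above might look. It is written as a free function rather than a method on `FFI_ArrowSchema`, and the name `serialize_metadata` and the `String` error type are assumptions for illustration (the PR itself returns `ArrowError`). The byte layout follows the Arrow C data interface description of `ArrowSchema.metadata`: an int32 pair count, then an int32 length followed by the bytes for each key and each value, all in native endianness.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-alone encoder; not the code merged in the PR.
/// Accepts anything that yields `(key, value)` pairs of string-like items.
fn serialize_metadata<I, S>(metadata: I) -> Result<Vec<u8>, String>
where
    I: IntoIterator<Item = (S, S)>,
    S: AsRef<str>,
{
    // Reserve four bytes for the pair count; it is patched in after the loop
    // because a generic iterator does not report its length up front.
    let mut out = vec![0u8; 4];
    let mut count: i32 = 0;

    for (key, value) in metadata {
        for s in [key.as_ref(), value.as_ref()] {
            // Fail instead of wrapping if a string is longer than i32::MAX bytes.
            let len = i32::try_from(s.len())
                .map_err(|_| format!("length {} does not fit in an i32", s.len()))?;
            out.extend_from_slice(&len.to_ne_bytes());
            out.extend_from_slice(s.as_bytes());
        }
        count = count
            .checked_add(1)
            .ok_or_else(|| "too many metadata pairs".to_string())?;
    }

    out[..4].copy_from_slice(&count.to_ne_bytes());
    Ok(out)
}
```

With these bounds the same function accepts a borrowed `&HashMap<String, String>` (items are `(&String, &String)`) as well as an array such as `[("k", "vv")]`. Unlike the PR code, which handles an empty map separately, this sketch simply encodes a count of zero.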





[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#discussion_r1150308482


##########
arrow-schema/src/ffi.rs:
##########
@@ -152,6 +157,46 @@ impl FFI_ArrowSchema {
         Ok(self)
     }
 
+    pub fn with_metadata(
+        mut self,
+        metadata: Option<&HashMap<String, String>>,
+    ) -> Result<Self, ArrowError> {
+        let new_metadata = if let Some(metadata) = metadata {
+            if !metadata.is_empty() {
+                let mut metadata_serialized: Vec<u8> = Vec::new();
+                metadata_serialized.extend((metadata.len() as i32).to_ne_bytes());
+
+                for (key, value) in metadata.iter() {
+                    let key_len = key.len() as i32;
+                    let value_len = value.len() as i32;

Review Comment:
   The codebase is full of overflow handling, take a look at the builders or take kernels or Array validation.
   
   I'm keen that we are defensive here, because the implication of incorrect data is that we or the downstream reader go off reading arbitrary memory.
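
To make the overflow concern concrete, here is a small sketch contrasting the `as i32` cast in the diff with a checked conversion. The helper names `wrapping_len` and `checked_len` are hypothetical, and `String` stands in for `ArrowError`; how the real code maps the error is not shown in this thread.

```rust
use std::convert::TryFrom;

/// What `len as i32` does: values above i32::MAX are truncated to the low
/// 32 bits, so a large length can silently become a negative number.
fn wrapping_len(len: usize) -> i32 {
    len as i32
}

/// A defensive alternative: report an error when the length does not fit.
fn checked_len(len: usize) -> Result<i32, String> {
    i32::try_from(len).map_err(|_| format!("metadata length {} does not fit in an i32", len))
}
```

A reader that trusted a wrapped length would compute the wrong offset for the next field, which is how a bad length turns into a read of unrelated memory.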





[GitHub] [arrow-rs] wjones127 commented on a diff in pull request #3944: feat: enable metadata import/export through C data interface

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#discussion_r1149944758


##########
arrow-schema/src/ffi.rs:
##########
@@ -212,6 +257,55 @@ impl FFI_ArrowSchema {
     pub fn dictionary_ordered(&self) -> bool {
         self.flags & 0b00000001 != 0
     }
+
+    pub fn metadata(&self) -> Result<HashMap<String, String>, ArrowError> {
+        if self.metadata.is_null() {
+            Ok(HashMap::new())
+        } else {
+            let mut pos = 0;
+            let buffer: *const u8 = self.metadata as *const u8;

Review Comment:
   The one thing I can think of is that we can verify the lengths are all >= 0. That doesn't guarantee we will detect a buffer overflow, but it will let us detect it some of the time :shrug:
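
A sketch of the sign check described above, under one important assumption: it decodes from a `&[u8]` slice whose length is known, which also allows a bounds check. The real `FFI_ArrowSchema::metadata` only has a raw pointer with no length, so there the sign check is the only one of these available. The function names and the `String` error type are hypothetical.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Take `len` bytes starting at `*pos`, failing if that would run past `buf`.
fn read_bytes<'a>(buf: &'a [u8], pos: &mut usize, len: usize) -> Result<&'a [u8], String> {
    let end = pos.checked_add(len).ok_or("position overflow")?;
    let bytes = buf.get(*pos..end).ok_or("declared length runs past the buffer")?;
    *pos = end;
    Ok(bytes)
}

/// Read a native-endian int32 length and reject negative values.
fn read_len(buf: &[u8], pos: &mut usize) -> Result<usize, String> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(read_bytes(buf, pos, 4)?);
    let len = i32::from_ne_bytes(raw);
    usize::try_from(len).map_err(|_| format!("negative length {}", len))
}

/// Read one length-prefixed UTF-8 string.
fn read_string(buf: &[u8], pos: &mut usize) -> Result<String, String> {
    let len = read_len(buf, pos)?;
    let bytes = read_bytes(buf, pos, len)?;
    String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
}

/// Decode the pair count followed by length-prefixed keys and values.
fn decode_metadata(buf: &[u8]) -> Result<HashMap<String, String>, String> {
    let mut pos = 0;
    let pairs = read_len(buf, &mut pos)?;
    let mut map = HashMap::new();
    for _ in 0..pairs {
        let key = read_string(buf, &mut pos)?;
        let value = read_string(buf, &mut pos)?;
        map.insert(key, value);
    }
    Ok(map)
}
```

A negative count or length is rejected before it is used as an offset. A positive but wrong length is only caught here because the slice carries its own length.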







[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#discussion_r1149124061


##########
arrow-schema/src/ffi.rs:
##########
@@ -152,6 +157,46 @@ impl FFI_ArrowSchema {
         Ok(self)
     }
 
+    pub fn with_metadata(
+        mut self,
+        metadata: Option<&HashMap<String, String>>,
+    ) -> Result<Self, ArrowError> {
+        let new_metadata = if let Some(metadata) = metadata {
+            if !metadata.is_empty() {
+                let mut metadata_serialized: Vec<u8> = Vec::new();
+                metadata_serialized.extend((metadata.len() as i32).to_ne_bytes());
+
+                for (key, value) in metadata.iter() {
+                    let key_len = key.len() as i32;
+                    let value_len = value.len() as i32;

Review Comment:
   Is it worth checking this doesn't overflow?



##########
arrow-schema/src/ffi.rs:
##########
@@ -152,6 +157,46 @@ impl FFI_ArrowSchema {
         Ok(self)
     }
 
+    pub fn with_metadata(
+        mut self,
+        metadata: Option<&HashMap<String, String>>,
+    ) -> Result<Self, ArrowError> {
+        let new_metadata = if let Some(metadata) = metadata {

Review Comment:
   ```suggestion
           // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
           let new_metadata = if let Some(metadata) = metadata {
   ```



##########
arrow-schema/src/ffi.rs:
##########
@@ -212,6 +257,55 @@ impl FFI_ArrowSchema {
     pub fn dictionary_ordered(&self) -> bool {
         self.flags & 0b00000001 != 0
     }
+
+    pub fn metadata(&self) -> Result<HashMap<String, String>, ArrowError> {
+        if self.metadata.is_null() {
+            Ok(HashMap::new())
+        } else {
+            let mut pos = 0;
+            let buffer: *const u8 = self.metadata as *const u8;

Review Comment:
   Possibly not much we can do about this, but it occurs to me that this parsing code can very easily read into arbitrary memory





[GitHub] [arrow-rs] tustvold commented on pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944#issuecomment-1487979734

   Thank you




[GitHub] [arrow-rs] tustvold merged pull request #3944: feat: enable metadata import/export through C data interface

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #3944:
URL: https://github.com/apache/arrow-rs/pull/3944

