You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/03 14:03:57 UTC

[GitHub] [arrow-datafusion] jiacai2050 opened a new pull request #1914: add metadata to DFSchema, close #1806.

jiacai2050 opened a new pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1806.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r820655067



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,30 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Additional metadata in form of key value pairs
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {
+        Self::new_with_metadata(fields, HashMap::new())
+    }
+
+    /// Create a new `DFSchema`
+    pub fn new_with_metadata(

Review comment:
       As you can see in https://github.com/apache/arrow-datafusion/pull/1914/files#diff-c1ef69547042f0c07aa616c9d5d58cbe2a3c5720f7237c948c99012d6cc0024a
   
   Those optimizers will rewrite plan, if any optimizer forget to attach metadata to newly-created plan, then metadata is lost, it's very easy to miss that without compiler's help.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r819177420



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,29 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    metadata: HashMap<String, String>,

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r820246063



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,30 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Additional metadata in form of key value pairs
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {
+        Self::new_with_metadata(fields, HashMap::new())
+    }
+
+    /// Create a new `DFSchema`
+    pub fn new_with_metadata(

Review comment:
       Builder-style API is good itself, but is not suitable here I think. 
   
   As you can see, schema conversion between Array and datafusion is used in many places, the compiler will output an warning when `new` is marked as deprecated, but when adopt builder-style API, no warning will be made, and it would be very easy to lost metadata when doing nested conversion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1060611178


   > the compiler will output an warning when new is marked as deprecated, but when adopt builder-style API, 
   
   The compiler warning as a hint to developers not to forget the schema is a good point. 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1062412274


   @alamb I have rebase master, please rerun the CI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1061976591


   🤔  looks like somehow this PR now picked up a merge conflict 
   
   I have started the CI run


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1064130091


   Thanks again @jiacai2050  for sticking with this and @xudong963  for the review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1061355369


   @alamb  It seems CI need your approve to run.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1064002355


   @alamb I have fixed the issue, you can link with this PR  after you file a new issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1064683141


   @alamb @xudong963  Thanks for your reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1064079352


   Add https://github.com/apache/arrow-datafusion/issues/1977 and will add link in https://github.com/apache/arrow-datafusion/pull/1978


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r820224374



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,30 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Additional metadata in form of key value pairs
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {
+        Self::new_with_metadata(fields, HashMap::new())
+    }
+
+    /// Create a new `DFSchema`
+    pub fn new_with_metadata(

Review comment:
       What would you think about a more "builder-like" API here that perhaps could remain backwards compatible. 
   
   Something like:
   
   ```rust
   /// Sets the metadata on this DFSchema to `metadata`
   pub fn with_metadata(mut self, metadata: HashMap<String, String>) -> Self {
   ...
   }
   ```
   
   
   And then instead of code like
   
   ```rust
           Schema::new_with_metadata(
               df_schema.fields.iter().map(|f| f.field.clone()).collect(),
               df_schema.metadata.clone(),
           )
   ```
   
   It would look like
   ```rust
           Schema::new(
               df_schema.fields.iter().map(|f| f.field.clone()).collect(),
           )
           .with_metadata(df_schema.metadata.clone()),
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r820351594



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,30 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Additional metadata in form of key value pairs
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {
+        Self::new_with_metadata(fields, HashMap::new())
+    }
+
+    /// Create a new `DFSchema`
+    pub fn new_with_metadata(

Review comment:
       IMO, metadata in DFSchema is a new feature in the next release, if users want to use it, they'll check the API doc. So we don't need to depend on the compiler's warning.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1063984916


   @jiacai2050  sadly another clippy issue has cropped up
   
   I think it is related to serializing the schema in ballista (which actually doesn't preserve the metadata)
   
   Is suggest a fix like this (along with a ticket): 
   
   ```diff
   diff --git a/datafusion-proto/src/from_proto.rs b/datafusion-proto/src/from_proto.rs
   index d589ef388..2bf65d4db 100644
   --- a/datafusion-proto/src/from_proto.rs
   +++ b/datafusion-proto/src/from_proto.rs
   @@ -155,6 +155,7 @@ impl TryFrom<&protobuf::DfSchema> for DFSchema {
                .iter()
                .map(|c| c.try_into())
                .collect::<Result<Vec<DFField>, _>>()?;
   +        #[allow(deprecated)]
            Ok(DFSchema::new(fields)?)
        }
    }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1063984916


   @jiacai2050  sadly another clippy issue has cropped up
   
   I think it is related to serializing the schema in ballista (which actually doesn't preserve the metadata)
   
   Is suggest a fix like the following, and file a ticket for the fact that ballista serialization doesn't preserve metadata. 
   
   I can file if it you would like 
   
   ```diff
   diff --git a/datafusion-proto/src/from_proto.rs b/datafusion-proto/src/from_proto.rs
   index d589ef388..2bf65d4db 100644
   --- a/datafusion-proto/src/from_proto.rs
   +++ b/datafusion-proto/src/from_proto.rs
   @@ -155,6 +155,7 @@ impl TryFrom<&protobuf::DfSchema> for DFSchema {
                .iter()
                .map(|c| c.try_into())
                .collect::<Result<Vec<DFField>, _>>()?;
   +        #[allow(deprecated)]
            Ok(DFSchema::new(fields)?)
        }
    }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jiacai2050 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
jiacai2050 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r819176454



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,29 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {

Review comment:
       It would be a breaking change if we delete it since it's a public API.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r818693662



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,29 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    metadata: HashMap<String, String>,

Review comment:
       Nicer to add some annotations

##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,29 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {

Review comment:
       If we can directly delete the API?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#discussion_r820351594



##########
File path: datafusion-common/src/dfschema.rs
##########
@@ -36,16 +36,30 @@ pub type DFSchemaRef = Arc<DFSchema>;
 pub struct DFSchema {
     /// Fields
     fields: Vec<DFField>,
+    /// Additional metadata in form of key value pairs
+    metadata: HashMap<String, String>,
 }
 
 impl DFSchema {
     /// Creates an empty `DFSchema`
     pub fn empty() -> Self {
-        Self { fields: vec![] }
+        Self {
+            fields: vec![],
+            metadata: HashMap::new(),
+        }
     }
 
+    #[deprecated(since = "7.0.0", note = "please use `new_with_metadata` instead")]
     /// Create a new `DFSchema`
     pub fn new(fields: Vec<DFField>) -> Result<Self> {
+        Self::new_with_metadata(fields, HashMap::new())
+    }
+
+    /// Create a new `DFSchema`
+    pub fn new_with_metadata(

Review comment:
       IMO, `metadata` in `DFSchema` is a new feature in the next release, if users want to use it, they'll check the API doc. So we don't need to depend on the compiler's warning.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1914: add metadata to DFSchema, close #1806.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1914:
URL: https://github.com/apache/arrow-datafusion/pull/1914#issuecomment-1061152622


   @jiacai2050  this PR seems to have build failures -- can you update it and I'll review it again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org