You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jonahgao (via GitHub)" <gi...@apache.org> on 2023/06/19 15:30:00 UTC

[GitHub] [arrow-datafusion] jonahgao opened a new pull request, #6722: Fix inserting into a table with non-nullable columns

jonahgao opened a new pull request, #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722

   # Which issue does this PR close?
   Fix the following issue:
   ```shell
   DataFusion CLI v26.0.0
   ❯ create table t1(a int not null);
   0 rows in set. Query took 0.028 seconds.
   
   ❯ insert into t1 values(1);
   Error during planning: Inserting query must have the same schema with the table.
   ```
   This issue is because the nullable information of the two schemas is different.
   input schema: Schema { fields: [Field { name: "a", data_type: Int32, **nullable: true**, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
   table schema: Schema { fields: [Field { name: "a", data_type: Int32, **nullable: false**, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
   
   
   
   # Rationale for this change
   Declaring the schema as nullable doesn't necessarily mean that the actual data is always NULL.
   When data is NOT NULL, it should be allowed to be inserted into the table.
   
   # What changes are included in this PR?
   
   1. Ignore the nullable information when comparing schemas.
   2. Checking NOT NULL constraint during execution.
   
   
   # Are these changes tested?
   YES
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   New parameter is added to `InsertExec::new()` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1235720460


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   we have the `nullify` internal function https://github.com/apache/arrow-datafusion/blob/d32f9a7425e6269bcc0b928a81f6d46871389bdf/datafusion/expr/src/logical_plan/builder.rs#L1042
   
   and to compare 2 schemas ignoring nullability we using https://github.com/apache/arrow-rs/blob/15e0e76bfb6500799a43991f1339e69464c513f8/arrow-schema/src/fields.rs#L76 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1609745250

   > > Thanks. Functionally wise it works as expected, I still think we can reuse `Schema::equivalent_names_and_types` and not creating a separate method. So it can be replaced something like
   > > ```
   > > fn foo(schema: DFSchema, other: DFSchema) -> Result<bool> {
   > >     SchemaRef::from(schema).equivalent_names_and_types(SchemaRef::from(other))
   > > }
   > > ```
   > 
   > Perhaps the best way is to have upstream support for this similar functionality.
   
   ideally yes. but I think upstream will be resisting, saying `.contains` method is enough. I'm ok with this PR with minor thing about code duplication. we will have to maintenance this trait and method and have test coverage for it, which currently is missing. @alamb I'll leave it to you
   
   Basically this wrapper should do the trick
   ```
   pub trait SchemaExt {
       /// This is a specialized version of Eq that ignores differences
       /// in nullability and metadata.
       ///
       /// It works the same as [`DFSchema::equivalent_names_and_types`].
       fn equivalent_names_and_types(&self, other: &Self) -> bool;
   }
   impl SchemaExt for Schema {
       fn equivalent_names_and_types(&self, other: &Self) -> bool {
           DFSchema::from(&self).equivalent_names_and_types(&DFSchema::from(other))
       }
   }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1240838739


##########
datafusion/common/src/dfschema.rs:
##########
@@ -729,6 +729,34 @@ impl From<Field> for DFField {
     }
 }
 
+/// DataFusion-specific extensions to [`Schema`].
+pub trait SchemaExt {
+    /// This is a specialized version of Eq that ignores differences
+    /// in nullability and metadata.
+    ///
+    /// It works the same as [`DFSchema::equivalent_names_and_types`].
+    fn equivalent_names_and_types(&self, other: &Self) -> bool;
+}
+
+impl SchemaExt for Schema {
+    fn equivalent_names_and_types(&self, other: &Self) -> bool {
+        if self.fields().len() != other.fields().len() {
+            return false;
+        }
+
+        self.fields()
+            .iter()
+            .zip(other.fields().iter())
+            .all(|(f1, f2)| {
+                f1.name() == f2.name()
+                    && DFSchema::datatype_is_semantically_equal(

Review Comment:
   I think this might be more convenient to use 🤔 .
   All the test cases from DFSchema have been reused for testing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1242679164


##########
datafusion/common/src/dfschema.rs:
##########
@@ -1133,13 +1178,25 @@ mod tests {
                 let schema2 = to_df_schema(self.fields2);
                 assert_eq!(
                     schema1.equivalent_names_and_types(&schema2),
-                    self.expected,
+                    self.expected_dfschema,
                     "Comparison did not match expected: {}\n\n\
                      schema1:\n\n{:#?}\n\nschema2:\n\n{:#?}",
-                    self.expected,
+                    self.expected_dfschema,
                     schema1,
                     schema2
                 );
+
+                let arrow_schema1 = Schema::from(schema1);

Review Comment:
   👍 
   
   Looks good -- thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1605564566

   > Should we update the TableProvider schema, if we insert null items in non-nullable columns?
   @metesynnada  This insert will fail because it violates the not null constraint.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1236724681


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   I think using `contains()` here may not be ideal :thinking: .
   A non-nullable field does not `contains` a nullable field.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1610183346

   > ideally yes. but I think upstream will be resisting, saying .contains method is enough.
   
   FWIW the upstream request is here: https://github.com/apache/arrow-rs/issues/3199
   
   > Basically this wrapper should do the trick
   
   I tried that and sadly it gave me this error:
   
   ```
   cd /Users/alamb/Software/arrow-datafusion2/ && RUST_BACKTRACE=1 CARGO_TARGET_DIR=/Users/alamb/Software/target-df2  cargo test --test sqllogictests 
      Compiling datafusion-common v27.0.0 (/Users/alamb/Software/arrow-datafusion2/datafusion/common)
   error[E0308]: mismatched types
      --> datafusion/common/src/dfschema.rs:743:24
       |
   743 |         DFSchema::from(self).equivalent_names_and_types(&DFSchema::from(other))
       |         -------------- ^^^^ expected `DFSchema`, found `&Schema`
       |         |
       |         arguments to this function are incorrect
       |
   note: associated function defined here
      --> /Users/alamb/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/convert/mod.rs:548:8
       |
   548 |     fn from(value: T) -> Self;
       |        ^^^^
   
   error[E0308]: mismatched types
      --> datafusion/common/src/dfschema.rs:743:73
       |
   743 |         DFSchema::from(self).equivalent_names_and_types(&DFSchema::from(other))
       |                                                          -------------- ^^^^^ expected `DFSchema`, found `&Schema`
       |                                                          |
       |                                                          arguments to this function are incorrect
       |
   note: associated function defined here
      --> /Users/alamb/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/convert/mod.rs:548:8
       |
   548 |     fn from(value: T) -> Self;
       |        ^^^^
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1244321268


##########
datafusion/common/src/dfschema.rs:
##########
@@ -729,6 +729,34 @@ impl From<Field> for DFField {
     }
 }
 
+/// DataFusion-specific extensions to [`Schema`].
+pub trait SchemaExt {
+    /// This is a specialized version of Eq that ignores differences
+    /// in nullability and metadata.
+    ///
+    /// It works the same as [`DFSchema::equivalent_names_and_types`].
+    fn equivalent_names_and_types(&self, other: &Self) -> bool;
+}
+
+impl SchemaExt for Schema {
+    fn equivalent_names_and_types(&self, other: &Self) -> bool {
+        if self.fields().len() != other.fields().len() {
+            return false;
+        }
+
+        self.fields()
+            .iter()
+            .zip(other.fields().iter())
+            .all(|(f1, f2)| {
+                f1.name() == f2.name()
+                    && DFSchema::datatype_is_semantically_equal(

Review Comment:
   I think the problem is that `DFSchema` isn't a wrapper on top of `Schema` (so you can't get a `&Schema` from a `&DFSchema`)
   
   @jonahgao  has updated the tests to cover this new code, so I think we are good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1235660278


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   I could have sworn this function already existed (basically to compare two schemas ignoring nullabulity) but I could not find it
   
   This is the closest I found: https://docs.rs/datafusion-common/26.0.0/datafusion_common/struct.DFSchema.html#method.equivalent_names_and_types but it is not quite the same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1235722934


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   UPD: https://github.com/apache/arrow-rs/blob/15e0e76bfb6500799a43991f1339e69464c513f8/arrow-schema/src/schema.rs#L323



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1610184764

   I think there is a lot of improvement that could be made to the DFSchema / Schema stuff but for now this looks good to me. Thanks agian @jonahgao  and @comphead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1608798397

   > Thanks. Functionally wise it works as expected, I still think we can reuse `Schema::equivalent_names_and_types` and not creating a separate method. So it can be replaced something like
   > 
   > ```
   > fn foo(schema: DFSchema, other: DFSchema) -> Result<bool> {
   >     SchemaRef::from(schema).equivalent_names_and_types(SchemaRef::from(other))
   > }
   > ```
   
   Perhaps the best way is to have upstream support for this similar functionality.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1608126071

   Any concerns merging this @comphead ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1238312184


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   @comphead I have tried two possible ways of using it and both have failed.
   1. Check `table_schema.contains(input_schema)`
   The failed query is :
     ```sql
     DataFusion CLI v26.0.0
     ❯ create table t(a int not null);
     0 rows in set. Query took 0.027 seconds.
     
     ❯ insert into t values(1);
     Error during planning: Inserting query must have the same schema with the table.
     ```
   
   2. Check `input_schema.contains(table_schema)`
   The failed query is :
   ```sql
   DataFusion CLI v26.0.0
   ❯ create table t2(a int null);
   0 rows in set. Query took 0.029 seconds.
   ❯ create table t3(a int not null);
   0 rows in set. Query took 0.002 seconds.
   
   ❯ insert into t2 select * from t3;
   Error during planning: Inserting query must have the same schema with the table.
   ```
   
   For an insertion operation, I think there is no need to have one schema be a superset of the other. 
   Instead, we should make sure that there is an intersection between the data sets defined by the two schemas.
   
   I've implemented the `equivalent_names_and_types` method for `Schema` in file `datafusion_common/dfschema.rs`. @alamb @comphead 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1238707176


##########
datafusion/core/src/physical_plan/insert.rs:
##########
@@ -219,3 +237,25 @@ fn make_count_schema() -> SchemaRef {
         false,
     )]))
 }
+
+fn check_batch(batch: RecordBatch, schema: &SchemaRef) -> Result<RecordBatch> {

Review Comment:
   Im not really sure about this method. That might be a performance hit if we scan data.
   I'd rather check metadata of incoming schema and expected schema like Spark does.



##########
datafusion/common/src/dfschema.rs:
##########
@@ -729,6 +729,34 @@ impl From<Field> for DFField {
     }
 }
 
+/// DataFusion-specific extensions to [`Schema`].
+pub trait SchemaExt {
+    /// This is a specialized version of Eq that ignores differences
+    /// in nullability and metadata.
+    ///
+    /// It works the same as [`DFSchema::equivalent_names_and_types`].
+    fn equivalent_names_and_types(&self, other: &Self) -> bool;
+}
+
+impl SchemaExt for Schema {
+    fn equivalent_names_and_types(&self, other: &Self) -> bool {
+        if self.fields().len() != other.fields().len() {
+            return false;
+        }
+
+        self.fields()
+            .iter()
+            .zip(other.fields().iter())
+            .all(|(f1, f2)| {
+                f1.name() == f2.name()
+                    && DFSchema::datatype_is_semantically_equal(

Review Comment:
   since we use `DFSchema` reference anyway, prob we can reuse `DFSchema::equivalent_names_and_types` without introducing new method. 🤔 Moreover we have Schema <-> DFSchema converters in place.
   
   Otherwise we have to test this method thoroughly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonahgao commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1240834819


##########
datafusion/core/src/physical_plan/insert.rs:
##########
@@ -219,3 +237,25 @@ fn make_count_schema() -> SchemaRef {
         false,
     )]))
 }
+
+fn check_batch(batch: RecordBatch, schema: &SchemaRef) -> Result<RecordBatch> {

Review Comment:
   @comphead Thank you. There are some scenarios where calling this function is not necessary. I have improved it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#issuecomment-1602690896

   Should we update the TableProvider schema, if we insert null items in non-nullable columns? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1237126907


##########
datafusion/core/src/datasource/mod.rs:
##########
@@ -199,3 +200,24 @@ fn get_col_stats(
         })
         .collect()
 }
+
+fn schema_eq_ignore_nullable(

Review Comment:
   @jonahgao That is true and depending on use and what schema is driving. I tested SqlLogic engine which currently uses `.contains` and found that
   this query will fail
   
   ```
   query I
   select null a union all select 1 a;
   ----
   ```
   and this does not
   ```query II
   select 1 a union all select null a;
   ----
   ```
   Depending on driving schema
   
   
   Please try if `.contains` will work for you, if its not, we need to create a dedicated method similar to @alamb suggestion https://docs.rs/datafusion-common/26.0.0/datafusion_common/struct.DFSchema.html#method.equivalent_names_and_types but for `Schema`
   
   If you create the method so I can also fix a hidden bug in sql logic test engine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6722: Fix inserting into a table with non-nullable columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6722:
URL: https://github.com/apache/arrow-datafusion/pull/6722#discussion_r1244321268


##########
datafusion/common/src/dfschema.rs:
##########
@@ -729,6 +729,34 @@ impl From<Field> for DFField {
     }
 }
 
+/// DataFusion-specific extensions to [`Schema`].
+pub trait SchemaExt {
+    /// This is a specialized version of Eq that ignores differences
+    /// in nullability and metadata.
+    ///
+    /// It works the same as [`DFSchema::equivalent_names_and_types`].
+    fn equivalent_names_and_types(&self, other: &Self) -> bool;
+}
+
+impl SchemaExt for Schema {
+    fn equivalent_names_and_types(&self, other: &Self) -> bool {
+        if self.fields().len() != other.fields().len() {
+            return false;
+        }
+
+        self.fields()
+            .iter()
+            .zip(other.fields().iter())
+            .all(|(f1, f2)| {
+                f1.name() == f2.name()
+                    && DFSchema::datatype_is_semantically_equal(

Review Comment:
   I think the problem is that `DFSchema` isn't a wrapper on top of `Schema` (so you can't get a `&Schema` from a `&DFSchema`)
   
   > n. we will have to maintenance this trait and method and have test coverage for it, which currently is missing. @alamb I'll leave it to you
   
   @jonahgao  has updated the tests to cover this new code, so I think we are good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org