You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/13 15:41:37 UTC

[GitHub] [arrow-datafusion] doki23 opened a new pull request, #4194: create table with schema

doki23 opened a new pull request, #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194

   # Which issue does this PR close?
   Closes #4183 
   
   # What changes are included in this PR?
   Most changes are in planner of creating memory table.
   
   # Are these changes tested?
   Yes.
   
   # Are there any user-facing changes?
   Users can create a table with a specified schema with a values query.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1024363641


##########
datafusion/core/tests/sql/create_drop.rs:
##########
@@ -49,6 +49,76 @@ async fn create_table_as() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn create_table_with_schema_as_select() -> Result<()> {
+    let ctx = SessionContext::new();
+    register_aggregate_simple_csv(&ctx).await?;
+
+    let sql = "CREATE TABLE my_table(c1 float, c2 double, c3 boolean, c4 varchar) \

Review Comment:
   ![its-so-beautiful-crying-gif](https://user-images.githubusercontent.com/490673/202262919-652bca19-0f06-4e39-8498-33eab8b89857.gif)
   



##########
datafusion/core/tests/sql/create_drop.rs:
##########
@@ -49,6 +49,76 @@ async fn create_table_as() -> Result<()> {
     Ok(())
 }
 
+#[tokio::test]
+async fn create_table_with_schema_as_select() -> Result<()> {
+    let ctx = SessionContext::new();
+    register_aggregate_simple_csv(&ctx).await?;
+
+    let sql = "CREATE TABLE my_table(c1 float, c2 double, c3 boolean, c4 varchar) \
+    AS SELECT *,c3 as c4_tmp FROM aggregate_simple";
+    ctx.sql(sql).await.unwrap();
+
+    let sql_all = "SELECT * FROM my_table order by c1 LIMIT 1";
+    let results_all = execute_to_batches(&ctx, sql_all).await;
+
+    let expected = vec![
+        "+---------+----------------+------+----+",
+        "| c1      | c2             | c3   | c4 |",
+        "+---------+----------------+------+----+",
+        "| 0.00001 | 0.000000000001 | true | 1  |",
+        "+---------+----------------+------+----+",
+    ];
+
+    assert_batches_eq!(expected, &results_all);
+
+    Ok(())
+}
+
+#[tokio::test]
+async fn create_table_with_schema_as_select_mismatch() -> Result<()> {
+    let ctx = SessionContext::new();
+    register_aggregate_simple_csv(&ctx).await?;
+
+    let sql = "CREATE TABLE my_table(c1 float, c2 double, c3 boolean, c4 varchar) \

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #4194: Support `create table` with explicit column definitions

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#issuecomment-1317709015

   Benchmark runs are scheduled for baseline = 88c12013c4de0521e080d29b3f5411b44fd6af03 and contender = 74199d632d9d5625686b76a7f981cc8641a1f2a8. 74199d632d9d5625686b76a7f981cc8641a1f2a8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d8354646821a4031972a7eccf559ecfc...83561c3e02e846d48bea9801d1d6fc14/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/13f748a6dd7d40ebbaa70ca9cec77d95...bb734fb0f02f44eda8cae9bbb40ddbdb/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/23cb95b79c8e41c384691bd64a4e3aa1...25a56775ee40408487c4db7dbdc31426/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/302e925146f8448da6a983481dcc6374...10bef71c3c614d4dad3db1d477130db1/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4194: Support `create table` with explicit column definitions

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#issuecomment-1317697486

   BTW I also made a small change to one of the tests to satisfy clippy https://github.com/apache/arrow-datafusion/pull/4194/commits/a2172db162e26c2f729663d541f337db3b006ae9


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1023050220


##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))

Review Comment:
   ```suggestion
                                   return Err(DataFusionError::Plan(format!("Mismatch: {} columns specified, but result has {} columns", schema.fields.len(), input_schema.fields().len()))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] doki23 commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
doki23 commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1023411537


##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))
+                            }
+                            let input_fields = input_schema.fields();
+                            let project_exprs = schema.fields().iter().zip(input_fields).map(|(field, input_field)| {
+                                cast(col(input_field.name()), field.data_type().clone()).alias(field.name())
+                            }).collect::<Vec<_>>();
+                            LogicalPlanBuilder::from(plan.clone())
+                                .project(project_exprs)?
+                                .build()?
+                        },
+                        _ => return Err(DataFusionError::Plan(
+                            "You can only specify schema when create table with a `values` statement"
+                                .to_string()
+                        ))

Review Comment:
   It's because `SELECT` has its own schema, but surely,  it makes sense if someone wants to cast these types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4194: Support `create table` with explicit column definitions

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#issuecomment-1317697757

   Thanks again @doki23 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4194: Support `create table` with explicit column definitions

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#issuecomment-1317490728

   I took the liberty of merging this PR up to master and fixing the clippy issue so we can get it merged. Thanks again @doki23 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1023043918


##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))

Review Comment:
   I suggest adding test for this situation (incorrect length)
   
   



##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))
+                            }
+                            let input_fields = input_schema.fields();
+                            let project_exprs = schema.fields().iter().zip(input_fields).map(|(field, input_field)| {
+                                cast(col(input_field.name()), field.data_type().clone()).alias(field.name())
+                            }).collect::<Vec<_>>();
+                            LogicalPlanBuilder::from(plan.clone())
+                                .project(project_exprs)?
+                                .build()?
+                        },
+                        _ => return Err(DataFusionError::Plan(
+                            "You can only specify schema when create table with a `values` statement"
+                                .to_string()
+                        ))

Review Comment:
   I wonder why not support queries as well? Why only `VALUES` -- the code you have written should work for a `CREATE TABLE AS SELECT` as well
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] doki23 commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
doki23 commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1023411537


##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))
+                            }
+                            let input_fields = input_schema.fields();
+                            let project_exprs = schema.fields().iter().zip(input_fields).map(|(field, input_field)| {
+                                cast(col(input_field.name()), field.data_type().clone()).alias(field.name())
+                            }).collect::<Vec<_>>();
+                            LogicalPlanBuilder::from(plan.clone())
+                                .project(project_exprs)?
+                                .build()?
+                        },
+                        _ => return Err(DataFusionError::Plan(
+                            "You can only specify schema when create table with a `values` statement"
+                                .to_string()
+                        ))

Review Comment:
   It's because `SELECT` has its own schema, but surely,  it makes sense if someone wants to cast these types. I'll consider your suggestions and add some tests in this situation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] doki23 commented on a diff in pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
doki23 commented on code in PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#discussion_r1023411537


##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))
+                            }
+                            let input_fields = input_schema.fields();
+                            let project_exprs = schema.fields().iter().zip(input_fields).map(|(field, input_field)| {
+                                cast(col(input_field.name()), field.data_type().clone()).alias(field.name())
+                            }).collect::<Vec<_>>();
+                            LogicalPlanBuilder::from(plan.clone())
+                                .project(project_exprs)?
+                                .build()?
+                        },
+                        _ => return Err(DataFusionError::Plan(
+                            "You can only specify schema when create table with a `values` statement"
+                                .to_string()
+                        ))

Review Comment:
   It's because `SELECT` has its own schema, but it makes sense if someone want to cast these types.



##########
datafusion/sql/src/planner.rs:
##########
@@ -180,12 +180,36 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 if_not_exists,
                 or_replace,
                 ..
-            } if columns.is_empty()
-                && constraints.is_empty()
+            } if constraints.is_empty()
                 && table_properties.is_empty()
                 && with_options.is_empty() =>
             {
-                let plan = self.query_to_plan(*query, &mut HashMap::new())?;
+                let plan = self.query_to_plan(*query.clone(), &mut HashMap::new())?;
+                let input_schema = plan.schema();
+
+                let plan = if !columns.is_empty() {
+                    match *query.body {
+                        SetExpr::Values(_) => {
+                            let schema = self.build_schema(columns)?.to_dfschema_ref()?;
+                            if schema.fields().len() != input_schema.fields().len() {
+                                return Err(DataFusionError::Plan("Mismatch between schema and batches".to_string()))
+                            }
+                            let input_fields = input_schema.fields();
+                            let project_exprs = schema.fields().iter().zip(input_fields).map(|(field, input_field)| {
+                                cast(col(input_field.name()), field.data_type().clone()).alias(field.name())
+                            }).collect::<Vec<_>>();
+                            LogicalPlanBuilder::from(plan.clone())
+                                .project(project_exprs)?
+                                .build()?
+                        },
+                        _ => return Err(DataFusionError::Plan(
+                            "You can only specify schema when create table with a `values` statement"
+                                .to_string()
+                        ))

Review Comment:
   It's because `SELECT` has its own schema, but it makes sense if someone wants to cast these types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] doki23 commented on pull request #4194: create table with schema

Posted by GitBox <gi...@apache.org>.
doki23 commented on PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194#issuecomment-1316147872

   > Thank you @doki23 -- this looks like a nice improvement! I have a few suggestions -- the only thing I think is needed prior to merge is a test for mismatched column count.
   > 
   > We can file a follow up to support specifying column types from a create table as select as a follow on PR
   
   I'm going to do it in this pr, here's no need to file a follow up issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #4194: Support `create table` with explicit column definitions

Posted by GitBox <gi...@apache.org>.
alamb merged PR #4194:
URL: https://github.com/apache/arrow-datafusion/pull/4194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org