You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/06 19:27:17 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6566: Mismatch in MemTable of Select Into when projecting on aggregate window functions

alamb commented on code in PR #6566:
URL: https://github.com/apache/arrow-datafusion/pull/6566#discussion_r1220199844


##########
datafusion/core/src/execution/context.rs:
##########
@@ -518,7 +518,7 @@ impl SessionContext {
                 let physical = DataFrame::new(self.state(), input);
 
                 let batches: Vec<_> = physical.collect_partitioned().await?;
-                let table = Arc::new(MemTable::try_new(schema, batches)?);
+                let table = Arc::new(MemTable::new_not_registered(schema, batches));

Review Comment:
   It seems like the core problem here is `schema` (aka the schema of `input`) does not actually match the schema of the batches that are produced. 
   
   As I think @comphead is suggesting in https://github.com/apache/arrow-datafusion/pull/6566#pullrequestreview-1465439929, this PR seems to be trying to workaround the deeper problem of the window exec producing an output schema (different column names) than the LogicalPlan says. 
   
   Have you looked into making the names consistent?
   



##########
datafusion/core/src/datasource/memory.rs:
##########
@@ -73,6 +75,26 @@ impl MemTable {
         }
     }
 
+    /// Create a new in-memory table from the record batches.
+    /// In case of empty table, the schema is inferred from the input plan.
+    pub fn new_not_registered(

Review Comment:
   I think this code effectively ignores the `schema` argument if any of `partitions` has a RecordBatch and uses that RecordBatch's schema instead. 
   
   I think this is a surprising behavior and might mask errors in the future



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org