Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/08 21:28:57 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan

jorgecarleitao opened a new pull request #8870:
URL: https://github.com/apache/arrow/pull/8870


   This PR simplifies the logical plan by removing `CsvScan`, `ParquetScan` and `InMemoryScan` and encapsulating all of them into `TableScan`.
   
   The underlying observation is that all of these nodes perform the same operation (through the `dyn TableProvider`); they are just described differently in the plan. This PR encapsulates every scan in `TableScan` and removes the other variants.
   
   This makes no difference at execution time - it only simplifies code at the logical level.
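   
   A rough sketch of the resulting shape (the field names follow the diff quoted further down this thread, but the stubs below are illustrative, not the real DataFusion definitions):
   
   use std::sync::Arc;
   
   // Illustrative stand-ins for the real Arrow/DataFusion types.
   trait TableProvider { fn table_name(&self) -> &str; }
   struct Schema;
   type SchemaRef = Arc<Schema>;
   struct DFSchema;
   type DFSchemaRef = Arc<DFSchema>;
   enum TableSource { FromProvider(Arc<dyn TableProvider>) }
   
   // Before: one logical variant per format (CsvScan, ParquetScan,
   // InMemoryScan), each carrying format-specific fields.
   //
   // After: a single TableScan; the format-specific behaviour lives behind
   // the provider.
   enum LogicalPlan {
       TableScan {
           schema_name: String,
           source: TableSource,
           table_schema: SchemaRef,
           projected_schema: DFSchemaRef,
           projection: Option<Vec<usize>>,
       },
       // ... non-scan variants unchanged
   }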


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove edited a comment on pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#issuecomment-741855709


   I think it would be helpful if I explained my design philosophy on this. Bear in mind that I am not an expert here, but this was my initial thinking:
   
   - Logical plans should be trivially serializable. This is a good forcing function to make sure that they can easily be constructed from any programming language and don't become Rust-specific. That matters when building bindings in other languages such as Python and Java, and also when building drivers for distributed systems, so that we can serialize query plans and send them to other processes.
   - The execution context has registries of things like tables and UDFs and is responsible for translating a logical plan into a physical plan, so it makes sense for the physical plan to use opaque `dyn` traits (see the sketch below).
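   
   A minimal sketch of that split (illustrative types standing in for the real ones, and assuming serde with the derive feature):
   
   use std::sync::Arc;
   
   // Stand-in for DataFusion's TableProvider trait.
   trait TableProvider { fn table_name(&self) -> &str; }
   
   // Logical side: plain data that serde can (de)serialize, so bindings in
   // other languages and distributed drivers can construct and ship it.
   #[derive(serde::Serialize, serde::Deserialize)]
   enum LogicalScan {
       Csv { path: String, has_header: bool },
       Parquet { path: String },
   }
   
   // Physical side: produced by the execution context after resolving names
   // against its table/UDF registries, so it is free to hold opaque trait
   // objects.
   struct PhysicalScan {
       provider: Arc<dyn TableProvider>,
       projection: Option<Vec<usize>>,
   }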
   
   In Ballista I currently have my own fork of the DataFusion logical plan. I could continue with that, but was considering using the DataFusion plan directly at some point (I'm not actively working on Ballista at the moment, so this is all a bit theoretical).
   
   I bet @alamb will have some valuable opinions on this.





[GitHub] [arrow] github-actions[bot] commented on pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#issuecomment-741072900


   https://issues.apache.org/jira/browse/ARROW-10854





[GitHub] [arrow] andygrove commented on pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#issuecomment-743269994


   I thought about this some more, and I think the serialization requirements I have in Ballista are beyond the scope of DataFusion: the goal there is to support multiple query engines, with DataFusion being just one of them, so it makes sense for Ballista to keep its own logical query plan that can be translated to DataFusion and other query engines. This change certainly improves the UX of using DataFusion, so it looks good to me.
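   
   Purely as a sketch of that direction (none of the names below are real Ballista or DataFusion APIs, and serde with the derive feature is assumed):
   
   // Ballista keeps its own serializable plan...
   #[derive(serde::Serialize, serde::Deserialize)]
   enum BallistaLogicalPlan {
       ParquetScan { path: String, projection: Option<Vec<usize>> },
       Filter { predicate: String, input: Box<BallistaLogicalPlan> },
   }
   
   // ...and each target engine supplies a translator; a DataFusion-backed
   // implementation would map these variants onto DataFusion's plan builder.
   trait EngineTranslator {
       type EnginePlan;
       type Error;
       fn translate(&self, plan: &BallistaLogicalPlan)
           -> Result<Self::EnginePlan, Self::Error>;
   }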





[GitHub] [arrow] alamb commented on a change in pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#discussion_r539631894



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -71,37 +98,44 @@ impl LogicalPlanBuilder {
         };
 
         let projected_schema = projection
-            .clone()
             .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .or(Some(schema.clone()))
+            .or(Some(schema))
             .unwrap();
+        let projected_schema = DFSchemaRef::new(DFSchema::from(&projected_schema)?);
 
-        Ok(Self::from(&LogicalPlan::CsvScan {
-            path: path.to_owned(),
-            schema: SchemaRef::new(schema),
-            has_header,
-            delimiter: Some(delimiter),
-            projection,
-            projected_schema: DFSchemaRef::new(DFSchema::from(&projected_schema)?),
-        }))
+        let provider = Arc::new(CsvFile::try_new(path, options)?);
+        let schema = provider.schema();
+
+        let table_scan = LogicalPlan::TableScan {
+            schema_name: "".to_string(),
+            source: TableSource::FromProvider(provider),
+            table_schema: schema,
+            projected_schema,
+            projection: None,
+        };
+
+        Ok(Self::from(&table_scan))
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
-        let p = ParquetTable::try_new(path)?;
-        let schema = p.schema();
+        let provider = Arc::new(ParquetTable::try_new(path)?);
+        let schema = provider.schema();
 
         let projected_schema = projection
-            .clone()
             .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()));
         let projected_schema = projected_schema.map_or(schema.clone(), SchemaRef::new);
+        let projected_schema = DFSchemaRef::new(DFSchema::from(&projected_schema)?);

Review comment:
       As an aside, I find the current conversion code to/from `Schema`/`DFSchema` quite cumbersome, and I hope to get a PR up later today to make it easier to use.
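       
       One possible shape for a more ergonomic conversion is an extension trait. This is only a guess at the direction, not necessarily what the follow-up PR implements, and the stubs stand in for the Arrow/DataFusion types:
       
       use std::sync::Arc;
       
       struct Schema;                   // stand-in for arrow::datatypes::Schema
       struct DFSchema;                 // stand-in for DataFusion's DFSchema
       type DFSchemaRef = Arc<DFSchema>;
       type Result<T> = std::result::Result<T, String>;
       
       trait ToDFSchema {
           fn to_dfschema(self) -> Result<DFSchema>;
           fn to_dfschema_ref(self) -> Result<DFSchemaRef>
           where
               Self: Sized,
           {
               Ok(Arc::new(self.to_dfschema()?))
           }
       }
       
       impl ToDFSchema for Schema {
           fn to_dfschema(self) -> Result<DFSchema> {
               // The real conversion would map fields and qualifiers here.
               Ok(DFSchema)
           }
       }
       
       // The repeated `DFSchemaRef::new(DFSchema::from(&projected_schema)?)` in
       // the diff above would then read `projected_schema.to_dfschema_ref()?`.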







[GitHub] [arrow] alamb commented on a change in pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#discussion_r541180190



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -71,37 +98,44 @@ impl LogicalPlanBuilder {
         };
 
         let projected_schema = projection
-            .clone()
             .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .or(Some(schema.clone()))
+            .or(Some(schema))
             .unwrap();
+        let projected_schema = DFSchemaRef::new(DFSchema::from(&projected_schema)?);
 
-        Ok(Self::from(&LogicalPlan::CsvScan {
-            path: path.to_owned(),
-            schema: SchemaRef::new(schema),
-            has_header,
-            delimiter: Some(delimiter),
-            projection,
-            projected_schema: DFSchemaRef::new(DFSchema::from(&projected_schema)?),
-        }))
+        let provider = Arc::new(CsvFile::try_new(path, options)?);
+        let schema = provider.schema();
+
+        let table_scan = LogicalPlan::TableScan {
+            schema_name: "".to_string(),
+            source: TableSource::FromProvider(provider),
+            table_schema: schema,
+            projected_schema,
+            projection: None,
+        };
+
+        Ok(Self::from(&table_scan))
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
-        let p = ParquetTable::try_new(path)?;
-        let schema = p.schema();
+        let provider = Arc::new(ParquetTable::try_new(path)?);
+        let schema = provider.schema();
 
         let projected_schema = projection
-            .clone()
             .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()));
         let projected_schema = projected_schema.map_or(schema.clone(), SchemaRef::new);
+        let projected_schema = DFSchemaRef::new(DFSchema::from(&projected_schema)?);

Review comment:
       https://github.com/apache/arrow/pull/8888 for the record







[GitHub] [arrow] alamb closed pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8870:
URL: https://github.com/apache/arrow/pull/8870


   





[GitHub] [arrow] jorgecarleitao commented on pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#issuecomment-741845545


   @andygrove that is a good point that I had not thought about.
   
   Unfortunately, I also had not thought about this when reviewing the addition of `LogicalPlan::Extension`, `Expr::ScalarUDF`, and `Expr::AggregateUDF`, all of which directly or indirectly use `Arc<dyn X>`. Is this already a concern for Ballista?
   
   Would it be possible to require `TableProvider` to implement [`serde::ser::Serialize`](https://docs.serde.rs/serde/ser/trait.Serialize.html), and use something like https://github.com/dtolnay/erased-serde to serialize trait objects?
   
   Also relevant: https://stackoverflow.com/a/50026869/931303
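   
   A minimal sketch of the erased-serde idea above (the trait is an illustrative stand-in, not the actual `TableProvider`, and serde, serde_json and erased-serde are assumed as dependencies):
   
   use std::sync::Arc;
   
   // Stand-in: the real TableProvider has schema()/scan() and no serde
   // requirement today.
   trait TableProvider: erased_serde::Serialize {
       fn table_name(&self) -> &str;
   }
   
   // erased-serde implements serde::Serialize for `dyn TableProvider`, so the
   // trait object itself can be handed to any serde serializer.
   erased_serde::serialize_trait_object!(TableProvider);
   
   #[derive(serde::Serialize)]
   struct CsvFileProvider {
       path: String,
       has_header: bool,
   }
   
   impl TableProvider for CsvFileProvider {
       fn table_name(&self) -> &str {
           "csv"
       }
   }
   
   fn main() -> serde_json::Result<()> {
       let provider: Arc<dyn TableProvider> = Arc::new(CsvFileProvider {
           path: "example.csv".to_string(),
           has_header: true,
       });
       // Prints the concrete type's fields, e.g. {"path":"example.csv","has_header":true}.
       println!("{}", serde_json::to_string(provider.as_ref())?);
       Ok(())
   }
   
   Deserializing back into a trait object is the harder half; that typically needs a type registry or something like the `typetag` crate.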





[GitHub] [arrow] andygrove commented on pull request #8870: ARROW-10854: [Rust] [DataFusion] Simplify logical plan scans

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8870:
URL: https://github.com/apache/arrow/pull/8870#issuecomment-741830810


   This is potentially going to be a breaking change for my use of the DataFusion logical plan in Ballista because I need to serialize logical plans to protobuf format to transmit them over the network. With the explicit variants for each type of input I have access to all the information I need, but with a reference to a `dyn TableProvider` this becomes opaque. I should have added tests for this use case.
   
   I will have a think about how I can work around this as well. Maybe there are other approaches.
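   
   One conceivable workaround, assuming the provider trait exposes (or could be extended with) an Any-based downcast hook; the types below are illustrative stand-ins, not DataFusion's actual definitions:
   
   use std::any::Any;
   use std::sync::Arc;
   
   trait TableProvider {
       fn as_any(&self) -> &dyn Any;
   }
   
   struct CsvFile {
       path: String,
       has_header: bool,
   }
   
   impl TableProvider for CsvFile {
       fn as_any(&self) -> &dyn Any {
           self
       }
   }
   
   // A protobuf encoder can match on the concrete provider types it knows
   // about and fail (or fall back) for anything it cannot serialize.
   fn describe_scan(provider: &Arc<dyn TableProvider>) -> Option<String> {
       if let Some(csv) = provider.as_any().downcast_ref::<CsvFile>() {
           Some(format!("csv scan of {} (header: {})", csv.path, csv.has_header))
       } else {
           None // unknown provider: cannot be serialized
       }
   }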

