You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/14 11:47:14 UTC

[GitHub] [arrow] rdettai opened a new pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

rdettai opened a new pull request #8910:
URL: https://github.com/apache/arrow/pull/8910


   > Currently, the TableScan logical plan is quite complex. It can either reference a provider or a table name that is registered to the context.
   > 
   > This issue is about linking the logical plan to the TableProvider directly upon parsing of the SQL. This allows to simplify greatly the code and also makes the TableScan plan easier to use by external query plan manipulations.
   
   https://issues.apache.org/jira/browse/ARROW-10900
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744433855


   @XiaokunDing @seddonm1 @Dandandan this might conflict a bit with your changes around statistics


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on a change in pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
waynexia commented on a change in pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#discussion_r543037102



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -90,77 +75,51 @@ impl LogicalPlanBuilder {
         options: CsvReadOptions,
         projection: Option<Vec<usize>>,
     ) -> Result<Self> {
-        let schema: Schema = match options.schema {
-            Some(s) => s.to_owned(),
-            None => CsvFile::try_new(path, options)?
-                .schema()
-                .as_ref()
-                .to_owned(),
-        };
-
-        let projected_schema = projection
-            .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .map_or(SchemaRef::new(schema), SchemaRef::new)
-            .to_dfschema_ref()?;
-
         let provider = Arc::new(CsvFile::try_new(path, options)?);
-        let schema = provider.schema();
-
-        let table_scan = LogicalPlan::TableScan {
-            schema_name: "".to_string(),
-            source: TableSource::FromProvider(provider),
-            table_schema: schema,
-            projected_schema,
-            projection: None,
-        };
-
-        Ok(Self::from(&table_scan))
+        Self::scan("", provider, projection)
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
         let provider = Arc::new(ParquetTable::try_new(path)?);
+        Self::scan("", provider, projection)
+    }
+
+    /// Scan an empty data source, mainly used in tests
+    pub fn scan_empty(
+        name: &str,

Review comment:
       Maybe we can remove this parameter and provide `scan()` a placeholder like `""`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744533089


   https://issues.apache.org/jira/browse/ARROW-10900


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8910:
URL: https://github.com/apache/arrow/pull/8910


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744461751


   Funny thing, I was planning to work on this exact same change ❤️ (removing the `TableProvider`, offering the provider directly, and making the planner call `provider.scan`). I agree with this design  ^_^
   
   Rename also looks good to me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on a change in pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
rdettai commented on a change in pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#discussion_r543331995



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -90,77 +75,51 @@ impl LogicalPlanBuilder {
         options: CsvReadOptions,
         projection: Option<Vec<usize>>,
     ) -> Result<Self> {
-        let schema: Schema = match options.schema {
-            Some(s) => s.to_owned(),
-            None => CsvFile::try_new(path, options)?
-                .schema()
-                .as_ref()
-                .to_owned(),
-        };
-
-        let projected_schema = projection
-            .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .map_or(SchemaRef::new(schema), SchemaRef::new)
-            .to_dfschema_ref()?;
-
         let provider = Arc::new(CsvFile::try_new(path, options)?);
-        let schema = provider.schema();
-
-        let table_scan = LogicalPlan::TableScan {
-            schema_name: "".to_string(),
-            source: TableSource::FromProvider(provider),
-            table_schema: schema,
-            projected_schema,
-            projection: None,
-        };
-
-        Ok(Self::from(&table_scan))
+        Self::scan("", provider, projection)
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
         let provider = Arc::new(ParquetTable::try_new(path)?);
+        Self::scan("", provider, projection)
+    }
+
+    /// Scan an empty data source, mainly used in tests
+    pub fn scan_empty(
+        name: &str,

Review comment:
       This is what I did originally, to match the other similar methods in the block (`scan_csv`, `scan_memory`...). But this one is used by many tests in the codebase, and often they are using names, so we would have needed to edit asserts all over the place. I find that it does not do much harm to leave the it here, but if there is a consensus on the fact that its better removed, I don't mind! 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744392433


   https://issues.apache.org/jira/browse/ARROW-10783


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#discussion_r543720947



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -90,77 +75,51 @@ impl LogicalPlanBuilder {
         options: CsvReadOptions,
         projection: Option<Vec<usize>>,
     ) -> Result<Self> {
-        let schema: Schema = match options.schema {
-            Some(s) => s.to_owned(),
-            None => CsvFile::try_new(path, options)?
-                .schema()
-                .as_ref()
-                .to_owned(),
-        };
-
-        let projected_schema = projection
-            .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .map_or(SchemaRef::new(schema), SchemaRef::new)
-            .to_dfschema_ref()?;
-
         let provider = Arc::new(CsvFile::try_new(path, options)?);
-        let schema = provider.schema();
-
-        let table_scan = LogicalPlan::TableScan {
-            schema_name: "".to_string(),
-            source: TableSource::FromProvider(provider),
-            table_schema: schema,
-            projected_schema,
-            projection: None,
-        };
-
-        Ok(Self::from(&table_scan))
+        Self::scan("", provider, projection)
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
         let provider = Arc::new(ParquetTable::try_new(path)?);
+        Self::scan("", provider, projection)
+    }
+
+    /// Scan an empty data source, mainly used in tests
+    pub fn scan_empty(
+        name: &str,

Review comment:
       I think keeping the name is reasonable here -- sometimes the name is needed when referring to the table in SQL, however, as this is part of the PlanBuilder I agree it isn't obviously going to be useful




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744532747


   >Not sure what you mean here!
   
   I meant `TableSource` :/


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744397488


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=h1) Report
   > Merging [#8910](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=desc) (a551cc6) into [master](https://codecov.io/gh/apache/arrow/commit/5353c285c6dfb3381ac0f1c9e7cd63d7fcb8da4a?el=desc) (5353c28) will **increase** coverage by `0.02%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8910/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8910      +/-   ##
   ==========================================
   + Coverage   75.48%   75.51%   +0.02%     
   ==========================================
     Files         181      182       +1     
     Lines       41649    41633      -16     
   ==========================================
     Hits        31439    31439              
   + Misses      10210    10194      -16     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [...t/datafusion/src/optimizer/projection\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvcHJvamVjdGlvbl9wdXNoX2Rvd24ucnM=) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/tests/dataframe.rs](https://codecov.io/gh/apache/arrow/pull/8910/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2RhdGFmcmFtZS5ycw==) | `0.00% <ø> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=footer). Last update [8ae596a...a551cc6](https://codecov.io/gh/apache/arrow/pull/8910?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744391019


   Sidenote: we should create a helper that projects a schema on Option<Vec<usize>> to avoid having the same code duplicated all over the place


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] waynexia commented on a change in pull request #8910: ARROW-10900: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
waynexia commented on a change in pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#discussion_r543513945



##########
File path: rust/datafusion/src/logical_plan/builder.rs
##########
@@ -90,77 +75,51 @@ impl LogicalPlanBuilder {
         options: CsvReadOptions,
         projection: Option<Vec<usize>>,
     ) -> Result<Self> {
-        let schema: Schema = match options.schema {
-            Some(s) => s.to_owned(),
-            None => CsvFile::try_new(path, options)?
-                .schema()
-                .as_ref()
-                .to_owned(),
-        };
-
-        let projected_schema = projection
-            .map(|p| Schema::new(p.iter().map(|i| schema.field(*i).clone()).collect()))
-            .map_or(SchemaRef::new(schema), SchemaRef::new)
-            .to_dfschema_ref()?;
-
         let provider = Arc::new(CsvFile::try_new(path, options)?);
-        let schema = provider.schema();
-
-        let table_scan = LogicalPlan::TableScan {
-            schema_name: "".to_string(),
-            source: TableSource::FromProvider(provider),
-            table_schema: schema,
-            projected_schema,
-            projection: None,
-        };
-
-        Ok(Self::from(&table_scan))
+        Self::scan("", provider, projection)
     }
 
     /// Scan a Parquet data source
     pub fn scan_parquet(path: &str, projection: Option<Vec<usize>>) -> Result<Self> {
         let provider = Arc::new(ParquetTable::try_new(path)?);
+        Self::scan("", provider, projection)
+    }
+
+    /// Scan an empty data source, mainly used in tests
+    pub fn scan_empty(
+        name: &str,

Review comment:
       Makes sense, I hadn't considered that. This change can be done in future PR if needed, so I agree we can keep them here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai edited a comment on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
rdettai edited a comment on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744391019


   Notes about the code: 
   - we should create a helper that projects a schema on `Option<Vec<usize>>` to avoid having the same code duplicated all over the place, but I think this should be done in a separate PR
   - I removed the `schema_name` field as it isn't used anywhere (yagni)
   - I created an `EmptyTable`, which is convenient for tests but might also be useful for all kinds of plan manipulation where the actual data is not important.
   - I think the most debatable change is about the `SchemaProvider` trait in the sql planner. It was actually not a **schema** provider as it could already provide other info about UDFs / UDAFs. So I completely renamed it to `ContextProvider`, and instead of giving access to only the schema part of the table, if gives access to the entire `TableProvider`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] rdettai commented on pull request #8910: ARROW-10783: [Rust] [DataFusion] Resolve TableScan provider eagerly

Posted by GitBox <gi...@apache.org>.
rdettai commented on pull request #8910:
URL: https://github.com/apache/arrow/pull/8910#issuecomment-744530646


   > I was planning to work on this exact same change
   
   great!
   
   > removing the TableProvider, offering the provider directly, and making the planner call provider.scan
   
   Not sure what you mean here! Isn't `TableProvider` what makes the datasource part of the query engine easily extensible?
   
   Side note: the aim of this work is to prepare https://issues.apache.org/jira/browse/ARROW-10902. Feel free to take a look at it!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org