You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/21 00:29:45 UTC

[GitHub] [arrow] houqp opened a new pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

houqp opened a new pull request #7238:
URL: https://github.com/apache/arrow/pull/7238


   A new `CsvReadOptions` is introduced to capture the following CSV read configurations:
   
   * has_header
   * optional delimiter
   * optional schema
   * number of records to read for schema inference
   
   See changes in `rust/datafusion/examples/csv_sql.rs` for an example on how the new interface looks like.
   
   Initially I thought we can unify all CSV read code path using one single options struct. It turns out it's not possible. Components from low level of the stack including `arrow::csv::reader::Read`, `CsvParition`, `CsvIterator` all work under the assumption that schema has already been defined, which makes some of the fields in CsvReadOptions irrelevant.
   
   Therefore, the newly introduced CsvReadOptions struct is only used in high level user facing APIs including: `CsvFile`, `CsvExec`, `LogicalPlanBuilder::scan_csv`, `context::register_csv`, etc.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#issuecomment-632401102


   @houqp Could you push an empty commit to trigger CI. The tests are showing as skipped, probably due to WIP in title.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] paddyhoran commented on a change in pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

paddyhoran commented on a change in pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#discussion_r428641550



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -28,6 +28,34 @@ use arrow::csv;
 use arrow::datatypes::Schema;
 use arrow::record_batch::RecordBatch;
 
+/// CSV file read option
+#[derive(Copy, Clone)]
+pub struct CsvReadOptions<'a> {
+    /// Does the CSV file have a header?
+    ///
+    /// If schema inference is run on a file with no headers, default column names
+    /// are created.
+    pub has_header: bool,
+    /// An optional column delimiter. Defaults to `b','`.
+    pub delimiter: Option<u8>,
+    /// An optional schema representing the CSV files. If None, CSV reader will try to infer it
+    /// based on data in file.
+    pub schema: Option<&'a Schema>,
+    /// number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
+    pub schema_infer_read_records: Option<usize>,

Review comment:
       Just a nit really but there are a number of defaults that are set through `default`.  But don't we need a `new` or similar to always enforce these defaults as the comments state.  For instance, you can currently create a `CsvReadOptions` where `schema_infer_read_records` is `None` and then we have to make sure we enforce the default (1000) somewhere else where it's used.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] houqp commented on a change in pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#discussion_r428708023



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -28,6 +28,34 @@ use arrow::csv;
 use arrow::datatypes::Schema;
 use arrow::record_batch::RecordBatch;
 
+/// CSV file read option
+#[derive(Copy, Clone)]
+pub struct CsvReadOptions<'a> {
+    /// Does the CSV file have a header?
+    ///
+    /// If schema inference is run on a file with no headers, default column names
+    /// are created.
+    pub has_header: bool,
+    /// An optional column delimiter. Defaults to `b','`.
+    pub delimiter: Option<u8>,
+    /// An optional schema representing the CSV files. If None, CSV reader will try to infer it
+    /// based on data in file.
+    pub schema: Option<&'a Schema>,
+    /// number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
+    pub schema_infer_read_records: Option<usize>,

Review comment:
       I think to enforce that, it's best to switch to builder pattern. The UX will look like this:
   
   ```rust
       ctx.register_csv(
           "aggregate_test_100",
           &format!("{}/csv/aggregate_test_100.csv", testdata),
           CsvReadOptions::new().schema_infer_read_records(50).delimiter(b'|'),
           true,
       )?;
   ```
   
   What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove closed pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX

Posted by GitBox <gi...@apache.org>.

andygrove closed pull request #7238:
URL: https://github.com/apache/arrow/pull/7238


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] houqp commented on a change in pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#discussion_r428708023



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -28,6 +28,34 @@ use arrow::csv;
 use arrow::datatypes::Schema;
 use arrow::record_batch::RecordBatch;
 
+/// CSV file read option
+#[derive(Copy, Clone)]
+pub struct CsvReadOptions<'a> {
+    /// Does the CSV file have a header?
+    ///
+    /// If schema inference is run on a file with no headers, default column names
+    /// are created.
+    pub has_header: bool,
+    /// An optional column delimiter. Defaults to `b','`.
+    pub delimiter: Option<u8>,
+    /// An optional schema representing the CSV files. If None, CSV reader will try to infer it
+    /// based on data in file.
+    pub schema: Option<&'a Schema>,
+    /// number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
+    pub schema_infer_read_records: Option<usize>,

Review comment:
       To enforce that, it's best to switch to builder pattern. Here is what the UX will look like:
   
   ```rust
       ctx.register_csv(
           "aggregate_test_100",
           &format!("{}/csv/aggregate_test_100.csv", testdata),
           CsvReadOptions::new().schema_infer_read_records(50).delimiter(b'|'),
           true,
       )?;
   ```
   
   What do you think?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#issuecomment-631808670


   https://issues.apache.org/jira/browse/ARROW-8877


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] houqp commented on pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#issuecomment-632410180


   Ha, I have been wondering why those tasks are skipped, now it makes sense. @andygrove updated, all builds are passing now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] houqp commented on pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#issuecomment-632298460


   @nevi-me sorry, the PR was in WIP state. I have now pushed all the changes including docs and removing default trait to enforce default presets.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] houqp commented on pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#issuecomment-631809066


   First follow up PR for https://github.com/apache/arrow/pull/7210.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] paddyhoran commented on a change in pull request #7238: ARROW-8877: [Rust] [DataFusion] introduce CsvReadOption struct to simplify UX for read CSV data

Posted by GitBox <gi...@apache.org>.

paddyhoran commented on a change in pull request #7238:
URL: https://github.com/apache/arrow/pull/7238#discussion_r428729008



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -28,6 +28,34 @@ use arrow::csv;
 use arrow::datatypes::Schema;
 use arrow::record_batch::RecordBatch;
 
+/// CSV file read option
+#[derive(Copy, Clone)]
+pub struct CsvReadOptions<'a> {
+    /// Does the CSV file have a header?
+    ///
+    /// If schema inference is run on a file with no headers, default column names
+    /// are created.
+    pub has_header: bool,
+    /// An optional column delimiter. Defaults to `b','`.
+    pub delimiter: Option<u8>,
+    /// An optional schema representing the CSV files. If None, CSV reader will try to infer it
+    /// based on data in file.
+    pub schema: Option<&'a Schema>,
+    /// number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
+    pub schema_infer_read_records: Option<usize>,

Review comment:
       Yea, I think that works.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org