Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/03 16:16:31 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

jorgecarleitao opened a new pull request #9084:
URL: https://github.com/apache/arrow/pull/9084


   This PR exposes two new functions:
   
   * one to parse a single array from CSV (out of one column of a `[StringRecord]` slice), and
   * one to parse a `RecordBatch` from a `[StringRecord]` slice.
   
   The motivation for the first function is that parsing arrays is trivially parallelizable, so people may want to use e.g. `rayon` to iterate over the fields in parallel and build each array. IMO the `arrow` crate should not make any assumptions about how people want to parallelize this; it should only offer the functionality to do so, in the same way we do with kernels.
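   
   A rough sketch of that use case (assuming `build_array` ends up reachable as `arrow::csv::reader::build_array`, that the caller uses the same `csv` crate version as `arrow`, and with `parse_columns_in_parallel` as a purely illustrative name):
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::ArrayRef;
   use arrow::csv::reader::build_array;
   use arrow::datatypes::Schema;
   use arrow::error::Result;
   use arrow::record_batch::RecordBatch;
   use csv::StringRecord;
   use rayon::prelude::*;
   
   /// Parses each column of `rows` in parallel and assembles a RecordBatch.
   fn parse_columns_in_parallel(
       rows: &[StringRecord],
       schema: Arc<Schema>,
       line_number: usize,
   ) -> Result<RecordBatch> {
       let columns = schema
           .fields()
           .par_iter()
           .enumerate()
           .map(|(i, field)| build_array(rows, field.data_type(), line_number, i))
           .collect::<Result<Vec<ArrayRef>>>()?;
       RecordBatch::try_new(schema, columns)
   }
   ```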
   
   The motivation for the second function stems from the fact that parsing (not the IO) is the slowest operation in reading a CSV, and people may want to iterate over the CSV differently. The main use case here is to split the read of a single CSV file into multiple parts (using `seek`) and return record batches in parallel (DataFusion is the example here). Again, IMO the `arrow` crate should not make assumptions about how to perform this work, and should instead offer the necessary CPU-bound core functionality for users to build on top of.
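   
   And a rough sketch of the seek-based use case (again with made-up names such as `parse_range`, and assuming the caller has already computed a byte offset `start` that falls on a record boundary):
   
   ```rust
   use std::fs::File;
   use std::io::{BufReader, Seek, SeekFrom};
   
   use arrow::csv::reader::build_batch;
   use arrow::datatypes::Field;
   use arrow::error::{ArrowError, Result};
   use arrow::record_batch::RecordBatch;
   use csv::{ReaderBuilder, StringRecord};
   
   /// Parses up to `batch_size` records starting at byte offset `start` of the file.
   fn parse_range(
       path: &str,
       start: u64,
       fields: &[Field],
       batch_size: usize,
       line_number: usize,
   ) -> Result<RecordBatch> {
       let mut file = File::open(path)?;
       file.seek(SeekFrom::Start(start))?;
       let mut csv_reader = ReaderBuilder::new()
           .has_headers(false) // the header line was handled when the offsets were computed
           .from_reader(BufReader::new(file));
   
       let mut rows = Vec::with_capacity(batch_size);
       let mut record = StringRecord::new();
       while rows.len() < batch_size {
           match csv_reader.read_record(&mut record) {
               Ok(true) => rows.push(record.clone()),
               Ok(false) => break,
               Err(e) => return Err(ArrowError::ParseError(e.to_string())),
           }
       }
       build_batch(&rows, fields, &None, line_number)
   }
   ```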


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-780369269


   Thanks @alamb . Let's wait for #8781 to see if this is still needed.





[GitHub] [arrow] jorgecarleitao commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-763308646


   @Dandandan, you have a good point. Since you are working on the cast approach for this, I would also prefer to wait and see how that turns out before committing to the `StringRecord`-based API.





[GitHub] [arrow] Dandandan commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-763148387


   Do you still want to move this forward, @jorgecarleitao?
   I had a few thoughts that followed from this PR:
   
   * Making `StringRecord` part of the public API now would mean introducing a breaking change whenever we change the implementation of the csv parser.
   * We could spend some more time optimizing the parsers themselves, e.g. for csv parsing we could remove the overhead of the `csv` crate (which, according to profiling results, accounts for a big chunk of reading csv), avoid copies, skip utf-8 validation in some cases, etc.
   * The parsers could use a `cast` kernel, although this requires some more changes to `cast`. From there, additional parallelization of parsing might be easier, more generic, and reusable, since we would already have a number of `Array`s. This might even enable more advanced things like using SIMD for parsing some data types (a sketch of the idea follows).
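   
   A minimal sketch of the `cast`-based idea, purely illustrative (`parse_ints_via_cast` is a made-up name) and assuming the string-to-numeric casts cover the types we need:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, StringArray};
   use arrow::compute::cast;
   use arrow::datatypes::DataType;
   use arrow::error::Result;
   
   /// Loads the raw CSV column as UTF-8 values, then lets the cast kernel parse it.
   fn parse_ints_via_cast(raw: Vec<Option<&str>>) -> Result<ArrayRef> {
       let strings: ArrayRef = Arc::new(raw.into_iter().collect::<StringArray>());
       cast(&strings, &DataType::Int64)
   }
   ```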
   





[GitHub] [arrow] Dandandan commented on a change in pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#discussion_r551041037



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -394,88 +393,116 @@ impl<R: Read> Iterator for Reader<R> {
     }
 }
 
-/// parses a slice of [csv_crate::StringRecord] into a [array::record_batch::RecordBatch].
-fn parse(
+/// Tries to create an [array::Array] from a slice of [csv_crate::StringRecord] by interpreting its
+/// values at column `column_index` to be of `data_type`.
+/// `line_number` is where the set of rows starts at, and is only used to report the line number in case of errors.
+/// # Error
+/// This function errors iff:
+/// * _any_ entry from `rows` at `column_index` cannot be parsed into the DataType.
+/// * The [array::datatypes::DataType] is not supported.
+pub fn build_array(
+    rows: &[StringRecord],
+    data_type: &DataType,
+    line_number: usize,
+    column_index: usize,
+) -> Result<ArrayRef> {
+    match data_type {
+        DataType::Boolean => build_boolean_array(line_number, rows, column_index),
+        DataType::Int8 => {
+            build_primitive_array::<Int8Type>(line_number, rows, column_index)
+        }
+        DataType::Int16 => {
+            build_primitive_array::<Int16Type>(line_number, rows, column_index)
+        }
+        DataType::Int32 => {
+            build_primitive_array::<Int32Type>(line_number, rows, column_index)
+        }
+        DataType::Int64 => {
+            build_primitive_array::<Int64Type>(line_number, rows, column_index)
+        }
+        DataType::UInt8 => {
+            build_primitive_array::<UInt8Type>(line_number, rows, column_index)
+        }
+        DataType::UInt16 => {
+            build_primitive_array::<UInt16Type>(line_number, rows, column_index)
+        }
+        DataType::UInt32 => {
+            build_primitive_array::<UInt32Type>(line_number, rows, column_index)
+        }
+        DataType::UInt64 => {
+            build_primitive_array::<UInt64Type>(line_number, rows, column_index)
+        }
+        DataType::Float32 => {
+            build_primitive_array::<Float32Type>(line_number, rows, column_index)
+        }
+        DataType::Float64 => {
+            build_primitive_array::<Float64Type>(line_number, rows, column_index)
+        }
+        DataType::Date32(_) => {
+            build_primitive_array::<Date32Type>(line_number, rows, column_index)
+        }
+        DataType::Date64(_) => {
+            build_primitive_array::<Date64Type>(line_number, rows, column_index)
+        }
+        DataType::Timestamp(TimeUnit::Microsecond, _) => {
+            build_primitive_array::<TimestampMicrosecondType>(
+                line_number,
+                rows,
+                column_index,
+            )
+        }
+        DataType::Timestamp(TimeUnit::Nanosecond, _) => build_primitive_array::<
+            TimestampNanosecondType,
+        >(
+            line_number, rows, column_index
+        ),
+        DataType::Utf8 => Ok(Arc::new(
+            rows.iter()
+                .map(|row| row.get(column_index))
+                .collect::<StringArray>(),
+        ) as ArrayRef),
+        other => Err(ArrowError::ParseError(format!(
+            "Unsupported data type {:?}",
+            other
+        ))),
+    }
+}
+
+/// Tries to create an [array::record_batch::RecordBatch] from a slice of [csv_crate::StringRecord] by interpreting
+/// each of its columns according to `fields`. When `projection` is not None, it is used to select a subset of `fields` to
+/// parse.
+/// `line_number` is where the set of rows starts at, and is only used to report the line number in case of errors.
+/// # Error
+/// This function errors iff:
+/// * _any_ entry from `rows` cannot be parsed into its corresponding field's `DataType`.
+/// * Any of the fields' [array::datatypes::DataType] is not supported.
+/// # Panic
+/// This function panics if any index in `projection` is larger than `fields.len()`.
+pub fn build_batch(
     rows: &[StringRecord],
     fields: &[Field],
     projection: &Option<Vec<usize>>,
     line_number: usize,
 ) -> Result<RecordBatch> {
     let projection: Vec<usize> = match projection {
         Some(ref v) => v.clone(),
-        None => fields.iter().enumerate().map(|(i, _)| i).collect(),
+        None => (0..fields.len()).collect(),

Review comment:
       👍 I think the `v.clone()` could maybe even be removed?
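       
       For example, a hypothetical shape (not necessarily how it should land in the reader):
       
       ```rust
       use std::borrow::Cow;
       
       /// Borrows the projection when it is given; only allocates for the default case.
       fn projected_indices<'a>(
           projection: &'a Option<Vec<usize>>,
           n_fields: usize,
       ) -> Cow<'a, [usize]> {
           match projection {
               Some(v) => Cow::Borrowed(v.as_slice()),
               None => Cow::Owned((0..n_fields).collect()),
           }
       }
       ```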







[GitHub] [arrow] alamb commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-778610380


   @jorgecarleitao 
   
   What is the status of this PR?
   
   As part of trying to clean up the backlog of Rust PRs in this repo, I am going through seemingly stale PRs and pinging the authors to see if there are any plans to continue the work or conversation.





[GitHub] [arrow] Dandandan edited a comment on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753658416


   This looks cool @jorgecarleitao !
   
   Some thoughts for the future of the csv and other parsers:
   
   * It might be worth exploring whether we can use the `cast` (or a similar) kernel of Arrow to parse the data. The benefit would be that we could just load the data (as bytes / strings) into arrays and utilize the existing parsing logic in Arrow. I think this is interesting because from that point the code can be vectorized / use SIMD and be parallelized more easily, it reduces code duplication, and it creates more incentive to improve the `cast` kernels, which benefits more than "only" one parser.
   * For further optimization it might be worth stopping the use of `StringRecord`s at some point (and using `csv_core` instead), as there is quite some overhead associated with them compared to "just" loading the bytes from the file (see the sketch after this list). How would this fit into your suggestion?
   * For a user like DataFusion, it might often not make sense to parallelize within a single file (if there are many files), so I think we should not make the parser slower or more resource-hungry for a single thread. This is more a general thing we should keep in mind.
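   
   A tiny sketch of the `csv_core` direction (illustrative only; `first_record_fields` is a made-up helper, real code would reuse buffers across records, and the input is assumed to end with a record terminator):
   
   ```rust
   use csv_core::{ReadRecordResult, Reader};
   
   /// Splits one record into owned field byte buffers without allocating a StringRecord.
   fn first_record_fields(input: &[u8]) -> Vec<Vec<u8>> {
       let mut reader = Reader::new();
       let mut output = vec![0u8; input.len()];
       let mut ends = vec![0usize; 256];
       let (result, _read, _written, n_ends) =
           reader.read_record(input, &mut output, &mut ends);
       assert!(matches!(
           result,
           ReadRecordResult::Record | ReadRecordResult::End
       ));
       // `ends` holds, for each field, the offset in `output` at which it ends.
       let mut fields = Vec::with_capacity(n_ends);
       let mut start = 0;
       for &end in &ends[..n_ends] {
           fields.push(output[start..end].to_vec());
           start = end;
       }
       fields
   }
   ```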
   





[GitHub] [arrow] Dandandan commented on a change in pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#discussion_r551040443



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -394,88 +393,116 @@ impl<R: Read> Iterator for Reader<R> {
     }
 }
 
-/// parses a slice of [csv_crate::StringRecord] into a [array::record_batch::RecordBatch].
-fn parse(
+/// Tries to create an [array::Array] from a slice of [csv_crate::StringRecord] by interpreting its
+/// values at column `column_index` to be of `data_type`.
+/// `line_number` is where the set of rows starts at, and is only used to report the line number in case of errors.
+/// # Error
+/// This function errors iff:
+/// * _any_ entry from `rows` at `column_index` cannot be parsed into the DataType.
+/// * The [array::datatypes::DataType] is not supported.
+pub fn build_array(

Review comment:
       The downside is that this creates a dependency on `StringRecord` in the public API, making it harder to remove it when we want?
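       
       For instance, a hypothetical signature that keeps `StringRecord` out of the public surface (the name `build_string_column` is made up; typed parsing could then be layered on top, e.g. via `cast`):
       
       ```rust
       use std::sync::Arc;
       
       use arrow::array::{ArrayRef, StringArray};
       use arrow::error::Result;
       
       /// Accepts any source of per-row string values for one column, instead of
       /// tying the signature to csv::StringRecord.
       fn build_string_column<'a, I>(values: I) -> Result<ArrayRef>
       where
           I: IntoIterator<Item = Option<&'a str>>,
       {
           Ok(Arc::new(values.into_iter().collect::<StringArray>()) as ArrayRef)
       }
       ```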







[GitHub] [arrow] github-actions[bot] commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753640897


   https://issues.apache.org/jira/browse/ARROW-11119





[GitHub] [arrow] jorgecarleitao closed pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #9084:
URL: https://github.com/apache/arrow/pull/9084


   





[GitHub] [arrow] codecov-io commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753642780


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=h1) Report
   > Merging [#9084](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=desc) (f764311) into [master](https://codecov.io/gh/apache/arrow/commit/dfef236f7587e4168ac1e07bd09e42d9373beb70?el=desc) (dfef236) will **decrease** coverage by `0.00%`.
   > The diff coverage is `92.59%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9084/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9084      +/-   ##
   ==========================================
   - Coverage   82.60%   82.60%   -0.01%     
   ==========================================
     Files         204      204              
     Lines       50189    50174      -15     
   ==========================================
   - Hits        41459    41446      -13     
   + Misses       8730     8728       -2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow/pull/9084/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY3N2L3JlYWRlci5ycw==) | `93.14% <92.59%> (-0.02%)` | :arrow_down: |
   | [rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/9084/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz) | `73.83% <0.00%> (-2.24%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_struct.rs](https://codecov.io/gh/apache/arrow/pull/9084/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RydWN0LnJz) | `88.43% <0.00%> (-0.18%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9084/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `92.69% <0.00%> (+0.37%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=footer). Last update [dfef236...f764311](https://codecov.io/gh/apache/arrow/pull/9084?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   





[GitHub] [arrow] Dandandan commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-753658416


   This looks cool @jorgecarleitao !
   
   Some thoughts for the future of the csv and other parsers:
   
   * It might be worth exploring whether we can use the `cast` (or a similar) kernel of Arrow to parse the data. The benefit would be that we could just load the data (as bytes / strings) into arrays and utilize the existing parsing logic in Arrow. I think this is interesting because from that point the code can be vectorized / use SIMD and be parallelized more easily, it reduces code duplication, and it creates more incentive to improve the `cast` kernels, which benefits more than "only" one parser.
   * For further optimization it might be worth stopping the use of `StringRecord`s at some point (and using `csv_core` instead), as there is quite some overhead associated with them compared to "just" loading the bytes from the file. How would this fit into your suggestion?
   * For a user like DataFusion, it might often not make sense to parallelize within a single file, so I think we should not make the parser slower or more resource-hungry for a single thread.
   





[GitHub] [arrow] alamb commented on pull request #9084: ARROW-11119: [Rust] Expose functions to parse a single CSV column / StringRecord into an array / recordBatch

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9084:
URL: https://github.com/apache/arrow/pull/9084#issuecomment-763137667


   Given the age of this PR, I think we should rebase it against the latest master prior to merging it in.

