You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/31 01:32:13 UTC

[GitHub] [arrow-rs] sum12 opened a new pull request #1112: allow using custom datetime format for inference and parsing csv file

sum12 opened a new pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112


   The patch extends the current implementation to allow passing a custom
   datetime_re and datetime_format to the ReaderBuilder.
   
   datetime_re is used infer schema of the csv and then datetime_format is
   used to parse the actual string to a Date64.
   ofcourse  passing non-compatible datetime_re and datetime_format values
   is going to fail the parsing or inference, however it is an expected but
   hard-to-detect failure.
   
   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
    
   Sometimes csv files follow non-standard datetime format. Arrow does not really care for the datetime format stored in the csv but merely guesses/assumes it to be of a certain fomrat. The change extends the csv reader to use a custom fromat.
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   Any external libraries relying on ReaderBuilder and Reader structs will have to be updated to the new new API, 
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] liukun4515 commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777161244



##########
File path: arrow/src/csv/reader.rs
##########
@@ -1522,18 +1624,26 @@ mod tests {
 
     #[test]
     fn test_infer_field_schema() {
-        assert_eq!(infer_field_schema("A"), DataType::Utf8);
-        assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
-        assert_eq!(infer_field_schema("10"), DataType::Int64);
-        assert_eq!(infer_field_schema("10.2"), DataType::Float64);
-        assert_eq!(infer_field_schema(".2"), DataType::Float64);
-        assert_eq!(infer_field_schema("2."), DataType::Float64);
-        assert_eq!(infer_field_schema("true"), DataType::Boolean);
-        assert_eq!(infer_field_schema("false"), DataType::Boolean);
-        assert_eq!(infer_field_schema("2020-11-08"), DataType::Date32);
-        assert_eq!(infer_field_schema("2020-11-08T14:20:01"), DataType::Date64);
-        assert_eq!(infer_field_schema("-5.13"), DataType::Float64);
-        assert_eq!(infer_field_schema("0.1300"), DataType::Float64);
+        assert_eq!(infer_field_schema("A", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("\"123\"", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("10", None), DataType::Int64);
+        assert_eq!(infer_field_schema("10.2", None), DataType::Float64);
+        assert_eq!(infer_field_schema(".2", None), DataType::Float64);
+        assert_eq!(infer_field_schema("2.", None), DataType::Float64);
+        assert_eq!(infer_field_schema("true", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("false", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("2020-11-08", None), DataType::Date32);
+        assert_eq!(
+            infer_field_schema("2020-11-08T14:20:01", None),
+            DataType::Date64
+        );
+        let reg = Regex::new(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d$").ok();

Review comment:
       it's great.
   No more needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sum12 commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
sum12 commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777114869



##########
File path: arrow/src/csv/reader.rs
##########
@@ -520,47 +554,60 @@ fn parse(
                 DataType::Decimal(precision, scale) => {
                     build_decimal_array(line_number, rows, i, *precision, *scale)
                 }
-                DataType::Int8 => build_primitive_array::<Int8Type>(line_number, rows, i),
+                DataType::Int8 => {
+                    build_primitive_array::<Int8Type>(line_number, rows, i, None)
+                }
                 DataType::Int16 => {
-                    build_primitive_array::<Int16Type>(line_number, rows, i)
+                    build_primitive_array::<Int16Type>(line_number, rows, i, None)
                 }
                 DataType::Int32 => {
-                    build_primitive_array::<Int32Type>(line_number, rows, i)
+                    build_primitive_array::<Int32Type>(line_number, rows, i, None)
                 }
                 DataType::Int64 => {
-                    build_primitive_array::<Int64Type>(line_number, rows, i)
+                    build_primitive_array::<Int64Type>(line_number, rows, i, None)
                 }
                 DataType::UInt8 => {
-                    build_primitive_array::<UInt8Type>(line_number, rows, i)
+                    build_primitive_array::<UInt8Type>(line_number, rows, i, None)
                 }
                 DataType::UInt16 => {
-                    build_primitive_array::<UInt16Type>(line_number, rows, i)
+                    build_primitive_array::<UInt16Type>(line_number, rows, i, None)
                 }
                 DataType::UInt32 => {
-                    build_primitive_array::<UInt32Type>(line_number, rows, i)
+                    build_primitive_array::<UInt32Type>(line_number, rows, i, None)
                 }
                 DataType::UInt64 => {
-                    build_primitive_array::<UInt64Type>(line_number, rows, i)
+                    build_primitive_array::<UInt64Type>(line_number, rows, i, None)
                 }
                 DataType::Float32 => {
-                    build_primitive_array::<Float32Type>(line_number, rows, i)
+                    build_primitive_array::<Float32Type>(line_number, rows, i, None)
                 }
                 DataType::Float64 => {
-                    build_primitive_array::<Float64Type>(line_number, rows, i)
+                    build_primitive_array::<Float64Type>(line_number, rows, i, None)
                 }
                 DataType::Date32 => {
-                    build_primitive_array::<Date32Type>(line_number, rows, i)
-                }
-                DataType::Date64 => {
-                    build_primitive_array::<Date64Type>(line_number, rows, i)
+                    build_primitive_array::<Date32Type>(line_number, rows, i, None)

Review comment:
       this would need a `date_format`. 
   Since `datetime_re`/`DATETIME_RE` regex rejected the string. Then using the same format for date32 and date64 can lead to parsing errors 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] liukun4515 commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r776913217



##########
File path: arrow/src/csv/reader.rs
##########
@@ -1522,18 +1624,26 @@ mod tests {
 
     #[test]
     fn test_infer_field_schema() {
-        assert_eq!(infer_field_schema("A"), DataType::Utf8);
-        assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
-        assert_eq!(infer_field_schema("10"), DataType::Int64);
-        assert_eq!(infer_field_schema("10.2"), DataType::Float64);
-        assert_eq!(infer_field_schema(".2"), DataType::Float64);
-        assert_eq!(infer_field_schema("2."), DataType::Float64);
-        assert_eq!(infer_field_schema("true"), DataType::Boolean);
-        assert_eq!(infer_field_schema("false"), DataType::Boolean);
-        assert_eq!(infer_field_schema("2020-11-08"), DataType::Date32);
-        assert_eq!(infer_field_schema("2020-11-08T14:20:01"), DataType::Date64);
-        assert_eq!(infer_field_schema("-5.13"), DataType::Float64);
-        assert_eq!(infer_field_schema("0.1300"), DataType::Float64);
+        assert_eq!(infer_field_schema("A", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("\"123\"", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("10", None), DataType::Int64);
+        assert_eq!(infer_field_schema("10.2", None), DataType::Float64);
+        assert_eq!(infer_field_schema(".2", None), DataType::Float64);
+        assert_eq!(infer_field_schema("2.", None), DataType::Float64);
+        assert_eq!(infer_field_schema("true", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("false", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("2020-11-08", None), DataType::Date32);
+        assert_eq!(
+            infer_field_schema("2020-11-08T14:20:01", None),
+            DataType::Date64
+        );
+        let reg = Regex::new(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d$").ok();

Review comment:
       Can you add some negative test?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#issuecomment-1003240369


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1112](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (af441b8) into [master](https://codecov.io/gh/apache/arrow-rs/commit/2ad99ec33c757d6c7b4a5e69e19a4a096813b53b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2ad99ec) will **decrease** coverage by `0.03%`.
   > The diff coverage is `67.50%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1112/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1112      +/-   ##
   ==========================================
   - Coverage   82.34%   82.30%   -0.04%     
   ==========================================
     Files         168      168              
     Lines       49479    49546      +67     
   ==========================================
   + Hits        40743    40779      +36     
   - Misses       8736     8767      +31     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `87.99% <67.22%> (-2.59%)` | :arrow_down: |
   | [arrow/src/csv/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi93cml0ZXIucnM=) | `72.13% <100.00%> (+0.10%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.56% <0.00%> (ø)` | |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.68% <0.00%> (+0.30%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [2ad99ec...af441b8](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777231360



##########
File path: arrow/src/csv/reader.rs
##########
@@ -161,38 +160,31 @@ pub fn infer_reader_schema<R: Read>(
     max_read_records: Option<usize>,
     has_header: bool,
 ) -> Result<(Schema, usize)> {
-    infer_reader_schema_with_csv_options(
-        reader,
-        delimiter,
+    let roptions = ReaderOptions {
+        delimiter: Some(delimiter),
         max_read_records,
         has_header,
-        None,
-        None,
-        None,
-    )
+        ..Default::default()
+    };
+    infer_reader_schema_with_csv_options(reader, roptions)
 }
 
 fn infer_reader_schema_with_csv_options<R: Read>(
     reader: &mut R,
-    delimiter: u8,
-    max_read_records: Option<usize>,
-    has_header: bool,
-    escape: Option<u8>,
-    quote: Option<u8>,
-    terminator: Option<u8>,
+    roptions: ReaderOptions,

Review comment:
       👍  this is much nicer




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#issuecomment-1003240369


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1112](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2e91790) into [master](https://codecov.io/gh/apache/arrow-rs/commit/2ad99ec33c757d6c7b4a5e69e19a4a096813b53b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2ad99ec) will **decrease** coverage by `0.06%`.
   > The diff coverage is `72.95%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1112/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1112      +/-   ##
   ==========================================
   - Coverage   82.34%   82.28%   -0.07%     
   ==========================================
     Files         168      168              
     Lines       49479    49670     +191     
   ==========================================
   + Hits        40743    40870     +127     
   - Misses       8736     8800      +64     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `88.10% <72.78%> (-2.48%)` | :arrow_down: |
   | [arrow/src/csv/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi93cml0ZXIucnM=) | `72.13% <100.00%> (+0.10%)` | :arrow_up: |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `90.87% <0.00%> (-2.60%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.38% <0.00%> (-0.43%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.52% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.69% <0.00%> (+0.13%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.68% <0.00%> (+0.30%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [2ad99ec...2e91790](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#issuecomment-1003240369


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1112](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (856cf39) into [master](https://codecov.io/gh/apache/arrow-rs/commit/2ad99ec33c757d6c7b4a5e69e19a4a096813b53b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2ad99ec) will **decrease** coverage by `0.06%`.
   > The diff coverage is `72.95%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1112/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1112      +/-   ##
   ==========================================
   - Coverage   82.34%   82.28%   -0.07%     
   ==========================================
     Files         168      168              
     Lines       49479    49670     +191     
   ==========================================
   + Hits        40743    40870     +127     
   - Misses       8736     8800      +64     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/csv/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi9yZWFkZXIucnM=) | `88.10% <72.78%> (-2.48%)` | :arrow_down: |
   | [arrow/src/csv/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2Nzdi93cml0ZXIucnM=) | `72.13% <100.00%> (+0.10%)` | :arrow_up: |
   | [arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb21wYXJpc29uLnJz) | `90.87% <0.00%> (-2.60%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1112/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.38% <0.00%> (-0.43%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [2ad99ec...856cf39](https://codecov.io/gh/apache/arrow-rs/pull/1112?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777230438



##########
File path: arrow/src/csv/reader.rs
##########
@@ -696,6 +738,39 @@ impl Parser for Date64Type {
             _ => None,
         }
     }
+
+    fn parse_formatted(string: &str, format: &str) -> Option<i64> {
+        match Self::DATA_TYPE {
+            DataType::Date64 => {
+                use chrono::format::Fixed;
+                use chrono::format::StrftimeItems;
+                let fmt = StrftimeItems::new(format);
+                let has_zone = fmt.into_iter().any(|item| match item {

Review comment:
       That is a cool idea -- I wonder if we could implement `Date64TypeWithZone` or something -- or move the `has_zone` attribute into `Date64Type` directly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777108645



##########
File path: arrow/src/csv/reader.rs
##########
@@ -520,47 +554,60 @@ fn parse(
                 DataType::Decimal(precision, scale) => {
                     build_decimal_array(line_number, rows, i, *precision, *scale)
                 }
-                DataType::Int8 => build_primitive_array::<Int8Type>(line_number, rows, i),
+                DataType::Int8 => {
+                    build_primitive_array::<Int8Type>(line_number, rows, i, None)
+                }
                 DataType::Int16 => {
-                    build_primitive_array::<Int16Type>(line_number, rows, i)
+                    build_primitive_array::<Int16Type>(line_number, rows, i, None)
                 }
                 DataType::Int32 => {
-                    build_primitive_array::<Int32Type>(line_number, rows, i)
+                    build_primitive_array::<Int32Type>(line_number, rows, i, None)
                 }
                 DataType::Int64 => {
-                    build_primitive_array::<Int64Type>(line_number, rows, i)
+                    build_primitive_array::<Int64Type>(line_number, rows, i, None)
                 }
                 DataType::UInt8 => {
-                    build_primitive_array::<UInt8Type>(line_number, rows, i)
+                    build_primitive_array::<UInt8Type>(line_number, rows, i, None)
                 }
                 DataType::UInt16 => {
-                    build_primitive_array::<UInt16Type>(line_number, rows, i)
+                    build_primitive_array::<UInt16Type>(line_number, rows, i, None)
                 }
                 DataType::UInt32 => {
-                    build_primitive_array::<UInt32Type>(line_number, rows, i)
+                    build_primitive_array::<UInt32Type>(line_number, rows, i, None)
                 }
                 DataType::UInt64 => {
-                    build_primitive_array::<UInt64Type>(line_number, rows, i)
+                    build_primitive_array::<UInt64Type>(line_number, rows, i, None)
                 }
                 DataType::Float32 => {
-                    build_primitive_array::<Float32Type>(line_number, rows, i)
+                    build_primitive_array::<Float32Type>(line_number, rows, i, None)
                 }
                 DataType::Float64 => {
-                    build_primitive_array::<Float64Type>(line_number, rows, i)
+                    build_primitive_array::<Float64Type>(line_number, rows, i, None)
                 }
                 DataType::Date32 => {
-                    build_primitive_array::<Date32Type>(line_number, rows, i)
-                }
-                DataType::Date64 => {
-                    build_primitive_array::<Date64Type>(line_number, rows, i)
+                    build_primitive_array::<Date32Type>(line_number, rows, i, None)
                 }
-                DataType::Timestamp(TimeUnit::Microsecond, _) => build_primitive_array::<
-                    TimestampMicrosecondType,
-                >(
-                    line_number, rows, i
+                DataType::Date64 => build_primitive_array::<Date64Type>(
+                    line_number,
+                    rows,
+                    i,
+                    datetime_format.clone(),
                 ),
+                DataType::Timestamp(TimeUnit::Microsecond, _) => {
+                    build_primitive_array::<TimestampMicrosecondType>(
+                        line_number,
+                        rows,
+                        i,
+                        None,

Review comment:
       `datefime_format`?

##########
File path: arrow/src/csv/reader.rs
##########
@@ -316,6 +323,8 @@ pub struct Reader<R: Read> {
     batch_size: usize,
     /// Vector that can hold the `StringRecord`s of the batches
     batch_records: Vec<StringRecord>,
+    /// datetime format used to parse datetime values, (format understood by chrono)

Review comment:
       Can you please provide the link to the appropriate chrono documentation?

##########
File path: arrow/src/csv/reader.rs
##########
@@ -520,47 +554,60 @@ fn parse(
                 DataType::Decimal(precision, scale) => {
                     build_decimal_array(line_number, rows, i, *precision, *scale)
                 }
-                DataType::Int8 => build_primitive_array::<Int8Type>(line_number, rows, i),
+                DataType::Int8 => {
+                    build_primitive_array::<Int8Type>(line_number, rows, i, None)
+                }
                 DataType::Int16 => {
-                    build_primitive_array::<Int16Type>(line_number, rows, i)
+                    build_primitive_array::<Int16Type>(line_number, rows, i, None)
                 }
                 DataType::Int32 => {
-                    build_primitive_array::<Int32Type>(line_number, rows, i)
+                    build_primitive_array::<Int32Type>(line_number, rows, i, None)
                 }
                 DataType::Int64 => {
-                    build_primitive_array::<Int64Type>(line_number, rows, i)
+                    build_primitive_array::<Int64Type>(line_number, rows, i, None)
                 }
                 DataType::UInt8 => {
-                    build_primitive_array::<UInt8Type>(line_number, rows, i)
+                    build_primitive_array::<UInt8Type>(line_number, rows, i, None)
                 }
                 DataType::UInt16 => {
-                    build_primitive_array::<UInt16Type>(line_number, rows, i)
+                    build_primitive_array::<UInt16Type>(line_number, rows, i, None)
                 }
                 DataType::UInt32 => {
-                    build_primitive_array::<UInt32Type>(line_number, rows, i)
+                    build_primitive_array::<UInt32Type>(line_number, rows, i, None)
                 }
                 DataType::UInt64 => {
-                    build_primitive_array::<UInt64Type>(line_number, rows, i)
+                    build_primitive_array::<UInt64Type>(line_number, rows, i, None)
                 }
                 DataType::Float32 => {
-                    build_primitive_array::<Float32Type>(line_number, rows, i)
+                    build_primitive_array::<Float32Type>(line_number, rows, i, None)
                 }
                 DataType::Float64 => {
-                    build_primitive_array::<Float64Type>(line_number, rows, i)
+                    build_primitive_array::<Float64Type>(line_number, rows, i, None)
                 }
                 DataType::Date32 => {
-                    build_primitive_array::<Date32Type>(line_number, rows, i)
-                }
-                DataType::Date64 => {
-                    build_primitive_array::<Date64Type>(line_number, rows, i)
+                    build_primitive_array::<Date32Type>(line_number, rows, i, None)

Review comment:
       should this also have `datetime_format`?

##########
File path: arrow/src/csv/reader.rs
##########
@@ -520,47 +554,60 @@ fn parse(
                 DataType::Decimal(precision, scale) => {
                     build_decimal_array(line_number, rows, i, *precision, *scale)
                 }
-                DataType::Int8 => build_primitive_array::<Int8Type>(line_number, rows, i),
+                DataType::Int8 => {
+                    build_primitive_array::<Int8Type>(line_number, rows, i, None)
+                }
                 DataType::Int16 => {
-                    build_primitive_array::<Int16Type>(line_number, rows, i)
+                    build_primitive_array::<Int16Type>(line_number, rows, i, None)
                 }
                 DataType::Int32 => {
-                    build_primitive_array::<Int32Type>(line_number, rows, i)
+                    build_primitive_array::<Int32Type>(line_number, rows, i, None)
                 }
                 DataType::Int64 => {
-                    build_primitive_array::<Int64Type>(line_number, rows, i)
+                    build_primitive_array::<Int64Type>(line_number, rows, i, None)
                 }
                 DataType::UInt8 => {
-                    build_primitive_array::<UInt8Type>(line_number, rows, i)
+                    build_primitive_array::<UInt8Type>(line_number, rows, i, None)
                 }
                 DataType::UInt16 => {
-                    build_primitive_array::<UInt16Type>(line_number, rows, i)
+                    build_primitive_array::<UInt16Type>(line_number, rows, i, None)
                 }
                 DataType::UInt32 => {
-                    build_primitive_array::<UInt32Type>(line_number, rows, i)
+                    build_primitive_array::<UInt32Type>(line_number, rows, i, None)
                 }
                 DataType::UInt64 => {
-                    build_primitive_array::<UInt64Type>(line_number, rows, i)
+                    build_primitive_array::<UInt64Type>(line_number, rows, i, None)
                 }
                 DataType::Float32 => {
-                    build_primitive_array::<Float32Type>(line_number, rows, i)
+                    build_primitive_array::<Float32Type>(line_number, rows, i, None)
                 }
                 DataType::Float64 => {
-                    build_primitive_array::<Float64Type>(line_number, rows, i)
+                    build_primitive_array::<Float64Type>(line_number, rows, i, None)
                 }
                 DataType::Date32 => {
-                    build_primitive_array::<Date32Type>(line_number, rows, i)
-                }
-                DataType::Date64 => {
-                    build_primitive_array::<Date64Type>(line_number, rows, i)
+                    build_primitive_array::<Date32Type>(line_number, rows, i, None)
                 }
-                DataType::Timestamp(TimeUnit::Microsecond, _) => build_primitive_array::<
-                    TimestampMicrosecondType,
-                >(
-                    line_number, rows, i
+                DataType::Date64 => build_primitive_array::<Date64Type>(
+                    line_number,
+                    rows,
+                    i,
+                    datetime_format.clone(),
                 ),
+                DataType::Timestamp(TimeUnit::Microsecond, _) => {
+                    build_primitive_array::<TimestampMicrosecondType>(
+                        line_number,
+                        rows,
+                        i,
+                        None,
+                    )
+                }
                 DataType::Timestamp(TimeUnit::Nanosecond, _) => {
-                    build_primitive_array::<TimestampNanosecondType>(line_number, rows, i)
+                    build_primitive_array::<TimestampNanosecondType>(
+                        line_number,
+                        rows,
+                        i,
+                        None,

Review comment:
       also here?

##########
File path: arrow/src/csv/reader.rs
##########
@@ -1041,6 +1116,20 @@ impl ReaderBuilder {
         self
     }
 
+    /// Set the datetime regex used to parse the string to Date64Type
+    /// this regex is used while infering schema
+    pub fn with_datetime_re(mut self, datetime_re: Regex) -> Self {
+        self.datetime_re = Some(datetime_re);
+        self
+    }
+
+    /// Set the datetime regex used to parse the string to Date64Type

Review comment:
       Can you also add a link here to the datetime format docs that are understood




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sum12 commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
sum12 commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777115561



##########
File path: arrow/src/csv/reader.rs
##########
@@ -1522,18 +1624,26 @@ mod tests {
 
     #[test]
     fn test_infer_field_schema() {
-        assert_eq!(infer_field_schema("A"), DataType::Utf8);
-        assert_eq!(infer_field_schema("\"123\""), DataType::Utf8);
-        assert_eq!(infer_field_schema("10"), DataType::Int64);
-        assert_eq!(infer_field_schema("10.2"), DataType::Float64);
-        assert_eq!(infer_field_schema(".2"), DataType::Float64);
-        assert_eq!(infer_field_schema("2."), DataType::Float64);
-        assert_eq!(infer_field_schema("true"), DataType::Boolean);
-        assert_eq!(infer_field_schema("false"), DataType::Boolean);
-        assert_eq!(infer_field_schema("2020-11-08"), DataType::Date32);
-        assert_eq!(infer_field_schema("2020-11-08T14:20:01"), DataType::Date64);
-        assert_eq!(infer_field_schema("-5.13"), DataType::Float64);
-        assert_eq!(infer_field_schema("0.1300"), DataType::Float64);
+        assert_eq!(infer_field_schema("A", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("\"123\"", None), DataType::Utf8);
+        assert_eq!(infer_field_schema("10", None), DataType::Int64);
+        assert_eq!(infer_field_schema("10.2", None), DataType::Float64);
+        assert_eq!(infer_field_schema(".2", None), DataType::Float64);
+        assert_eq!(infer_field_schema("2.", None), DataType::Float64);
+        assert_eq!(infer_field_schema("true", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("false", None), DataType::Boolean);
+        assert_eq!(infer_field_schema("2020-11-08", None), DataType::Date32);
+        assert_eq!(
+            infer_field_schema("2020-11-08T14:20:01", None),
+            DataType::Date64
+        );
+        let reg = Regex::new(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d$").ok();

Review comment:
       Done :-) 
   Let me know if more are needed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sum12 commented on a change in pull request #1112: allow using custom datetime format for inference and parsing csv file

Posted by GitBox <gi...@apache.org>.
sum12 commented on a change in pull request #1112:
URL: https://github.com/apache/arrow-rs/pull/1112#discussion_r777116128



##########
File path: arrow/src/csv/reader.rs
##########
@@ -696,6 +738,39 @@ impl Parser for Date64Type {
             _ => None,
         }
     }
+
+    fn parse_formatted(string: &str, format: &str) -> Option<i64> {
+        match Self::DATA_TYPE {
+            DataType::Date64 => {
+                use chrono::format::Fixed;
+                use chrono::format::StrftimeItems;
+                let fmt = StrftimeItems::new(format);
+                let has_zone = fmt.into_iter().any(|item| match item {

Review comment:
       I would love to avoid evaluation of `has_zone` on very call. Not sure if I can have some "attributes" on this particular implementation of Parser/Date64Type and avoid having to re-evaluate it everytime




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org