You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/08 17:28:39 UTC

[GitHub] [arrow] Jibbow opened a new pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Jibbow opened a new pull request #8611:
URL: https://github.com/apache/arrow/pull/8611


   This PR partly handles the issue https://issues.apache.org/jira/browse/ARROW-4804.
   
   Things to discuss:
    1. `Date32` currently assumes to have `DateUnit::Day` and `Date64` assumes `DateUnit::Millisecond` respectively.
    2. Dates have to be in ISO format `YYY-MM-DD` or `YYY-MM-DDTHH:MM:SS`. I'm not sure if this is a reasonable assumption. It should also be noted that the C++ implementation does not strictly follow the ISO format because it uses `YYY-MM-DD HH:MM:SS` instead (without the `T`). The difference is also a point that has to be discussed before merging.
    3. I'm using `std::mem::transmute_copy` to convert `i32` and `i64` to `T::Native`. I struggled with Rust's type system here and I'd appreciate suggestions on how to do this in a less-unsafe way.
   
   Things that are not part of this PR:
    1. Handling of other temporal types like timestamp and time.
    2. No list of columns for temporal types (reference: "To keep inference performant. user should provide a Vec<String> of which columns to try convert to a temporal array")


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jibbow commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
Jibbow commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-750307456


   Sorry for being so quiet the last weeks. I had a research paper deadline coming up. Thanks @Dandandan for taking over!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-724690117


   > Date32 currently assumes to have DateUnit::Day and Date64 assumes DateUnit::Millisecond respectively.
    
   I think this is fine as long as reasonable errors (unsupported XXX) are produced if alternate types are used


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-723641228


   https://issues.apache.org/jira/browse/ARROW-4804


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jibbow commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
Jibbow commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-724606596


   Thanks for the feedback @vertexclique and @jorgecarleitao!
   I'll update the PR when #8609 is merged.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-744086916


   > Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892
   > 
   > @Jibbow any plans on finishing the PR? Would be really nice to add!
   
   Sorry that I broke this. I had been running my benchmarks with Parquet so did not notice. I am working with Andy to uplift the tests to catch this kind of issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#discussion_r520530077



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;

Review comment:
       Note there is also code to convert strings to nanosecond timestamps (`string_to_timestamp_nanos`) here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/datetime_expressions.rs#L30 

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;
+    let since = chrono::NaiveDate::signed_duration_since;
+
+    match T::DATA_TYPE {
+        DataType::Boolean => s
+            .to_lowercase()
+            .parse::<T::Native>()
+            .map_err(|_| ArrowError::ParseError("Error parsing boolean".to_string())),
+        DataType::Date32(DateUnit::Day) => {
+            let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
+                .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
+            days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })

Review comment:
       Another alternative would be to extend the `ArrowNativeType` with `from_i32` and `from_i64`, following the model of `from_usizeAnd then implement those functions for i32 and i64 respectively (as those are the underlying native types)
   
   I tried this approach out on a branch in case you are interested / want to take the change: 
   Commit with change: https://github.com/alamb/arrow/commit/cc61e7a3fd55fcbc00f400be1ae74d64d7de6d21
   
   The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-745588487


   FYI we have incorporated this code in via https://github.com/apache/arrow/pull/8913 - thanks for the help @Jibbow . I am going to close this PR for now as it has a conflict and we are trying to clean up the PR queue -- please let me know if that was a mistake. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
vertexclique commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-724324997


   Truly it is. Otherwise platform native primitives mightn't fit into the return types.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8611:
URL: https://github.com/apache/arrow/pull/8611


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#discussion_r520319490



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;
+    let since = chrono::NaiveDate::signed_duration_since;
+
+    match T::DATA_TYPE {
+        DataType::Boolean => s
+            .to_lowercase()
+            .parse::<T::Native>()
+            .map_err(|_| ArrowError::ParseError("Error parsing boolean".to_string())),
+        DataType::Date32(DateUnit::Day) => {
+            let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
+                .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
+            days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })

Review comment:
       what @vertexclique said: transmute is one of the most unsafe operations in rust, and this can easily lead to undefined behavior if it overflows.

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -67,6 +67,9 @@ lazy_static! {
         .case_insensitive(true)
         .build()
         .unwrap();
+    static ref DATE_RE: Regex = Regex::new(r"^\d\d\d\d-\d\d-\d\d$").unwrap();

Review comment:
       isn't there a `\d{4}` or something like that? May make it a bit easier to read and more expressive, IMO




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-724734449


   @alamb I can't respond to your comment.
   I see you're using `T::Native::from_i32(since(days, from_ymd(1970, 1, 1)).num_days() as i32)`. Shouldn't there be a way for us to get the desired value from `chrono` without introducing a fallible cast/conversion?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-731551368


   Hi @Jibbow , I think we need not wait for #8609 before you can update this PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#discussion_r520597966



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;

Review comment:
       In the long run, this is motivation to centralise this into temporal kernels; so we can share this with the JSON reader.
   One of the things I'll submit a PR for in the coming days/weeks, is a crate that parses strings to numbers faster than what libcore does.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#discussion_r520546828



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;
+    let since = chrono::NaiveDate::signed_duration_since;
+
+    match T::DATA_TYPE {
+        DataType::Boolean => s
+            .to_lowercase()
+            .parse::<T::Native>()
+            .map_err(|_| ArrowError::ParseError("Error parsing boolean".to_string())),
+        DataType::Date32(DateUnit::Day) => {
+            let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
+                .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
+            days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })

Review comment:
       Another alternative would be to extend the `ArrowNativeType` with `from_i32` and `from_i64`, following the model of `from_usize` and then implement those functions for i32 and i64 respectively (as those are the underlying native types)
   
   I tried this approach out on a branch in case you are interested / want to take the change: 
   Commit with change: https://github.com/alamb/arrow/commit/cc61e7a3fd55fcbc00f400be1ae74d64d7de6d21
   
   The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-744038605


   Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: https://github.com/apache/arrow/pull/8892
   
   @Jibbow any plans on finishing the PR? Would be really nice to add!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-740401635


   @Jibbow, are you blocked or need help here? Let us know if that is the case so that we can help.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#discussion_r528190149



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -219,6 +226,35 @@ pub fn infer_schema_from_files(
     Schema::try_merge(&schemas)
 }
 
+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;
+    let since = chrono::NaiveDate::signed_duration_since;
+
+    match T::DATA_TYPE {

Review comment:
       I think this will benefit from changes in this PR to include a trait.
   https://github.com/apache/arrow/pull/8714




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jibbow commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
Jibbow commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-723641485


   Just saw https://github.com/apache/arrow/pull/8609 and `num_cast()` might actually be the thing I was looking for to convert `i64` to `T::Native`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8611: ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8611:
URL: https://github.com/apache/arrow/pull/8611#issuecomment-724756612


   @nevi-me  -- good call -- I was simply copy/pasting what was in this PR in terms of `as i32` without looking carefully enough. I updated https://github.com/alamb/arrow/tree/alamb/less-unsafe with a new commit that doesn't use `as i32` and instead uses `try_into`()
   
   https://github.com/apache/arrow/commit/86741de18ad148589873cf43c7e0d4ea7f9c5233


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org