You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/18 23:39:06 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8710: ARROW-10649: [Rust] Parse manually, remove lazy static dependency

Dandandan opened a new pull request #8710:
URL: https://github.com/apache/arrow/pull/8710


   Changes infer_field_schema to parse manually.
   
   lazy_static is no longer needed as dependency.
   
   Probably could be a bit faster as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-739739060


   Let's park this one for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

jorgecarleitao edited a comment on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731554302


   * Converting `CSV -> StringArray -> [Type]Array` is not recommended, as it forces us to load everything in memory, even if there are shorter representations. Therefore, really need a way to build arrays out of CSV columns.
   
   * CSV is parsed as rows, but arrow is column-based. Therefore, there will need to be a pivot of the data at some point.
   
   My feeling is that there are wildly different specs out there into how we should convert a CSV column into an Array. IMO we should not try to solve all those use-cases ourselves and instead offer users the freedom to choose, as well as common utilities.
   
   As such, one idea is to offer a way to plugin that allow users to parse CSV column into `[Type]Array`, and offer a default offering.
   
   Since these are stateless, one simple idea is have the CSV reader accept a trait with two functions:
   
   ```rust
   infer: Fn(rows: &[StringRecord]) -> Vec<Option<DataType>>;
   convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   # or something like this
   ```
   
   This signature indicates that:
   
   1. The function transverses rows
   2. the function is falible
   3. the resulting array is dynamic
   
   This allows the user to e.g. make unparsable rows as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526508067



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();

Review comment:
       Would be good in that case to have a benchmark.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Jibbow commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Jibbow commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731550846


   Sounds good!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526506545



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();

Review comment:
       It could be avoided (e.g. iterating over chars and doing comparison at the same time) but not sure how much it avoids.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730070538


   Before we remove `lazy_static`, how would we also remove it in #8611? CC @Jibbow 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526512058



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();
+    if lower == "true" || lower == "false" {
+        return DataType::Boolean;
+    }
+    let skip_minus = if string.starts_with('-') { 1 } else { 0 };

Review comment:
       Added




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526520746



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();

Review comment:
       Also uses `eq_ignore_ascii_case` (which doesn't allocate) now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730331585


   > Before we remove `lazy_static`, how would we also remove it in #8611? CC @Jibbow
   
   I think it can reuse the structure here and also use the all digit function.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730028364


   https://issues.apache.org/jira/browse/ARROW-10649


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730034325


   CI failure seems unrelated: https://github.com/apache/arrow/pull/8710/checks?check_run_id=1421269108
   ```
   Post job cleanup.
   /bin/tar -cz -f /home/runner/work/_temp/0ba3abcf-1c45-4bd4-b0d9-f3334c9cc174/cache.tgz -C /home/runner/work/arrow/arrow/.docker .
   Warning: Cache service responded with 400 during chunk upload.
   events.js:187
         throw er; // Unhandled 'error' event
         ^
   
   ```
   
   restarting


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526502537



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();

Review comment:
       I think `to_ascii_lowercase` introduces a copy. I wonder if we are worried about the costs of doing so (I don't know if there are good benchmarks for the CSV parser anywhere)

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();
+    if lower == "true" || lower == "false" {
+        return DataType::Boolean;
+    }
+    let skip_minus = if string.starts_with('-') { 1 } else { 0 };

Review comment:
       One thing I I wonder is does this code handle invalid data like `12.12.12` (aka if `split` returns more than 1 decimal)?
   
   But given that this is just code for inferring schema, it is probably ok if that gets identified incorrect as a float...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730390464


   > `regex` + `lazy_static` is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using `all_digit()` and manually dissecting the string, we could also apply `chrono`'s `parse()` function which returns a `ParseResult` and check whether parsing was successful or not. Opinions on that?
   
   I think for the parsing (not recognizing) the dates itself that makes sense. I think parsing usually is slower than only matching it so that might be something to consider here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan closed pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan closed pull request #8710:
URL: https://github.com/apache/arrow/pull/8710


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731554302


   * Converting `CSV -> StringArray -> [Type]Array` is not recommended, as it forces us to load everything in memory, even if there are shorter representations. Therefore, really need a way to build arrays out of CSV columns.
   
   * CSV is parsed as rows, but arrow is column-based. Therefore, there will need to be a pivot of the data at some point.
   
   My feeling is that there are wildly different specs out there into how we should convert a CSV column into an Array. IMO we should not try to solve all those use-cases ourselves and instead offer users the freedom to choose, as well as common utilities.
   
   As such, one idea is to offer a way to plugin that allow users to parse CSV column into `[Type]Array`, and offer a default offering.
   
   Since these are stateless, one simple idea is have the CSV reader accept a trait with two functions:
   
   ```rust
   infer: Fn(rows: &[StringRecord], col_idx: usize) -> DataType;
   convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   # or something like this
   ```
   
   This signature indicates that:
   
   1. The function transverses rows
   2. the function is falible
   3. the resulting array is dynamic
   
   This allows the user to e.g. make unparsable rows as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

jorgecarleitao edited a comment on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731554302


   * Converting `CSV -> StringArray -> [Type]Array` is not recommended, as it forces us to load everything in memory, even if there are shorter representations. Therefore, we really need a way to build arrays out of CSV columns.
   
   * CSV is parsed as rows, but arrow is column-based. Therefore, there will need to be a pivot of the data at some point.
   
   My feeling is that there are wildly different specs out there into how we should convert a CSV column into an Array. IMO we should not try to solve all those use-cases ourselves and instead offer users the freedom to choose, as well as common utilities.
   
   As such, one idea is to offer a way to plugin that allow users to parse CSV column into `[Type]Array`, and offer a default offering.
   
   Since these are stateless, one simple idea is have the CSV reader accept a trait with two functions:
   
   ```rust
   infer: Fn(rows: &[StringRecord]) -> Vec<Option<DataType>>;
   convert: Fn(data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   # or something like this
   ```
   
   This signature indicates that:
   
   1. The function transverses rows
   2. the function is falible
   3. the resulting array is dynamic
   
   This allows the user to e.g. make unparsable rows as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730330984


   Related:https://github.com/apache/arrow/pull/8714


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nevi-me commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

nevi-me commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731545653


   @alamb @jorgecarleitao should we complete #8611 first, so that we don't remove `lazy_static` and `regex` if we can't find an alternative for dates?
   We could also in future take a similar approach to C++ and Pandas, where the user is expected to provide a list of columns that should be parsed as temporal types, and the formats that should be used. I think this is more flexible, and it would allow us to support `Time` by parsing `HH:mm:{ss}`.
   
   Your thoughts @Jibbow @Dandandan ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730034088


   Thank you @Dandandan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Jibbow commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Jibbow commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-730353755


   `regex` + `lazy_static` is somewhat a nice combination, but I agree that we could also recognize dates without those two libraries. But instead of using `all_digit()` and manually dissecting the string, we could also apply `chrono`'s `parse()` function which returns a `ParseResult` and check whether parsing was successful or not. Opinions on that?
   Also, I like #8714


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731551129


   Sure!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526508924



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();
+    if lower == "true" || lower == "false" {
+        return DataType::Boolean;
+    }
+    let skip_minus = if string.starts_with('-') { 1 } else { 0 };

Review comment:
       Yeah... check could be added if it's too inclusive, I think easily using https://doc.rust-lang.org/std/primitive.str.html#method.splitn




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org