Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/19 00:03:07 UTC

[GitHub] [arrow] alamb commented on a change in pull request #8710: ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency

alamb commented on a change in pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#discussion_r526502537



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();

Review comment:
       I think `to_ascii_lowercase` allocates a new `String`, so it copies the input on every call. I wonder if we are worried about that cost? (I don't know if there are good benchmarks for the CSV parser anywhere.)
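
   For what it's worth, one way to avoid the copy might be `str::eq_ignore_ascii_case`, which compares in place. A minimal sketch, where the helper name `is_boolean_literal` is hypothetical and not part of this PR:

```rust
/// Hypothetical helper (not from the PR): case-insensitive check for the
/// literals "true" / "false" that compares in place rather than building
/// a lowercased copy of the input.
fn is_boolean_literal(s: &str) -> bool {
    s.eq_ignore_ascii_case("true") || s.eq_ignore_ascii_case("false")
}
```

   Whether this matters in practice would depend on a benchmark; the sketch only shows that the allocation is avoidable.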

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -77,15 +70,20 @@ fn infer_field_schema(string: &str) -> DataType {
         return DataType::Utf8;
     }
     // match regex in a particular order
-    if BOOLEAN_RE.is_match(string) {
-        DataType::Boolean
-    } else if DECIMAL_RE.is_match(string) {
-        DataType::Float64
-    } else if INTEGER_RE.is_match(string) {
-        DataType::Int64
-    } else {
-        DataType::Utf8
+    let lower = string.to_ascii_lowercase();
+    if lower == "true" || lower == "false" {
+        return DataType::Boolean;
+    }
+    let skip_minus = if string.starts_with('-') { 1 } else { 0 };

Review comment:
       One thing I wonder is whether this code handles invalid data like `12.12.12` (i.e., when `split` finds more than one decimal point)?
   
   But given that this is just code for inferring schema, it is probably OK if that gets incorrectly identified as a float...
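
   If we did want to reject such input, a stricter check is possible without a regex. This is only a sketch under my own assumptions: it treats a decimal as an optional leading `-`, then digits, exactly one `.`, then digits, and the helper name `looks_like_decimal` is hypothetical:

```rust
/// Hypothetical helper (not from the PR): returns true only for an optional
/// leading '-', followed by digits, exactly one '.', and more digits.
fn looks_like_decimal(s: &str) -> bool {
    // Drop a single leading minus sign, if present.
    let unsigned = s.strip_prefix('-').unwrap_or(s);
    let mut parts = unsigned.split('.');
    // Exactly two parts means exactly one decimal point.
    match (parts.next(), parts.next(), parts.next()) {
        (Some(int_part), Some(frac_part), None) => {
            !int_part.is_empty()
                && !frac_part.is_empty()
                && int_part.bytes().all(|b| b.is_ascii_digit())
                && frac_part.bytes().all(|b| b.is_ascii_digit())
        }
        // No decimal point at all, or more than one (e.g. "12.12.12").
        _ => false,
    }
}
```

   With this shape, `12.5` and `-1.0` pass, while `12.12.12`, `12`, and `1.` do not.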



