You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "sacundim (via GitHub)" <gi...@apache.org> on 2023/02/22 06:57:02 UTC

[GitHub] [arrow-rs] sacundim opened a new issue, #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

sacundim opened a new issue, #3744:
URL: https://github.com/apache/arrow-rs/issues/3744

   # Describe the bug
   
   Starting with version 0.6.1 of [csv2parquet](https://github.com/domoritz/csv2parquet) (which upgraded its arrow-rs dependency from 24.0 to 30.0.1), input CSV files that were converting just fine are now failing with errors like this one:
   
   ```
   Error: External(ParseError("Error while parsing value 2020-03-19 00:00:00 for column 4 at line 2"))
   ```
   
   The csv2parquet tool is a very thin wrapper around arrow-rs that basically
   
   1. Calls `arrow::csv::reader::infer_file_schema` to infer a schema for the input file;
   2. Builds an `arrow::csv::Reader` from that schema and uses it to read the file;
   3. Outputs a Parquet file.
   
   The tool has a command line option that prints out the inferred schemas, and input values like that are being inferred into Date64:
   
   ```
       {
         "name": "Sample Date",
         "data_type": "Utf8",
         "nullable": false,
         "dict_id": 0,
         "dict_is_ordered": false
       }
   ```
   
   And I've written a unit test case that demonstrates that 
   
   1. Arrow-rs is inferring Date64 type for input values like "2020-03-19 00:00:00"
   2. ...but attempting to parse them as Date64 fails
   
   # To Reproduce
   
   I wrote a couple of very simple test cases illustrating the problem, which I will shortly edit this report to link.
   
   # Expected behavior
   
   I see that the various Timestamp types are able to parse strings like that correctly, so I don't understand whether the more correct behavior for this library would be to infer a Timestamp type for these strings instead of the Date64 type. It does seem clear to me that however that it should be possible to parse these strings as Date64. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1443405375

   `label_issue.py` automatically added labels {'arrow'} from #3746


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alexhallam commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "alexhallam (via GitHub)" <gi...@apache.org>.
alexhallam commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1439982994

   @tustvold  I think that is sensible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1439938265

   I think the CSV reader should be inferring Timestamp for such columns, it is unclear to me that it should ever infer Date64, as the semantics of such a type are somewhat unclear, at least to me.
   
   I have posted an email to the arrow mailing list to try to clarify what the Date64 type is for - https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64
URL: https://github.com/apache/arrow-rs/issues/3744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alexhallam commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "alexhallam (via GitHub)" <gi...@apache.org>.
alexhallam commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1439903773

   I came to make the same comment.
   
   here is a sample csv
   
   
   ```
   "","origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib","time_hour"
   "1","EWR",2013,1,1,1,39.02,26.06,59.37,270,10.35702,NA,0,1012,10,2013-01-01 01:00:00
   "2","EWR",2013,1,1,2,39.02,26.96,61.63,250,8.05546,NA,0,1012.3,10,2013-01-01 02:00:00
   "3","EWR",2013,1,1,3,39.02,28.04,64.43,240,11.5078,NA,0,1012.5,10,2013-01-01 03:00:00
   ```
   
   The last column `time_hour` will not parse.
   
   # Example
   
   Here is some code.
   
   ```rust
   use arrow::array::{Array, StringArray};
   use arrow::compute::cast;
   use arrow::datatypes::DataType::{
       Boolean, Date32, Date64, Float64, Int64, Interval, List, Time32, Time64, Timestamp, Utf8,
   };
   use arrow::ipc::Time;
   use arrow::record_batch::RecordBatch;
   use arrow_csv::reader;
   use std::fs::File;
   
   fn main{
       let path = "data/weather.csv".to_owned();
   
       // infer the schema using arrow_csv::reader
       let schema = reader::infer_schema_from_files(&[path.clone()], 44, Some(1000), true);
       let schema_data_types = reader::infer_schema_from_files(&[path.clone()], 44, Some(1000), true);
   
       //for each feild in the schema, match on the data type and push a string to a vectotr `Vec<String>`.
       let data_types: Vec<String> = schema_data_types
           .expect("Schema should be infered")
           .fields()
           .iter()
           .map(|field| {
               let data_type = field.data_type();
               match data_type {
                   Boolean => "<bool>".to_string(),
                   Int64 => "<int>".to_string(),
                   Float64 => "<dbl>".to_string(),
                   Utf8 => "<chr>".to_string(),
                   List(_) => "<list>".to_string(),
                   Date32 => "<date>".to_string(),
                   Date64 => "<date64>".to_string(),
                   Timestamp(_, _) => "<ts>".to_string(),
                   Time32(_) => "<time>".to_string(),
                   Time64(_) => "<time64>".to_string(),
                   _ => "<_>".to_string(),
               }
           })
           .collect();
   
       // print the data types
       println!("data types {:?}", data_types);
   
       let file = File::open(path).unwrap();
       let mut reader = reader::Reader::new(
           file,
           Arc::new(schema.expect("Schema should be infered")),
           true,
           Some(44),
           1024,
           None,
           None,
           None,
       );
   }
       // convert reader to record batch
       let record_batch: RecordBatch = reader.next().unwrap().unwrap().clone();
   
       // print record batch
       println!("{:?}", record_batch);
   ```
   
   # The Error
   
   ```txt
   data types ["<int>", "<chr>", "<int>", "<int>", "<int>", "<int>", "<dbl>", "<dbl>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<date64>"]
   thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError("Error while parsing value 2013-01-01 02:00:00 for column 15 at line 2")'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alexhallam commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

Posted by "alexhallam (via GitHub)" <gi...@apache.org>.
alexhallam commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1439911727

   # Expected Behavior
   
   Does not matter much to me. I just need (what I would call a date time) "2013-01-01 02:00:00" to not error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org