Posted to github@arrow.apache.org by "alexhallam (via GitHub)" <gi...@apache.org> on 2023/02/22 12:00:17 UTC

[GitHub] [arrow-rs] alexhallam commented on issue #3744: CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64

alexhallam commented on issue #3744:
URL: https://github.com/apache/arrow-rs/issues/3744#issuecomment-1439903773

   I came to make the same comment.
   
   Here is a sample CSV:
   
   
   ```
   "","origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib","time_hour"
   "1","EWR",2013,1,1,1,39.02,26.06,59.37,270,10.35702,NA,0,1012,10,2013-01-01 01:00:00
   "2","EWR",2013,1,1,2,39.02,26.96,61.63,250,8.05546,NA,0,1012.3,10,2013-01-01 02:00:00
   "3","EWR",2013,1,1,3,39.02,28.04,64.43,240,11.5078,NA,0,1012.5,10,2013-01-01 03:00:00
   ```
   
   The last column, `time_hour`, is inferred as `Date64` but then fails to parse.
   
   # Example
   
   Here is some code that reproduces the issue.
   
   ```rust
   use arrow::datatypes::DataType::{
       Boolean, Date32, Date64, Float64, Int64, List, Time32, Time64, Timestamp, Utf8,
   };
   use arrow::record_batch::RecordBatch;
   use arrow_csv::reader;
   use std::fs::File;
   use std::sync::Arc;
   
   fn main() {
       let path = "data/weather.csv".to_owned();
   
       // infer the schema using arrow_csv::reader (the delimiter byte b',' is 44)
       let schema = reader::infer_schema_from_files(&[path.clone()], b',', Some(1000), true)
           .expect("schema should be inferred");
   
       // for each field in the schema, match on the data type and collect a tag string into a `Vec<String>`
       let data_types: Vec<String> = schema
           .fields()
           .iter()
           .map(|field| {
               let data_type = field.data_type();
               match data_type {
                   Boolean => "<bool>".to_string(),
                   Int64 => "<int>".to_string(),
                   Float64 => "<dbl>".to_string(),
                   Utf8 => "<chr>".to_string(),
                   List(_) => "<list>".to_string(),
                   Date32 => "<date>".to_string(),
                   Date64 => "<date64>".to_string(),
                   Timestamp(_, _) => "<ts>".to_string(),
                   Time32(_) => "<time>".to_string(),
                   Time64(_) => "<time64>".to_string(),
                   _ => "<_>".to_string(),
               }
           })
           .collect();
   
       // print the data types
       println!("data types {:?}", data_types);
   
       let file = File::open(path).unwrap();
       let mut reader = reader::Reader::new(
           file,
           Arc::new(schema),
           true,
           Some(b','),
           1024,
           None,
           None,
           None,
       );

       // read the first record batch from the reader
       let record_batch: RecordBatch = reader.next().unwrap().unwrap();

       // print the record batch
       println!("{:?}", record_batch);
   }
   ```
   
   # The Error
   
   ```txt
   data types ["<int>", "<chr>", "<int>", "<int>", "<int>", "<int>", "<dbl>", "<dbl>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<date64>"]
   thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError("Error while parsing value 2013-01-01 02:00:00 for column 15 at line 2")'
   ```
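   
   As a possible workaround (not part of the original report), here is a minimal sketch that rebuilds the inferred schema and swaps any `Date64` field, i.e. the mis-inferred `time_hour` column, for a `Timestamp` before handing the schema to the reader. The `override_date64` helper is hypothetical, and whether the timestamp strings then parse depends on the arrow-csv version in use.
   
   ```rust
   use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
   
   // Hypothetical helper: copy the inferred schema, replacing Date64 fields
   // (which the CSV values cannot be parsed into) with second-resolution
   // timestamps. All other fields are carried over unchanged.
   fn override_date64(schema: &Schema) -> Schema {
       let fields: Vec<Field> = schema
           .fields()
           .iter()
           .map(|f| match f.data_type() {
               DataType::Date64 => Field::new(
                   f.name(),
                   DataType::Timestamp(TimeUnit::Second, None),
                   f.is_nullable(),
               ),
               other => Field::new(f.name(), other.clone(), f.is_nullable()),
           })
           .collect();
       Schema::new(fields)
   }
   ```
   
   The adjusted schema could then be passed as `Arc::new(override_date64(&schema))` in the `Reader::new` call above.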

