Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/19 21:10:00 UTC

[GitHub] [arrow] nevi-me commented on a change in pull request #7210: ARROW-8839: [Rust] [DataFusion] support CSV schema inference in logical plan

nevi-me commented on a change in pull request #7210:
URL: https://github.com/apache/arrow/pull/7210#discussion_r427600229



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;

Review comment:
       What would happen if the first file is empty, but other files aren't? A good follow-up could be iterating through the files until an optional `max_inference: usize` record count is reached, so that the desired sample size is actually met (see the sketch below).
   
   I've personally encountered some CSV files where the first k records conformed to a schema, but things broke down thereafter, especially with files that aren't generated by a uniformity-preserving tool like Spark or a database.
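
   For illustration, a minimal sketch of skipping empty files, assuming `infer_file_schema` returns a `Result<Schema>` and keeps a `has_header` parameter as renamed in this diff (the helper name and error message are made up):

```rust
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::csv;
use arrow::datatypes::Schema;
use arrow::error::{ArrowError, Result};

// Hypothetical helper: skip empty files and infer from the first file that
// actually has content, instead of failing when filenames[0] is empty.
fn infer_from_first_nonempty(
    filenames: &[String],
    delimiter: u8,
    max_read_records: Option<usize>,
    has_header: bool,
) -> Result<Arc<Schema>> {
    for filename in filenames {
        let f = File::open(filename)?;
        if f.metadata()?.len() == 0 {
            // Empty file: move on instead of inferring an empty schema.
            continue;
        }
        let mut reader = BufReader::new(f);
        let schema =
            csv::infer_file_schema(&mut reader, delimiter, max_read_records, has_header)?;
        return Ok(Arc::new(schema));
    }
    Err(ArrowError::ParseError("all input files are empty".to_string()))
}
```

   A fuller version could keep a running record count and continue sampling across files until `max_read_records` is satisfied, merging the per-file schemas.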

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -762,10 +778,20 @@ mod tests {
         let mut csv = builder.build(file).unwrap();
         match csv.next() {
             Err(e) => assert_eq!(
-                "ParseError(\"Error while parsing value 4.x4 at line 4\")",
+                "ParseError(\"Error while parsing value 4.x4 for column 1 at line 4\")",
                 format!("{:?}", e)
             ),
             Ok(_) => panic!("should have failed"),
         }
     }
+
+    #[test]
+    fn test_infer_field_schema() {

Review comment:
       Thanks for adding this :)

##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;
+                Arc::new(csv::infer_file_schema(
+                    &mut BufReader::new(f),
+                    delimiter.unwrap_or(b','),
+                    Some(30),

Review comment:
       Using a sample size that's too small for inference could cause issues for some users, especially as it's not yet configurable. How about 1000?
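
   As a minimal sketch of making the limit configurable, assuming the same `infer_file_schema` signature as above (the `max_inference` parameter and the 1000 default are assumptions, not something this PR adds):

```rust
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::csv;
use arrow::datatypes::Schema;
use arrow::error::Result;

// Hypothetical: take the inference limit from the caller instead of
// hard-coding Some(30), falling back to a larger default when unset.
fn infer_schema_with_limit(
    filename: &str,
    delimiter: Option<u8>,
    max_inference: Option<usize>,
    has_header: bool,
) -> Result<Arc<Schema>> {
    let mut reader = BufReader::new(File::open(filename)?);
    let schema = csv::infer_file_schema(
        &mut reader,
        delimiter.unwrap_or(b','),
        max_inference.or(Some(1000)), // assumed default; 30 is a small sample
        has_header,
    )?;
    Ok(Arc::new(schema))
}
```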

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -87,19 +87,19 @@ fn infer_field_schema(string: &str) -> DataType {
 /// with `max_read_records` controlling the maximum number of records to read.
 ///
 /// If `max_read_records` is not set, the whole file is read to infer its schema.
-fn infer_file_schema<R: Read + Seek>(
+pub fn infer_file_schema<R: Read + Seek>(
     reader: &mut BufReader<R>,
     delimiter: u8,
     max_read_records: Option<usize>,
-    has_headers: bool,

Review comment:
       Should we rename `has_headers` to `has_header` consistently across the rest of the `csv` submodule?



