Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/23 05:01:30 UTC

[GitHub] [arrow] houqp opened a new pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

houqp opened a new pull request #7252:
URL: https://github.com/apache/arrow/pull/7252


   This change adds an `infer_schema_from_files` function to the arrow CSV reader module. DataFusion's `CsvExec` struct now uses this function to infer the schema from multiple CSV files when needed.
   
   This is the second follow-up PR for #7210.
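   The budgeted multi-file inference described above can be sketched with a std-only toy. `infer_one` and `infer_many` below are hypothetical stand-ins for arrow's `infer_file_schema` / `infer_schema_from_files` (real inference also guesses column types from the sampled records); the point is how a record budget is spent across files in order:

   ```rust
   /// Toy stand-in for per-file inference: return the header columns and
   /// how many data records were consumed against the remaining budget.
   fn infer_one(csv: &str, budget: usize) -> (Vec<String>, usize) {
       let mut lines = csv.lines();
       let header: Vec<String> = lines
           .next()
           .unwrap_or("")
           .split(',')
           .map(|s| s.to_string())
           .collect();
       let read = lines.take(budget).count();
       (header, read)
   }

   /// Sketch of the multi-file loop: read files in the given order until
   /// the record budget is exhausted, collecting one schema per file.
   fn infer_many(files: &[&str], max_records: Option<usize>) -> Vec<Vec<String>> {
       let mut budget = max_records.unwrap_or(usize::MAX);
       let mut schemas = Vec::new();
       for csv in files {
           let (schema, read) = infer_one(csv, budget);
           if read == 0 {
               continue; // an empty file contributes nothing
           }
           schemas.push(schema);
           budget -= read;
           if budget == 0 {
               break; // budget exhausted: later files are never opened
           }
       }
       schemas
   }

   fn main() {
       let files = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n"];
       // Budget of 2 is spent entirely on the first file.
       assert_eq!(infer_many(&files, Some(2)).len(), 1);
       // No budget: every file is read fully.
       assert_eq!(infer_many(&files, None).len(), 2);
   }
   ```

   With `max_read_records` unset the budget defaults to `usize::MAX`, which matches the doc comment's "all files will be read fully" behavior.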


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430411298



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -172,7 +177,42 @@ pub fn infer_file_schema<R: Read + Seek>(
     // return the reader seek back to the start
     csv_reader.into_inner().seek(SeekFrom::Start(0))?;
 
-    Ok(Schema::new(fields))
+    Ok((Schema::new(fields), records_count))
+}
+
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut schemas = vec![];
+    let mut records_to_read = max_read_records.unwrap_or(std::usize::MAX);
+
+    for fname in files.iter() {
+        let (schema, records_read) = infer_file_schema(
+            &mut BufReader::new(File::open(fname)?),
+            delimiter,
+            Some(records_to_read),
+            has_header,
+        )?;
+        if records_read == 0 {
+            continue;
+        }
+        schemas.push(schema.clone());
+        records_to_read -= records_read;

Review comment:
       Just checking, we can never get `records_to_read` < 0, right?







[GitHub] [arrow] andygrove commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430050988



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
       I'm a little nervous about the idea of potentially loading all of the CSV files into memory. I'm also not sure how this will work when the schema varies between files. In the unit test, all files appear to have the same three columns. What if one of the files is missing "c2" and another file has additional "c4" and "c5" columns?
   
   What I think we'll ultimately need in DataFusion is a schema merging feature, so each CSV (or Parquet) file can have differences, but the final output schema will contain the superset.
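   The superset merge suggested above can be sketched with a std-only toy model. `Field` and `merge_schemas` here are hypothetical simplifications, not arrow's types; the sketch also folds in the stricter alternative raised later in this thread, returning an error when the same column name appears with two different types:

   ```rust
   use std::collections::BTreeMap;

   /// Toy field: a column name plus its inferred type name.
   #[derive(Clone, PartialEq, Debug)]
   struct Field {
       name: String,
       dtype: &'static str,
   }

   /// Merge per-file schemas into a superset: columns missing from some
   /// files are kept, and a name seen with two different types is an error.
   fn merge_schemas(schemas: &[Vec<Field>]) -> Result<Vec<Field>, String> {
       let mut merged: BTreeMap<String, Field> = BTreeMap::new();
       for schema in schemas {
           for field in schema {
               match merged.get(&field.name) {
                   None => {
                       merged.insert(field.name.clone(), field.clone());
                   }
                   Some(existing) if existing.dtype != field.dtype => {
                       return Err(format!(
                           "conflicting types for column {}: {} vs {}",
                           field.name, existing.dtype, field.dtype
                       ));
                   }
                   Some(_) => {} // same name, same type: nothing to do
               }
           }
       }
       Ok(merged.into_values().collect())
   }

   fn main() {
       let f = |n: &str, t: &'static str| Field { name: n.to_string(), dtype: t };
       // One file is missing "c2", another adds "c4": the superset keeps all four.
       let a = vec![f("c1", "Int64"), f("c3", "Utf8")];
       let b = vec![f("c1", "Int64"), f("c2", "Float64"), f("c4", "Utf8")];
       assert_eq!(merge_schemas(&[a, b]).unwrap().len(), 4);
       // A type conflict on "c1" is surfaced as an error.
       let c = vec![f("c1", "Utf8")];
       let d = vec![f("c1", "Int64")];
       assert!(merge_schemas(&[c, d]).is_err());
   }
   ```

   A fuller merge (like the one the PR eventually adds) would also reconcile compatible types, e.g. widening Int64 to Float64, rather than rejecting every mismatch.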

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
       Another option here is that this code returns an error if the files have differing schemas.







[GitHub] [arrow] houqp commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r432877858



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1289,6 +1417,46 @@ impl Schema {
         Self { fields, metadata }
     }
 
+    pub fn try_merge(schemas: &Vec<Self>) -> Result<Self> {

Review comment:
       Ha, good catch! I added a doc for this, but it was mistakenly attached to Field's try_merge method. I have fixed that and added a doc for Field's try_merge method as well.







[GitHub] [arrow] github-actions[bot] commented on pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#issuecomment-632986177


   https://issues.apache.org/jira/browse/ARROW-8906





[GitHub] [arrow] houqp commented on pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#issuecomment-633803661


   @andygrove I have added schema merge logic to the `Schema` and `Field` structs. As a result, we can now do schema inference per file, avoiding the need to load all records into memory.





[GitHub] [arrow] nevi-me commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r432878508



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1289,6 +1417,46 @@ impl Schema {
         Self { fields, metadata }
     }
 
+    pub fn try_merge(schemas: &Vec<Self>) -> Result<Self> {

Review comment:
       Thanks, I'll await a review from another committer, then we can merge.







[GitHub] [arrow] nevi-me commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r432873310



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1289,6 +1417,46 @@ impl Schema {
         Self { fields, metadata }
     }
 
+    pub fn try_merge(schemas: &Vec<Self>) -> Result<Self> {

Review comment:
       Could you please add a doc for this?







[GitHub] [arrow] nevi-me closed pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #7252:
URL: https://github.com/apache/arrow/pull/7252


   





[GitHub] [arrow] nevi-me commented on pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#issuecomment-637099283


   I'll merge this tomorrow evening GMT if there haven't been any further reviews. 





[GitHub] [arrow] houqp commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430062159



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -175,6 +176,60 @@ pub fn infer_file_schema<R: Read + Seek>(
     Ok(Schema::new(fields))
 }
 
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut buff = Cursor::new(Vec::new());

Review comment:
       Sounds good, I will change it to infer the schema per file, then merge at the end. If we don't have to seek across multiple files, we don't need to load all lines into memory.

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -172,7 +177,42 @@ pub fn infer_file_schema<R: Read + Seek>(
     // return the reader seek back to the start
     csv_reader.into_inner().seek(SeekFrom::Start(0))?;
 
-    Ok(Schema::new(fields))
+    Ok((Schema::new(fields), records_count))
+}
+
+/// Infer schema from a list of CSV files by reading through the first n records,
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order until n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut schemas = vec![];
+    let mut records_to_read = max_read_records.unwrap_or(std::usize::MAX);
+
+    for fname in files.iter() {
+        let (schema, records_read) = infer_file_schema(
+            &mut BufReader::new(File::open(fname)?),
+            delimiter,
+            Some(records_to_read),
+            has_header,
+        )?;
+        if records_read == 0 {
+            continue;
+        }
+        schemas.push(schema.clone());
+        records_to_read -= records_read;

Review comment:
       That's correct; it should always stay non-negative as long as `infer_file_schema` doesn't read more records than requested. I used `<=` below to exit early if anything unexpected happens. We could also return an error if it would ever drop below zero.
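       For illustration: since `records_to_read` is a `usize`, it can never literally be negative, but an over-read would underflow (panicking in debug builds, wrapping in release). A toy sketch (not the PR's code) using `checked_sub` turns that invariant into an explicit error path:

   ```rust
   fn main() {
       let mut records_to_read: usize = 10;
       for records_read in [4usize, 6, 3] {
           // checked_sub returns None on underflow instead of panicking,
           // turning "read more than requested" into an explicit failure.
           match records_to_read.checked_sub(records_read) {
               Some(remaining) => records_to_read = remaining,
               None => panic!("inferencer read more records than requested"),
           }
           if records_to_read == 0 {
               break; // budget exhausted: the third "file" is never reached
           }
       }
       assert_eq!(records_to_read, 0);
   }
   ```

   `saturating_sub` would be the quieter alternative, clamping at zero instead of flagging the over-read.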



