You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/27 04:11:53 UTC

[GitHub] [arrow] nevi-me commented on a change in pull request #7252: ARROW-8906: [Rust] [DataFusion] support schema inference from multiple CSV files

nevi-me commented on a change in pull request #7252:
URL: https://github.com/apache/arrow/pull/7252#discussion_r430411298



##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -172,7 +177,42 @@ pub fn infer_file_schema<R: Read + Seek>(
     // return the reader seek back to the start
     csv_reader.into_inner().seek(SeekFrom::Start(0))?;
 
-    Ok(Schema::new(fields))
+    Ok((Schema::new(fields), records_count))
+}
+
+/// Infer schema from a list of CSV files by reading through first n records
+/// with `max_read_records` controlling the maximum number of records to read.
+///
+/// Files will be read in the given order untill n records have been reached.
+///
+/// If `max_read_records` is not set, all files will be read fully to infer the schema.
+pub fn infer_schema_from_files(
+    files: &Vec<String>,
+    delimiter: u8,
+    max_read_records: Option<usize>,
+    has_header: bool,
+) -> Result<Schema> {
+    let mut schemas = vec![];
+    let mut records_to_read = max_read_records.unwrap_or(std::usize::MAX);
+
+    for fname in files.iter() {
+        let (schema, records_read) = infer_file_schema(
+            &mut BufReader::new(File::open(fname)?),
+            delimiter,
+            Some(records_to_read),
+            has_header,
+        )?;
+        if records_read == 0 {
+            continue;
+        }
+        schemas.push(schema.clone());
+        records_to_read -= records_read;

Review comment:
       Just checking, we can never get `records_to_read` < 0, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org