You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/31 21:15:46 UTC

[GitHub] [arrow] zeapo opened a new pull request #7309: [ARROW-8993] support reading gzipped json files

zeapo opened a new pull request #7309:
URL: https://github.com/apache/arrow/pull/7309


   A proposal of an implementation to read `json.gz` files. The idea is to along `.json` files, support gziped ones. 
   
   The implementation of `json::ReadBuilder` heavily rely on `File`. My first approach would have been to make another version of the builder that would rely on a `BufReader<Box<dyn Read>>` and have the original one delegate to this the final build.
   
   I went with another approach, in this case I've added an option to specify if the file is gziped or not. In this case, the original implementation works as defined, however requires a switch of type as per https://github.com/apache/arrow/compare/master...zeapo:feature/json-gzip?expand=1#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR966-R967 
   
   I've got no religion on this. Suggestions are welcome, to be honest, I would love that the `infer_json_schema` be public , which would make all of this code unnecessary as it would delegate the builder process to the users who care about these features.
   
   3 options :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on a change in pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
zeapo commented on a change in pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#discussion_r434820112



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -364,6 +370,24 @@ impl<R: Read> Reader<R> {
         }
     }
 
+    /// Create a new JSON Reader from a `BufReader<R: Read>`
+    ///
+    /// If reading a `File`, you can customise the Reader, such as to enable schema

Review comment:
       changed :+1: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-644233988


   Hi @zeapo, can you please rebase to fix the merge conflict? I'll merge this PR after that


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] houqp commented on a change in pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#discussion_r434772617



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1037,62 +1065,73 @@ mod tests {
             .unwrap();
         let batch = reader.next().unwrap().unwrap();
 
-        assert_eq!(4, batch.num_columns());
-        assert_eq!(4, batch.num_rows());
-
-        let schema = batch.schema();
-
-        let a = schema.column_with_name("a").unwrap();
-        assert_eq!(&DataType::Int64, a.1.data_type());
-        let b = schema.column_with_name("b").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Float64)),
-            b.1.data_type()
-        );
-        let c = schema.column_with_name("c").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Boolean)),
-            c.1.data_type()
-        );
-        let d = schema.column_with_name("d").unwrap();
-        assert_eq!(&DataType::List(Box::new(DataType::Utf8)), d.1.data_type());
-
-        let bb = batch
-            .column(b.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let bb = bb.values();
-        let bb = bb.as_any().downcast_ref::<Float64Array>().unwrap();
-        assert_eq!(10, bb.len());
-        assert_eq!(4.0, bb.value(9));
-
-        let cc = batch
-            .column(c.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let cc = cc.values();
-        let cc = cc.as_any().downcast_ref::<BooleanArray>().unwrap();
-        assert_eq!(6, cc.len());
-        assert_eq!(false, cc.value(0));
-        assert_eq!(false, cc.value(3));
-        assert_eq!(false, cc.is_valid(2));
-        assert_eq!(false, cc.is_valid(4));
-
-        let dd = batch
-            .column(d.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let dd = dd.values();
-        let dd = dd.as_any().downcast_ref::<StringArray>().unwrap();
-        assert_eq!(7, dd.len());
-        assert_eq!(false, dd.is_valid(1));
-        assert_eq!("text", dd.value(2));
-        assert_eq!("1", dd.value(3));
-        assert_eq!("false", dd.value(4));
-        assert_eq!("array", dd.value(5));
-        assert_eq!("2.4", dd.value(6));
+        let mut file = File::open("test/data/mixed_arrays.json.gz").unwrap();

Review comment:
       yeah, i feel like compression support should be handled by higher level interface that takes filenames/paths as inputs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: [ARROW-8993] [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636706744


   After sleeping on it, here's the 3rd option (exposing infer schema with BufReader): https://github.com/apache/arrow/compare/master...zeapo:feature/buffer-exposed
   
   The advantage is that it does not need the dependency on gzip and has little impact on the api. It also opens the way to have other formats support.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7309: [ARROW-8993] [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636794722


   > It also opens the way to have other formats support
   
   Do you mean other compression formats?
   
   > However, this needs for the user to do manual seek (if the BufRead is on a File) and does not rely on the Builder. Unless a new builder with buffer is added.
   
   I suppose the broader question to ask would be whether we'd like to support reading (and writing) compressed data. There have also been some changes on the `arrow::csv` side, such as allowing inference of multiple files, which might also be convenient to have in `arrow::json`.
   
   What's your opinion on the 2 above? We need not implement them as part of this PR, but I suppose some user feedback is helpful :)
   
   I'm still pro returning the reader back to the start, or is there a performance impact in doing so? I wouldn't want to place the burden of seeking on the user, because I'd expect the common inference case to be getting the schema then reading the file.
   
   ```rust
       // return the reader seek back to the start	
       reader.into_inner().seek(SeekFrom::Start(0))?;
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636864911


   > I agree that placing the burden on the user is a bad idea. However, there are situations where we just can't seek back to start (s3 is one example). Maybe a specific implementation for `Seek + Read`, that would do the seek back to start, and one for `Read` only, that would not. However... this would need the use of specialization, so more nightly dependencies.
   
   Okay, in that case I could support not seeking back to the start of the input. One downside though is that in the case where no schema is supplied, and the readers (csv and json) infer the schema, we do need to reset the input to its starting position. I've briefly looked at the csv code, so if it's doable, we could find a solution.
   
   Regarding specialization, we are already dependent on it, with low likelihood of this changing soon; so I'd say it's an option.
   
   With regards to `dyn Read`, please have a look at the `arrow::csv::reader` code. We already support a `BufReader<R: Read>`, so I think we can implement the same without boxing the reader in the way that you've done so far. The only/primary reason why we still use `File` in `arrow::json` is that nobody's needed to use something else, or at least raised the issue.
   ___
   
   One dramatic alternative would be to always require a schema, and leave inference to the user. We could then consume the buffer reader (`reader: mut BufReader<R>` instead of `reader: &mut BufReader<R>`), so that we don't leave the user with a file handle that's already partially/fully consumed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636850680


   Thanks for your feedback.
   
   > Do you mean other compression formats?
   
   Yes, but not only. This would allow also to have different filesystems (like S3) where using `File` is not possible, and would make much more sense to have `dyn Read` (in rusoto they use ByteStream that implements Read and AsyncRead)
   
   > There have also been some changes on the arrow::csv side, such as allowing inference of multiple files, which might also be convenient to have in arrow::json
   
   That would be great. This would help when there are multiple small file that are written without a fixed batch size (small & big files) and data would be scattered across multiple files. If you have a JIRA issue for this I can take a look at it :)
   
   > I'm still pro returning the reader back to the start, or is there a performance impact in doing so? I wouldn't want to place the burden of seeking on the user, because I'd expect the common inference case to be getting the schema then reading the file.
   
   I agree that placing the burden on the user is a bad idea. However, there are situations where we just can't seek back to start (s3 is one example). Maybe a specific implementation for `Seek + Read`, that would do the seek back to start, and one for `Read` only, that would not. However... this would need the use of specialization, so more nightly dependencies.
   
   Not really sure :/


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7309: [ARROW-8993] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636531370


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7309: [ARROW-8993] [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636789883


   > ...I would love that the `infer_json_schema` be public , which would make all of this code unnecessary as it would delegate the builder process to the users who care about these features.
   
   Hi @zeapo, firstly, thanks for working on this. If exposing the function as public is helpful, please go ahead.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-644591933


   Hey @nevi-me, it should be done ^_^


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] houqp commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-637040555


   > One dramatic alternative would be to always require a schema, and leave inference to the user. We could then consume the buffer reader (reader: mut BufReader<R> instead of reader: &mut BufReader<R>), so that we don't leave the user with a file handle that's already partially/fully consumed.
   
   this is the approach csv reader took as well, so another benefit from doing this would be consistent user experience.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636863181


   I've updated the little poc: https://github.com/apache/arrow/compare/master...zeapo:feature/buffer-exposed
   
   The main changes are:
   ```rust
   pub fn infer_json_schema_from_seekable<R: Read + Seek>(
      source: &mut R,
      max_read_records: Option<usize>
   ) -> Result<Arc<Schema>>
   ```
   and
   ```rust
   pub fn infer_json_schema_from_reader<R: Read>(
      reader: &mut BufReader<R>,
      max_read_records: Option<usize>
   ) -> Result<Arc<Schema>>
   ```
   And the fact the `build` requires a `Seek` too. This makes the builder `File` agnostic, and relies only on Seek.
   
   Now, if the user (i.e. me? ^_^) wants to use a non-seekable, I'd use the infer on the buffer, handle the memory myself and pass the schema I inferred. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-637601236


   As per @houqp suggestion, here are the changes:
   
   -  match csv's infer schema:
       - CSV: https://github.com/apache/arrow/blob/master/rust/arrow/src/csv/reader.rs#L90-L95 
       - JSON: https://github.com/apache/arrow/pull/7309/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR174-R177
       - Mismatch: notice that the json's one return an `Arc<Schema>`. I don't know if we should make the csv reader match this or the other way arround. Especially that `Reader::new` takes `Arc<Schema>`
       - new function for seekable: https://github.com/apache/arrow/pull/7309/files?diff=split&w=1#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR159-R168
   - match csv's `Reader::from_buf_reader`:
       - csv: https://github.com/apache/arrow/blob/master/rust/arrow/src/csv/reader.rs#L235-L242
       - json: https://github.com/apache/arrow/pull/7309/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR377-R382
   - match csv's `Reader::new`:
       - csv: https://github.com/apache/arrow/blob/master/rust/arrow/src/csv/reader.rs#L198-L214
       - json: https://github.com/apache/arrow/pull/7309/files#diff-3fa6fc3c0ee201f5a3a1a5d25d0062ffR341-R347
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on a change in pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
zeapo commented on a change in pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#discussion_r434824558



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -157,9 +156,26 @@ fn generate_schema(spec: HashMap<String, HashSet<DataType>>) -> Result<Arc<Schem
 /// `max_read_records` controlling the maximum number of records to read.
 ///
 /// If `max_read_records` is not set, the whole file is read to infer its field types.
-fn infer_json_schema(file: File, max_read_records: Option<usize>) -> Result<Arc<Schema>> {
+pub fn infer_json_schema_from_seekable<R: Read + Seek>(

Review comment:
       Updated with additional examples and other tests at the end. could you please check




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-644261323


   I'll fix the build tonight.
   
   Le lun. 15 juin 2020 à 18:21, Wakahisa <no...@github.com> a écrit :
   
   > Hi @zeapo <https://github.com/zeapo>, can you please rebase to fix the
   > merge conflict? I'll merge this PR after that
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/pull/7309#issuecomment-644233988>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AACCABVQEVE6YUM6CTGEEKTRWZC63ANCNFSM4NPKLWTA>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #7309:
URL: https://github.com/apache/arrow/pull/7309


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] houqp commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-637059093


   Looks like we have to remove Seek constraints in order to support compressed format across the board. If this is the plan here as well, then I recommend just implement Read for json in this PR (with extra Seek helper function if you prefer). Then send a follow up PR to update csv reader.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-637046642


    Would it make sense to implement for json reader the same interfaces that are on csv reader, i.e. `Read + Seek` for `infer_file_schema`, and a `json::Reader::from_buf_reader()`.
   
   Then in another PR add an interface on `Read` only? In both json and csv. That would mean creating a new `infer_reader_schema` that takes a `Read` (rather than `Read + Seek`).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636865514


   > One dramatic alternative would be to always require a schema, and leave inference to the user. We could then consume the buffer reader (reader: mut BufReader<R> instead of reader: &mut BufReader<R>), so that we don't leave the user with a file handle that's already partially/fully consumed.
   
   I like this :+1: 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo edited a comment on pull request #7309: [ARROW-8993] [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
zeapo edited a comment on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636706744


   After sleeping on it, here's the 3rd option (exposing infer schema with BufReader): https://github.com/apache/arrow/compare/master...zeapo:feature/buffer-exposed
   
   The advantage is that it does not need the dependency on gzip and has little impact on the api. It also opens the way to have other formats support.
   
   However, this needs for the user to do manual seek (if the BufRead is on a File) and does not rely on the Builder. Unless a new builder with buffer is added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#discussion_r434689868



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -157,9 +156,26 @@ fn generate_schema(spec: HashMap<String, HashSet<DataType>>) -> Result<Arc<Schem
 /// `max_read_records` controlling the maximum number of records to read.
 ///
 /// If `max_read_records` is not set, the whole file is read to infer its field types.
-fn infer_json_schema(file: File, max_read_records: Option<usize>) -> Result<Arc<Schema>> {
+pub fn infer_json_schema_from_seekable<R: Read + Seek>(

Review comment:
       May you please add a note on why a user would want/need to use this vs the non-seekable one

##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1037,62 +1065,73 @@ mod tests {
             .unwrap();
         let batch = reader.next().unwrap().unwrap();
 
-        assert_eq!(4, batch.num_columns());
-        assert_eq!(4, batch.num_rows());
-
-        let schema = batch.schema();
-
-        let a = schema.column_with_name("a").unwrap();
-        assert_eq!(&DataType::Int64, a.1.data_type());
-        let b = schema.column_with_name("b").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Float64)),
-            b.1.data_type()
-        );
-        let c = schema.column_with_name("c").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Boolean)),
-            c.1.data_type()
-        );
-        let d = schema.column_with_name("d").unwrap();
-        assert_eq!(&DataType::List(Box::new(DataType::Utf8)), d.1.data_type());
-
-        let bb = batch
-            .column(b.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let bb = bb.values();
-        let bb = bb.as_any().downcast_ref::<Float64Array>().unwrap();
-        assert_eq!(10, bb.len());
-        assert_eq!(4.0, bb.value(9));
-
-        let cc = batch
-            .column(c.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let cc = cc.values();
-        let cc = cc.as_any().downcast_ref::<BooleanArray>().unwrap();
-        assert_eq!(6, cc.len());
-        assert_eq!(false, cc.value(0));
-        assert_eq!(false, cc.value(3));
-        assert_eq!(false, cc.is_valid(2));
-        assert_eq!(false, cc.is_valid(4));
-
-        let dd = batch
-            .column(d.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let dd = dd.values();
-        let dd = dd.as_any().downcast_ref::<StringArray>().unwrap();
-        assert_eq!(7, dd.len());
-        assert_eq!(false, dd.is_valid(1));
-        assert_eq!("text", dd.value(2));
-        assert_eq!("1", dd.value(3));
-        assert_eq!("false", dd.value(4));
-        assert_eq!("array", dd.value(5));
-        assert_eq!("2.4", dd.value(6));
+        let mut file = File::open("test/data/mixed_arrays.json.gz").unwrap();

Review comment:
       Will this mean that we're then not supporting zipped files for now? I'm fine with that.

##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -364,6 +370,24 @@ impl<R: Read> Reader<R> {
         }
     }
 
+    /// Create a new JSON Reader from a `BufReader<R: Read>`
+    ///
+    /// If reading a `File`, you can customise the Reader, such as to enable schema

Review comment:
       This doesn't flow well. Perhaps:
   
   ```
   To customi{s|z}e the schema, such as to enable schema inference, use `ReaderBuilder`
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] zeapo commented on a change in pull request #7309: ARROW-8993: [Rust] support reading non-seekable sources

Posted by GitBox <gi...@apache.org>.
zeapo commented on a change in pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#discussion_r434797655



##########
File path: rust/arrow/src/json/reader.rs
##########
@@ -1037,62 +1065,73 @@ mod tests {
             .unwrap();
         let batch = reader.next().unwrap().unwrap();
 
-        assert_eq!(4, batch.num_columns());
-        assert_eq!(4, batch.num_rows());
-
-        let schema = batch.schema();
-
-        let a = schema.column_with_name("a").unwrap();
-        assert_eq!(&DataType::Int64, a.1.data_type());
-        let b = schema.column_with_name("b").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Float64)),
-            b.1.data_type()
-        );
-        let c = schema.column_with_name("c").unwrap();
-        assert_eq!(
-            &DataType::List(Box::new(DataType::Boolean)),
-            c.1.data_type()
-        );
-        let d = schema.column_with_name("d").unwrap();
-        assert_eq!(&DataType::List(Box::new(DataType::Utf8)), d.1.data_type());
-
-        let bb = batch
-            .column(b.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let bb = bb.values();
-        let bb = bb.as_any().downcast_ref::<Float64Array>().unwrap();
-        assert_eq!(10, bb.len());
-        assert_eq!(4.0, bb.value(9));
-
-        let cc = batch
-            .column(c.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let cc = cc.values();
-        let cc = cc.as_any().downcast_ref::<BooleanArray>().unwrap();
-        assert_eq!(6, cc.len());
-        assert_eq!(false, cc.value(0));
-        assert_eq!(false, cc.value(3));
-        assert_eq!(false, cc.is_valid(2));
-        assert_eq!(false, cc.is_valid(4));
-
-        let dd = batch
-            .column(d.0)
-            .as_any()
-            .downcast_ref::<ListArray>()
-            .unwrap();
-        let dd = dd.values();
-        let dd = dd.as_any().downcast_ref::<StringArray>().unwrap();
-        assert_eq!(7, dd.len());
-        assert_eq!(false, dd.is_valid(1));
-        assert_eq!("text", dd.value(2));
-        assert_eq!("1", dd.value(3));
-        assert_eq!("false", dd.value(4));
-        assert_eq!("array", dd.value(5));
-        assert_eq!("2.4", dd.value(6));
+        let mut file = File::open("test/data/mixed_arrays.json.gz").unwrap();

Review comment:
       That would be great. I've seen that the C++ and Python libs have a filesystem interface. Maybe we could add that first, then work from there into adding mime-type and compression support?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-638169616


   > As per @houqp suggestion, here are the changes:
   
   Thanks, I'll start reviewing them


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636802967


   https://issues.apache.org/jira/browse/ARROW-8993


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org