Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 11:24:39 UTC

[GitHub] [arrow-rs] alamb opened a new issue #55: Reading parquet file is slow

alamb opened a new issue #55:
URL: https://github.com/apache/arrow-rs/issues/55


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6774
   
   Reading a parquet file using the example at [https://github.com/apache/arrow/tree/master/rust/parquet] is slow.
   
   The following snippet runs for about 17 seconds on a ~160 MB parquet file:
   {code:none}
   use std::fs::File;
   use std::time::Instant;
   use parquet::file::reader::{FileReader, SerializedFileReader};

   let file = File::open("data.parquet").unwrap(); // placeholder path
   let reader = SerializedFileReader::new(file).unwrap();
   let mut iter = reader.get_row_iter(None).unwrap();
   let start = Instant::now();
   while let Some(_record) = iter.next() {} // materializes every row, one at a time
   let duration = start.elapsed();
   println!("{:?}", duration);
   {code}
   
   If there is a more effective way to load a parquet file, it would be nice to add it to the readme.
   
   P.S.: My goal is to construct an ndarray from it, I'd be happy for any tips.
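
   For the ndarray goal, a minimal sketch of one possible approach, assuming the parquet crate's ParquetFileArrowReader / ArrowReader API discussed later in this thread, the arrow and ndarray crates, and an all-Float64 schema; the function name, path argument, and batch size are illustrative, not from the original issue:
   {code:none}
   use std::fs::File;
   use std::rc::Rc;

   use arrow::array::Float64Array;
   use arrow::record_batch::RecordBatchReader; // trait providing next_batch()
   use ndarray::Array2;
   use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
   use parquet::file::reader::SerializedFileReader;

   fn parquet_to_ndarray(path: &str) -> Array2<f64> {
       let file = File::open(path).unwrap();
       let file_reader = SerializedFileReader::new(file).unwrap();
       let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
       let mut batches = arrow_reader.get_record_reader(65_536).unwrap();

       let mut values: Vec<f64> = Vec::new();
       let mut n_cols = 0;
       while let Some(batch) = batches.next_batch().unwrap() {
           n_cols = batch.num_columns();
           // Downcast each column once per batch; panics if a column is not Float64.
           let cols: Vec<&Float64Array> = (0..n_cols)
               .map(|i| {
                   batch
                       .column(i)
                       .as_any()
                       .downcast_ref::<Float64Array>()
                       .expect("expected Float64 columns")
               })
               .collect();
           // Append values in row-major order to match Array2::from_shape_vec below.
           for row in 0..batch.num_rows() {
               for col in &cols {
                   values.push(col.value(row));
               }
           }
       }
       let n_rows = if n_cols == 0 { 0 } else { values.len() / n_cols };
       Array2::from_shape_vec((n_rows, n_cols), values).unwrap()
   }
   {code}
   If the file also contains non-numeric columns (such as strings), they would need to be projected away or converted separately before building the array.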




[GitHub] [arrow-rs] alamb commented on issue #55: Reading parquet file is slow

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #55:
URL: https://github.com/apache/arrow-rs/issues/55#issuecomment-826757586


   Comment from Wes McKinney (wesm) @ 2019-10-03T16:24:47.724+0000:
   <pre>Row-by-row iteration is going to be slow compared with vectorized / column-by-column reads. This unfinished PR was related to this (I think?), but there are Arrow-based readers available that don't require it
   
   https://github.com/apache/arrow/pull/3461</pre>
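
   For readers following along, a minimal sketch of the Arrow-based, column-by-column read path Wes refers to, assuming the parquet crate's ParquetFileArrowReader / ArrowReader API that is benchmarked later in this thread; the path argument and the batch size of 50_000 are placeholders:
   {code:none}
   use std::fs::File;
   use std::rc::Rc;

   use arrow::record_batch::RecordBatchReader; // trait providing next_batch()
   use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
   use parquet::file::reader::SerializedFileReader;

   fn read_columnar(path: &str) {
       let file = File::open(path).unwrap();
       let file_reader = SerializedFileReader::new(file).unwrap();
       let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
       // Decode whole column chunks into Arrow RecordBatches instead of
       // materializing one Row at a time.
       let mut batches = arrow_reader.get_record_reader(50_000).unwrap();
       while let Some(batch) = batches.next_batch().unwrap() {
           // Each batch holds up to 50_000 rows as typed, contiguous arrays.
           let _ = batch.num_rows();
       }
   }
   {code}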
   
   Comment from Adam Lippai (alippai) @ 2019-10-03T17:02:23.855+0000:
   <pre>I've seen some nice work in [https://github.com/apache/arrow/blob/master/rust/parquet/src/column/reader.rs] and [https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/array_reader.rs], but I couldn't figure out how to use it. [~liurenjie1024] can you help me perhaps?</pre>
   
   Comment from Renjie Liu (liurenjie1024) @ 2019-10-05T10:38:16.039+0000:
   <pre>This is part of a reader for reading parquet files into arrow arrays. It's almost complete, and we still have one PR ([https://github.com/apache/arrow/pull/5523]) waiting for review, which contains documentation and examples.</pre>
   
   Comment from Adam Lippai (alippai) @ 2019-10-07T10:37:11.985+0000:
   <pre>While it doesn't yet support reading the Utf8 type, dropping that column and then reading the same file takes less than 3 seconds! Thank you for the contribution. (Data size: 10k rows × 3k Float64 columns.)</pre>
   
   Comment from Neville Dipale (nevi_me) @ 2020-02-01T07:47:36.619+0000:
   <pre>Hi [~alippai], UTF8 types are now supported. Is the performance still a concern, or can we close this?</pre>
   
   Comment from Sietse Brouwer (sietsebb) @ 2020-09-30T22:15:00.262+0000:
   <pre>I'm not sure what test data [~alippai] used, so I used a test data set with 500k rows and two columns:
    * a column x containing random floating point numbers,
    * and a column y where each cell contains a Unicode string of 100 space-separated mostly-cyrillic words.
   
   See the attached [^data.py]. When I saved that 500k-row table as parquet with gzip compression, the resulting file was 174 MB.
   
   I tried running Adam's test snippet (the code I used is attached as [^main.rs]) while compiling with different versions of parquet:
    * parquet=0.15.1
    * parquet=1.0.1
    * parquet=2.0.0-SNAPSHOT (specifically git:3fae71b10c42 of 2020-09-30).
   
   *In all three cases, running the snippet took almost exactly 150 seconds*, give or take one second.
   
   Does that help you decide whether to close the question, [~nevi_me]? Or perhaps your comment from 2019-10-07, Adam, used some other version to get that speed improvement? Should I change the test to use the ParquetFileArrowReader example in [https://github.com/apache/arrow/blob/3fae71b10c42/rust/parquet/src/arrow/mod.rs#L25-L50], and then close this issue if that reader is faster?</pre>
   
   Comment from Adam Lippai (alippai) @ 2020-09-30T22:23:42.336+0000:
   <pre>[~sietsebb] I used the Arrow reader method; at the time it didn't support all the types I needed, but that support was added later. It's definitely faster, though I don't remember the benchmark numbers.</pre>
   
   Comment from Sietse Brouwer (sietsebb) @ 2020-10-03T23:46:00.174+0000:
   <pre>[~alippai], I can't get parquet::arrow::ParquetFileArrowReader to be faster than parquet::file::reader::SerializedFileReader under commit `3fae71b10c42`. Timings below, code below that, conclusions at the bottom. Interesting times in *bold*.
   
    
   ||n_rows||include UTF-8 column||reader||iteration unit _(loop does not iterate over rows within batches)_||time taken||
   |50_000|yes|ParquetFileArrowReader|1 batch of 50k rows|14.9s|
   |50_000|yes|ParquetFileArrowReader|10 batches of 5k rows|14.8s|
   |50_000|yes|ParquetFileArrowReader|50k batches of 1 row|24.0s|
   |50_000|yes|SerializedFileReader|get_row_iter|*14.5s*|
   | | | | | |
   |50_000|no|ParquetFileArrowReader|1 batch of 50k rows|*143ms*|
   |50_000|no|ParquetFileArrowReader|10 batches of 5k rows|154ms|
   |50_000|no|ParquetFileArrowReader|50k batches of 1 row|6.5s|
   |50_000|no|SerializedFileReader| get_row_iter|*211ms*|
   
    
   
   Here is the code I used to load the dataset with ParquetFileArrowReader (see also this version of [^main.rs]):
   
    
    {code:java}
    // Imports assumed from the attached main.rs:
    use std::fs::File;
    use std::rc::Rc;
    use std::time::Instant;

    use arrow::record_batch::RecordBatchReader; // trait providing next_batch()
    use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
    use parquet::file::reader::SerializedFileReader;

    fn read_with_arrow(file: File) {
        let file_reader = SerializedFileReader::new(file).unwrap();
        let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
        println!("Arrow schema is: {}", arrow_reader.get_schema().unwrap());
        let mut record_batch_reader = arrow_reader
            .get_record_reader(/* batch size */ 50000)
            .unwrap();

        let start = Instant::now();
        while let Some(_record) = record_batch_reader.next_batch().unwrap() {
            // no-op: decode the batch, then drop it
        }
        let duration = start.elapsed();

        println!("{:?}", duration);
    }
    {code}
    
   
    Main observations:
     * We can't tell whether the slow loading when we include the UTF-8 column is because UTF-8 is slow to process or because the column is simply very big (100 random Russian words per cell); see the column-projection sketch after this comment.
     * When the big UTF-8 column is included, iterating over every row with SerializedFileReader is as fast as iterating over a few batches with ParquetFileArrowReader, even when you skip the rows within the batches!
     * Should I try this again with Adam's original shape (10k rows × 3k Float64 columns) plus one small UTF-8 column?
     * I'm not even sure what result I'm trying to reproduce or falsify here: whether adding a small UTF-8 column causes a disproportionate slowdown, or whether switching between SerializedFileReader and ParquetFileArrowReader causes the slowdown. Right now, I feel like everything and nothing is in scope of the issue. I wouldn't mind if somebody made it narrower and clearer.</pre>
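
   To narrow down the first observation above, one option is to project columns at read time so that the big UTF-8 column is never decoded, then compare against the full read. A hedged sketch, assuming the ArrowReader trait in this parquet version exposes get_record_reader_by_columns (column indices plus batch size) and that column 0 is the float column x:
   {code:none}
   use std::fs::File;
   use std::rc::Rc;
   use std::time::Instant;

   use arrow::record_batch::RecordBatchReader; // trait providing next_batch()
   use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
   use parquet::file::reader::SerializedFileReader;

   fn read_only_float_column(path: &str) {
       let file = File::open(path).unwrap();
       let file_reader = SerializedFileReader::new(file).unwrap();
       let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
       // Read only column 0 (assumed to be the Float64 column `x`), so the
       // 100-word UTF-8 column is skipped instead of decoded.
       let mut batches = arrow_reader
           .get_record_reader_by_columns(vec![0], 50_000)
           .unwrap();

       let start = Instant::now();
       while let Some(_batch) = batches.next_batch().unwrap() {}
       println!("projected read took {:?}", start.elapsed());
   }
   {code}
   If the projected read is fast while the full read stays slow, the time is going into the UTF-8 column; separating "UTF-8 decoding is slow" from "the column is simply much bigger" would still need a run against a file with short strings.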


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org