Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 11:24:04 UTC

[GitHub] [arrow-rs] alamb commented on issue #47: [Parquet] Too many open files (os error 24)

alamb commented on issue #47:
URL: https://github.com/apache/arrow-rs/issues/47#issuecomment-826757214


   Comment from Chao Sun(csun) @ 2019-08-07T06:02:08.709+0000:
   <pre>Thanks for reporting. Do you have a rough idea of how deep the nested data type is? Is there any error message? It would be great if we could reproduce this.</pre>
   
   Comment from Yesh(madras) @ 2019-08-07T11:35:10.840+0000:
   <pre>Thanks for the ack. Below is the error message. An additional data point: parquet-schema is able to dump the schema.
   {code:java}
   thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5
   {code}</pre>
   
   Comment from Ahmed Riza(dr.riza@gmail.com) @ 2021-02-12T22:52:01.045+0000:
   <pre>I've come across the same error. In my case it appears to be due to the `try_clone` calls in [https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82]. I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls there eventually fail because they end up creating too many open file descriptors.
   
   Here's a stack trace from `gdb` that leads to the call in `io.rs`. This can be reproduced using the attached Parquet file.
   
   One could increase `ulimit -n` on Linux to get around this, but that is not really a solution, since the code path still ends up creating a potentially very large number of open file descriptors (one for each column in the Parquet file).
   
   This is the initial stack trace when the footer is first read. `FileSource<std::fs::File>::new` (in `io.rs`) is subsequently called for every column as well when reading the columns (see `fn reader_tree` in `parquet/record/reader.rs`).
   
    
   {code:java}
   #0  parquet::util::io::FileSource<std::fs::File>::new<std::fs::File> (fd=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
   #1  0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536)
       at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
   #2  0x000055555590a3fc in parquet::file::footer::parse_metadata<std::fs::File> (chunk_reader=0x7ffff7c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
   #3  0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader<std::fs::File>::new<std::fs::File> (chunk_reader=...)
       at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
   #4  0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
   #5  0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
   #6  0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")
       at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
   #7  0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
   {code}</pre>
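
   For reference, here is a minimal sketch of the failure mode and of a projection-based workaround, assuming the parquet 3.0.0 API that appears in the trace above (`SerializedFileReader::try_from` and `FileReader::get_row_iter`). The file name and projected column names below are hypothetical placeholders, not taken from the attached file:

   {code}
   use std::convert::TryFrom;
   use std::path::Path;

   use parquet::file::reader::{FileReader, SerializedFileReader};
   use parquet::schema::parser::parse_message_type;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Hypothetical stand-in for the attached 3000-column file.
       let path = Path::new("wide-3000-columns.snappy.parquet");

       // TryFrom<&Path> entry point (serialized_reader.rs:90 in the trace).
       // Reading rows across *all* columns creates one FileSource -- and one
       // cloned file descriptor -- per column, which is what exhausts the
       // process fd limit on a very wide file.
       let reader = SerializedFileReader::try_from(path)?;

       // Workaround sketch: project only the columns actually needed, so only
       // a few descriptors are cloned while reading.
       let projection = parse_message_type(
           "message schema { OPTIONAL INT64 col_0; OPTIONAL BYTE_ARRAY col_1 (UTF8); }",
       )?;

       for row in reader.get_row_iter(Some(projection))? {
           // ... process each materialized row
           let _ = row;
       }
       Ok(())
   }
   {code}

   A projection only limits how many descriptors get cloned; the per-column `try_clone` in `io.rs` remains the underlying problem.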


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org