Posted to issues@arrow.apache.org by "Jörn Horstmann (Jira)" <ji...@apache.org> on 2020/05/15 13:49:00 UTC

[jira] [Resolved] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte

     [ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jörn Horstmann resolved ARROW-7574.
-----------------------------------
    Resolution: Fixed

> [Rust] FileSource read implementation is seeking for each single byte
> ---------------------------------------------------------------------
>
>                 Key: ARROW-7574
>                 URL: https://issues.apache.org/jira/browse/ARROW-7574
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>    Affects Versions: 0.16.0
>            Reporter: Jörn Horstmann
>            Priority: Major
>
> On the current master branch:
> {code:java}
> $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
> ...
> lseek(3, -8, SEEK_END)                  = 2937
> read(3, ",\10\0\0PAR1", 8192)           = 8
> lseek(3, 845, SEEK_SET)                 = 845
> read(3, "\25\2\31\334H schema"..., 8192) = 2100
> ...
> lseek(5, 4, SEEK_SET)                   = 4
> read(5, "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000"..., 8192) = 2941
> lseek(5, 5, SEEK_SET)                   = 5
> read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000"..., 8192) = 2940
> lseek(5, 6, SEEK_SET)                   = 6
> read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000"..., 8192) = 2939
> lseek(5, 7, SEEK_SET)                   = 7
> read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000000"..., 8192) = 2938
> lseek(5, 8, SEEK_SET)                   = 8
> read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000000"..., 8192) = 2937
> lseek(5, 9, SEEK_SET)                   = 9
> read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004"..., 8192) = 2936
> lseek(5, 10, SEEK_SET)                  = 10
> read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004\30"..., 8192) = 2935
> {code}
> Notice that the seek position advances by only one byte per call, even though each read requests up to 8192 bytes and returns nearly 3000. Interestingly, this does not seem to have a big performance impact on a local file system on Linux, but it becomes a problem when working with a custom implementation of ParquetReader, for example one that reads from S3.
> The problem seems to be in
> {code}
> impl<R: ParquetReader> Read for FileSource<R>
> {code}
> which is unconditionally calling
> {code}
> reader.seek(SeekFrom::Start(self.start as u64))?
> {code}
> Instead, it should probably keep track of the current position and seek only on the first read.
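
The suggested fix could look roughly like the sketch below. It is not the actual parquet crate code: `RangeSource`, `CountingReader` and `read_range_bytewise` are hypothetical names made up for illustration, and the sketch assumes the source owns its reader exclusively, so nothing else can move the underlying cursor between reads. It seeks once, on the first read, and afterwards only advances its own position.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical stand-in for FileSource: exposes the byte range
/// [start, end) of an underlying reader through `Read`.
struct RangeSource<R: Read + Seek> {
    reader: R,
    start: u64,       // absolute offset of the next byte to read
    end: u64,         // absolute offset one past the last readable byte
    positioned: bool, // true once the initial seek has been done
}

impl<R: Read + Seek> RangeSource<R> {
    fn new(reader: R, start: u64, length: u64) -> Self {
        RangeSource { reader, start, end: start + length, positioned: false }
    }
}

impl<R: Read + Seek> Read for RangeSource<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Seek only once. Later reads continue from where the previous
        // read stopped (assumes exclusive ownership of `reader`).
        if !self.positioned {
            self.reader.seek(SeekFrom::Start(self.start))?;
            self.positioned = true;
        }
        // Never read past the end of the range.
        let want = (buf.len() as u64).min(self.end - self.start) as usize;
        let n = self.reader.read(&mut buf[..want])?;
        self.start += n as u64;
        Ok(n)
    }
}

/// Test helper (hypothetical): counts how often `seek` is called.
struct CountingReader<R> {
    inner: R,
    seeks: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.seeks += 1;
        self.inner.seek(pos)
    }
}

/// Reads a range one byte at a time, the access pattern from the
/// strace output, and returns the bytes plus the number of seeks.
fn read_range_bytewise(data: Vec<u8>, start: u64, length: u64) -> io::Result<(Vec<u8>, usize)> {
    let counting = CountingReader { inner: Cursor::new(data), seeks: 0 };
    let mut source = RangeSource::new(counting, start, length);
    let mut out = Vec::new();
    let mut byte = [0u8; 1];
    while source.read(&mut byte)? == 1 {
        out.push(byte[0]);
    }
    Ok((out, source.reader.seeks))
}
```

With this shape, even a caller that reads a single byte at a time triggers one seek for the whole range. If the underlying reader can be shared with other sources, the single-seek assumption no longer holds and the position would have to be re-checked before each read.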



--
This message was sent by Atlassian Jira
(v8.3.4#803005)