
Posted to jira@arrow.apache.org by "Frank Luan (Jira)" <ji...@apache.org> on 2022/04/27 00:33:00 UTC

[jira] [Created] (ARROW-16351) [C++][Python] Implement seek() for BufferedInputStream

Frank Luan created ARROW-16351:
----------------------------------

             Summary: [C++][Python] Implement seek() for BufferedInputStream
                 Key: ARROW-16351
                 URL: https://issues.apache.org/jira/browse/ARROW-16351
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python
    Affects Versions: 7.0.0
            Reporter: Frank Luan


I would like to use seek() in a buffered input stream for the following usage scenario:
 * Open an S3 file (e.g. 1 GB)
 * Jump to an offset (e.g. skip 500 MB)
 * Do many small (8-byte) reads

This way I get the performance of buffered input, avoiding many small reads against the underlying file (which are expensive and slow on S3), while still being able to seek to a position.

Currently I have to work around this by combining a RandomAccessFile and a BufferedInputStream, like this:

{{with _fs.open_input_file(url) as f:}}
{{    f.seek(offset)}}
{{    f = fs._wrap_input_stream(f, url, None, self._buffer_size)}}
{{    x = f.read(8)}}
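
For illustration, the same "seek first, then wrap in a buffer" pattern can be sketched with Python's standard io module standing in for the Arrow classes. This is only an analogy under stated assumptions, not Arrow's API: CountingRaw and read_records are hypothetical helpers, and the in-memory bytes stand in for the S3 object. The counter shows how many reads actually reach the underlying file.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical in-memory stand-in for a remote file; counts raw reads."""

    def __init__(self, data: bytes):
        super().__init__()
        self._data = data
        self._pos = 0
        self.raw_reads = 0  # number of reads that hit the "remote" file

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = len(self._data) + offset
        else:
            raise ValueError("invalid whence")
        return self._pos

    def readinto(self, b):
        # Every call here models one expensive request to the backing store.
        self.raw_reads += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def read_records(raw, offset, count, record_size=8, buffer_size=4096):
    """Seek on the raw file first, then wrap it in a buffer for small reads."""
    raw.seek(offset)
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    return [buffered.read(record_size) for _ in range(count)]


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # 16 KiB of sample data
    raw = CountingRaw(data)
    records = read_records(raw, offset=8192, count=100)
    print(len(records), "records read with", raw.raw_reads, "raw read(s)")
```

The buffer size and record size here are arbitrary; the point is only that the small reads are served from the buffer rather than each one going to the underlying file.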

Is there any fundamental reason why seek() is not implemented for the buffered input stream? .NET appears to implement it for its BufferedStream: [https://docs.microsoft.com/en-us/dotnet/api/system.io.bufferedstream.seek?view=net-6.0]

Alternatively, what I actually need is to open an S3 file at an offset. Would this be easier to do, or is it already supported in the current API?


