Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/06/29 04:17:00 UTC

[jira] [Resolved] (ARROW-8950) [C++] Make head optional in s3fs

     [ https://issues.apache.org/jira/browse/ARROW-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-8950.
---------------------------------
    Resolution: Fixed

Issue resolved by pull request 7547
[https://github.com/apache/arrow/pull/7547]

> [C++] Make head optional in s3fs
> --------------------------------
>
>                 Key: ARROW-8950
>                 URL: https://issues.apache.org/jira/browse/ARROW-8950
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remi Dettai
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When you open an input file with s3fs, it issues a HEAD request to S3 to check that the file is present/authorized and to get its size (https://github.com/apache/arrow/blob/f16f76ab7693ae085e82f4269a0a0bc23770bef9/cpp/src/arrow/filesystem/s3fs.cc#L407).
> This call comes with a non-negligible cost:
>  * adds latency
>  * priced the same as a GET request by AWS
> I fail to see use cases where this call is really crucial:
>  * if the file is not present/authorized, failing at the first read seems to have mostly the same effect as failing on opening. I agree that it is kind of "usual" for an _open_ call to fail eagerly, so to avoid surprises we could add a flag indicating whether _OpenInputFile_ must fail eagerly when run on an inaccessible file.
>  * getting the size can be done on the first read, and could be mostly avoided on the caller side if the filesystem API provided read-from-end capabilities (compatible with fs reads using _ios::end_ and, on HTTP filesystems, with _bytes=-xxx_). In the worst case, the call to _head_ could be done lazily when calling _getSize()_.
> I agree that it makes things a bit more complex, and I understand that you would not want to complicate the generic fs API because of blob storage behavior. But obviously there are workloads where this has a significant impact.
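
The two ideas in the description (deferring the HEAD until the size is actually needed, and reading from the end with a suffix range) can be sketched as follows. This is only an illustration under stated assumptions, not the Arrow API and not what pull request 7547 implements: FakeObjectStore, LazyInputFile, ReadFromEnd and SuffixRange are hypothetical names, and the store is an in-memory stand-in that merely counts requests. It needs C++17 for std::optional.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for an object store (not an S3 client).
// It counts requests so the cost of each access pattern is visible.
struct FakeObjectStore {
  std::map<std::string, std::string> objects;
  int head_requests = 0;
  int get_requests = 0;

  // Mimics a HEAD request: returns the object size, throws if the key is missing.
  int64_t Head(const std::string& key) {
    ++head_requests;
    auto it = objects.find(key);
    if (it == objects.end()) throw std::runtime_error("not found: " + key);
    return static_cast<int64_t>(it->second.size());
  }

  // Mimics a GET with a suffix range: returns the last `nbytes` bytes
  // (or the whole object if it is shorter), without knowing the size upfront.
  std::string GetSuffix(const std::string& key, int64_t nbytes) {
    assert(nbytes > 0);
    ++get_requests;
    auto it = objects.find(key);
    if (it == objects.end()) throw std::runtime_error("not found: " + key);
    const std::string& data = it->second;
    const std::size_t n =
        std::min<std::size_t>(data.size(), static_cast<std::size_t>(nbytes));
    return data.substr(data.size() - n);
  }
};

// Formats the HTTP Range header value that asks for the last `nbytes` bytes,
// i.e. the "bytes=-xxx" form mentioned in the description.
std::string SuffixRange(int64_t nbytes) {
  assert(nbytes > 0);
  return "bytes=-" + std::to_string(nbytes);
}

// Hypothetical input file that does not issue a HEAD when constructed.
class LazyInputFile {
 public:
  LazyInputFile(FakeObjectStore* store, std::string key)
      : store_(store), key_(std::move(key)) {}

  // The HEAD request is issued on the first call only; the result is cached.
  // A missing or unauthorized object therefore fails here, not at open time.
  int64_t GetSize() {
    if (!size_) size_ = store_->Head(key_);
    return *size_;
  }

  // Reads the tail of the object with a single GET and no prior HEAD.
  std::string ReadFromEnd(int64_t nbytes) {
    return store_->GetSuffix(key_, nbytes);
  }

 private:
  FakeObjectStore* store_;
  std::string key_;
  std::optional<int64_t> size_;
};
```

With this shape, a caller that only needs the tail of a file (for example a footer) pays for one GET and no HEAD, while a caller that asks for the size pays for exactly one HEAD however many times it asks. The trade-off is the one raised above: errors for an inaccessible object surface at the first read or size query rather than at open.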



--
This message was sent by Atlassian Jira
(v8.3.4#803005)