Posted to issues@arrow.apache.org by "gitfy (via GitHub)" <gi...@apache.org> on 2024/04/19 17:47:39 UTC

[I] Reading file from S3 [arrow]

gitfy opened a new issue, #41310:
URL: https://github.com/apache/arrow/issues/41310

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We use the Arrow S3 filesystem object to read Parquet files from an object store.
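   
   For context, a minimal sketch of the kind of read involved (bucket, key, and region below are illustrative):
   
   ```python
   import pyarrow.parquet as pq
   from pyarrow import fs
   
   # Arrow's native S3 filesystem (implemented in C++ in s3fs.cc)
   s3 = fs.S3FileSystem(region="us-east-1")
   
   # read_table looks up the object size with a HEAD request, then
   # fetches the bytes it needs with ranged GETs
   table = pq.read_table("my-bucket/path/to/file.parquet", filesystem=s3)
   ```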
   
   We ran into exceptions like
   ```
     File "/opt/app-root/lib64/python3.9/site-packages/pyarrow/parquet/core.py", line 3008, in read_table
       return dataset.read(columns=columns, use_threads=use_threads,
     File "/opt/app-root/lib64/python3.9/site-packages/pyarrow/parquet/core.py", line 2636, in read
       table = self._dataset.to_table(
     File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 3713, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: Unexpected end of stream: Page was smaller (44336) than expected (44387)
   ```
   
   and 
   ```
     File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
   OSError: Couldn't deserialize thrift: don't know what type: 
   Deserializing page header failed.
   ```
   While investigating this issue, we found that it happens in our system when the file being read is simultaneously being overwritten by other processes. Many others have reported exceptions like the above in the past, but most of those reports were deflected on the grounds that they were using a different filesystem object to read the file. Here we are using Arrow's own S3 filesystem (s3fs.cc) and still hit the issue.
   
   Upon reviewing the code, the following happens while reading a file (sketched below with boto3 for illustration):
   1. Read the object metadata with a HEAD request (which returns the content-length).
   2. Use that content length to fetch bytes from the object store with ranged GET requests.
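   
   The equivalent request sequence, expressed with boto3 purely for illustration (Arrow's C++ filesystem talks to S3 via the AWS C++ SDK; bucket and key are made up):
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   
   # Step 1: HEAD returns the object's current size
   head = s3.head_object(Bucket="my-bucket", Key="file.parquet")
   size = head["ContentLength"]
   
   # Step 2: ranged GETs computed from that size; if the object is
   # overwritten in between, these ranges land on different bytes
   footer = s3.get_object(
       Bucket="my-bucket",
       Key="file.parquet",
       Range=f"bytes={size - 8}-{size - 1}",
   )["Body"].read()
   ```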
   
   Here it can happen that, after the HEAD call, the file gets updated, so by the time of the GET the stored content length can be totally off.
   
   This leads to the above exceptions. I am submitting a minor fix which does the following (sketched below):
   1. From the HEAD call, store the version-id of the file along with the content-length.
   2. Pass that version-id in the GET calls so they read that exact version of the file.
   This will avoid the race condition, as a new update will create a new file version.
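   
   Again sketched with boto3 for illustration (the actual fix lives in Arrow's C++ s3fs.cc):
   
   ```python
   import boto3
   
   s3 = boto3.client("s3")
   
   # On a versioned bucket, HEAD also returns the current VersionId
   head = s3.head_object(Bucket="my-bucket", Key="file.parquet")
   size = head["ContentLength"]
   version_id = head.get("VersionId")  # absent on unversioned buckets
   
   # Pinning every GET to that VersionId guarantees the ranges are read
   # against the same immutable version that HEAD described
   chunk = s3.get_object(
       Bucket="my-bucket",
       Key="file.parquet",
       VersionId=version_id,
       Range=f"bytes=0-{size - 1}",
   )["Body"].read()
   ```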
   
   For buckets that are not versioned, the only option would be to retry the read.
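   
   A caller-side retry might look like this (illustrative; the `attempts` count and backoff are arbitrary):
   
   ```python
   import time
   
   import pyarrow.parquet as pq
   
   def read_with_retry(path, filesystem, attempts=3, backoff=1.0):
       # Retry reads that fail mid-flight because the object changed
       for attempt in range(attempts):
           try:
               return pq.read_table(path, filesystem=filesystem)
           except OSError:
               if attempt == attempts - 1:
                   raise
               time.sleep(backoff * (attempt + 1))
   ```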
   
   ### Component(s)
   
   C++




Re: [I] [C++] Reading file from S3 [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #41310:
URL: https://github.com/apache/arrow/issues/41310#issuecomment-2067585727

   Would this be better as a config in the API? Since once `size` is passed explicitly, this is ignored.
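   
   (If I read the comment right, one place where the size arrives pre-computed is dataset discovery, which caches each file's `FileInfo`, so scans reuse those sizes instead of issuing fresh HEAD requests. Bucket and prefix below are illustrative.)
   
   ```python
   import pyarrow.dataset as ds
   from pyarrow import fs
   
   s3 = fs.S3FileSystem()
   
   # Discovery lists the objects once; the cached sizes are then reused
   # for the ranged reads, so a per-file HEAD (and any version-id it
   # could capture) is skipped
   dataset = ds.dataset("my-bucket/prefix/", format="parquet", filesystem=s3)
   table = dataset.to_table()
   ```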

