Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2022/10/24 12:00:00 UTC

[jira] [Commented] (ARROW-18076) [Python] PyArrow cannot read from R2 (Cloudflare's S3)

    [ https://issues.apache.org/jira/browse/ARROW-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623138#comment-17623138 ] 

Alessandro Molina commented on ARROW-18076:
-------------------------------------------

Have you tried reaching out to Cloudflare to verify whether it might be a problem with the file itself? That error is usually caused by a mismatch between the {{Content-Length}} header and the number of bytes actually transferred. In the majority of cases the server is setting a wrong {{Content-Length}} or truncating the connection. So I would check with Cloudflare support, especially since you say that the same file works correctly when using S3.
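To see whether that is what is happening, a quick standalone check (a sketch, independent of PyArrow; the URL is a placeholder to be replaced with a presigned R2 object URL) is to compare the declared {{Content-Length}} against the bytes actually received:

```python
# Diagnostic sketch: fetch a URL and compare the server's declared
# Content-Length header against the number of bytes actually received.
# A mismatch here would explain curl's "Transferred a partial file"
# error (curlCode 18).
import urllib.request

def check_content_length(url):
    with urllib.request.urlopen(url) as resp:
        declared = int(resp.headers["Content-Length"])
        received = len(resp.read())
    # Returns (declared, received, whether they agree)
    return declared, received, declared == received

# Example (placeholder URL -- substitute a presigned R2 object URL):
# declared, received, ok = check_content_length("https://<presigned-r2-url>")
```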

I'm going to close the ticket. If you get an answer from Cloudflare confirming that everything is fine on their side, feel free to reopen it.

> [Python] PyArrow cannot read from R2 (Cloudflare's S3)
> ------------------------------------------------------
>
>                 Key: ARROW-18076
>                 URL: https://issues.apache.org/jira/browse/ARROW-18076
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Ubuntu 20
>            Reporter: Vedant Roy
>            Priority: Major
>
> When using pyarrow to read parquet data (as part of the Ray project), I get the following stack trace:
> {noformat}
> (_sample_piece pid=49818) Traceback (most recent call last):
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 859, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "python/ray/_raylet.pyx", line 863, in ray._raylet.execute_task
> (_sample_piece pid=49818)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 446, in _sample_piece
> (_sample_piece pid=49818)     batch = next(batches)
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 3202, in _iterator
> (_sample_piece pid=49818)   File "pyarrow/_dataset.pyx", line 2891, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
> (_sample_piece pid=49818)   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> (_sample_piece pid=49818) OSError: AWS Error [code 99]: curlCode: 18, Transferred a partial file
> {noformat}
> I do not get this error when using Amazon S3 for the exact same data.
> The error is coming from this line:
> https://github.com/ray-project/ray/blob/6fb605379a726d889bd25cf0ee4ed335c74408ff/python/ray/data/datasource/parquet_datasource.py#L446



--
This message was sent by Atlassian Jira
(v8.20.10#820010)