Posted to jira@arrow.apache.org by "James Bourbeau (Jira)" <ji...@apache.org> on 2022/12/14 21:44:00 UTC

[jira] [Created] (ARROW-18436) `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space

James Bourbeau created ARROW-18436:
--------------------------------------

             Summary: `pyarrow.fs.FileSystem.from_uri` crashes when URI has a space
                 Key: ARROW-18436
                 URL: https://issues.apache.org/jira/browse/ARROW-18436
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 10.0.1
         Environment: - OS: macOS
- `python=3.9.15:h709bd14_0_cpython` (installed from conda-forge)
- `pyarrow=10.0.1:py39h2db5b05_1_cpu` (installed from conda-forge)
            Reporter: James Bourbeau


When attempting to create a new filesystem object from a public dataset in S3 whose object key contains a space (here, `trip data` under the `nyc-tlc` bucket), an error is raised.

Here's a minimal reproducer:

```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
```

which fails with the following traceback:

```
Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
```

Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that contains the space with a `*` wildcard:

```python
from pyarrow.fs import FileSystem

result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet")  # works
```

The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is the problematic part.
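
One possible workaround, offered as a sketch rather than a confirmed fix: percent-encode the path portion of the URI before handing it to `FileSystem.from_uri`. This assumes that `from_uri` expects an RFC 3986 style URI in which a space is written as `%20` and that it decodes the path afterwards; that behavior is not confirmed anywhere in this report. `percent_encode_uri_path` is a hypothetical helper name used only for illustration and is not part of pyarrow.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_uri_path(uri: str) -> str:
    """Return `uri` with unsafe characters in its path percent-encoded.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    parts = urlsplit(uri)
    # Keep "/" as the path separator; encode spaces and other unsafe characters.
    encoded_path = quote(parts.path, safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


uri = "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet"
encoded = percent_encode_uri_path(uri)
print(encoded)  # s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet

# Assumption (unverified here): from_uri accepts the encoded form.
# from pyarrow.fs import FileSystem
# fs, path = FileSystem.from_uri(encoded)
```

The helper should only be applied to URIs that are not already encoded, since an existing `%20` would be re-encoded as `%2520`.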


