Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2022/08/01 16:00:20 UTC

Help with writing/reading from s3

Hello!

We recently updated Arrow to 7.0.0 and hit an error with our old code
(details below). I wonder if there is a new way to do this with the current
version?

import pandas as pd
import pyarrow
import pyarrow.parquet as pq

df = pd.DataFrame({"aa": [1, 2, 3], "bb": [1, 2, 3]})

uri = "gs://amp_bucket_liao/try"

s3fs = ...  # filesystem construction elided

pq.write_to_dataset(
    table=pyarrow.Table.from_pandas(df=df, preserve_index=True),
    root_path=uri, filesystem=s3fs, partition_cols=["aa"]
)
# So far it works fine.

# The following gives an error; the error message is below.
test_df = pq.read_table(
    source=uri, filesystem=s3fs
)

Error:

/home/tsdist/vats_deployments/modeling.env.interactive-bc9b04a0-708b-45b8-90bc-14b9ca6ee9ba/ext/public/python/pyarrow/7/0/x/dist/lib/python3.9/pyarrow/error.pxi
in pyarrow.lib.check_status()
     97
     98         if status.IsInvalid():
---> 99             raise ArrowInvalid(message)
    100         elif status.IsIOError():
    101             # Note: OSError constructor is

ArrowInvalid: GetFileInfo() yielded path
'amp_bucket_liao/try/aa=3/235add6629d44a2f8fa4ec772340b73d.parquet',
which is outside base dir 'gs://amp_bucket_liao/try'

Re: Help with writing/reading from s3

Posted by Li Jin <ic...@gmail.com>.
Thanks! Removing the "gs://" prefix indeed fixes it.
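
For reference, here is a minimal sketch of the fix applied to my snippet above
(the S3FileSystem setup is just a placeholder; the endpoint and credentials
depend on your environment, and only the read path changes):

from pyarrow.fs import S3FileSystem
import pyarrow.parquet as pq

# Placeholder filesystem; in practice point it at your own endpoint/credentials.
s3fs = S3FileSystem(endpoint_override="https://storage.googleapis.com")

uri = "gs://amp_bucket_liao/try"

# Strip the scheme and pass a plain "bucket/key" path; the filesystem object
# already determines where the request goes.
path = uri.split("://", 1)[1]  # -> "amp_bucket_liao/try"

test_df = pq.read_table(source=path, filesystem=s3fs)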

On Tue, Aug 2, 2022 at 4:01 PM Will Jones <wi...@gmail.com> wrote:

> Hi Li Jin,
>
> I'm not sure yet what changed, but I believe you can fix that error simply
> by omitting the scheme prefix from the URI and just using the path when
> loading the dataset. Here's my repro:
>
> import pyarrow as pa
> import pyarrow.dataset as ds
> from pyarrow.fs import S3FileSystem
>
> s3fs = S3FileSystem(
>     endpoint_override="https://storage.googleapis.com",
>     anonymous=True
> )
>
> uri = "gs://voltrondata-labs-datasets/nyc-taxi"
>
> # This works
> ds.dataset(uri[5:], filesystem=s3fs)
>
> # With prefix causes error
> ds.dataset(uri, filesystem=s3fs)
> # ArrowInvalid: Expected an S3 object path of the form 'bucket/key...',
> # got a URI: 'gs://voltrondata-labs-datasets/nyc-taxi'
>
> Best,
>
> Will Jones

Re: Help with writing/reading from s3

Posted by Will Jones <wi...@gmail.com>.
Hi Li Jin,

I'm not sure yet what changed, but I believe you can fix that error simply
by omitting the scheme prefix from the URI and just using the path when
loading the dataset. Here's my repro:

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

s3fs = S3FileSystem(
    endpoint_override="https://storage.googleapis.com",
    anonymous=True
)

uri = "gs://voltrondata-labs-datasets/nyc-taxi"

# This works
ds.dataset(uri[5:], filesystem=s3fs)

# With prefix causes error
ds.dataset(uri, filesystem=s3fs)
# ArrowInvalid: Expected an S3 object path of the form 'bucket/key...',
# got a URI: 'gs://voltrondata-labs-datasets/nyc-taxi'
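
If you'd rather not hard-code the prefix length (the uri[5:] above), one
option (just a sketch, reusing ds, uri, and s3fs from the repro) is to strip
the scheme generically:

from urllib.parse import urlparse

def strip_scheme(uri):
    # "gs://bucket/key" -> "bucket/key"; scheme-less paths pass through unchanged.
    parsed = urlparse(uri)
    return parsed.netloc + parsed.path if parsed.scheme else uri

ds.dataset(strip_scheme(uri), filesystem=s3fs)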

Best,

Will Jones
