Posted to user@arrow.apache.org by Kelton Halbert <kt...@wxbyte.com> on 2022/02/20 23:04:07 UTC

FileNotFound Error on root directory with fsspec partitioned dataset

Hello,

I’ve been learning and working with PyArrow recently for a project to store some atmospheric science data as part of a partitioned dataset, and recently the dataset class with the fsspec/gcsfs filesystem has started producing a new error. Unfortunately I cannot seem to track down what changed, or whether the error is on my end. I’m using PyArrow 7.0.0 and Python 3.8.

If I specify a specific parquet file, everything is fine - but if I give it any of the directory partitions, the same issue occurs. Any guidance here would be appreciated!

The code:
import gcsfs
import pyarrow
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(token="anon")

partitioning = ds.HivePartitioning(
        pyarrow.schema([
            pyarrow.field('year', pyarrow.int32()),
            pyarrow.field('month', pyarrow.int32()),
            pyarrow.field('day', pyarrow.int32()),
            pyarrow.field('hour', pyarrow.int32()),
            pyarrow.field('WMO', pyarrow.string())
        ])
)

schema = pyarrow.schema([
    pyarrow.field('lon', pyarrow.float32()),
    pyarrow.field('lat', pyarrow.float32()),
    pyarrow.field('pres', pyarrow.float32()),
    pyarrow.field('hght', pyarrow.float32()),
    pyarrow.field('gpht', pyarrow.float32()),
    pyarrow.field('tmpc', pyarrow.float32()),
    pyarrow.field('dwpc', pyarrow.float32()),
    pyarrow.field('relh', pyarrow.float32()),
    pyarrow.field('uwin', pyarrow.float32()),
    pyarrow.field('vwin', pyarrow.float32()),
    pyarrow.field('wspd', pyarrow.float32()),
    pyarrow.field('wdir', pyarrow.float32()),
    pyarrow.field('year', pyarrow.int32()),
    pyarrow.field('month', pyarrow.int32()),
    pyarrow.field('day', pyarrow.int32()),
    pyarrow.field('hour', pyarrow.int32()),
    pyarrow.field('WMO', pyarrow.string())
])

data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs, format="parquet",
                        partitioning=partitioning, schema=schema)

subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")

# apply the partition filter so only year=2016 / WMO=72451 files are scanned
batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir", "year", "month", "day", "hour"],
                filter=subset, use_threads=True)

batches = list(batches)

The error:
    391 from pyarrow import PythonFile
    393 if not self.fs.isfile(path):
--> 394     raise FileNotFoundError(path)
    396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: global-radiosondes/hires-sonde/


Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Kelton Halbert <kt...@wxbyte.com>.
Sorry for the slow response, everyone. Thanks for the discussion and the help - I think I have tracked down the issue, and it likely has nothing to do with PyArrow.

Recently I had to implement a workaround due to a bug introduced in fsspec: https://github.com/fsspec/gcsfs/issues/404

This involved using a custom class to fix a directory creation issue, and this correlates with when I started having problems. This post on the GitHub issue appears to be relevant: 

“Note that GCS does not have any directories below buckets. The online console and gsutils emulate buckets by using zero-length files, but they are not really directories. On the other hand, you can create any key without first making directories, and the intervening implied directories will be implicitly inferred.”

So, given this info, I believe that the workaround I used created a zero-length file in the root “directory” called “/”. That said, calling “isfile” on “/” still returns False, but nonetheless I have a hunch that it’s related to this workaround.

I’m not exactly sure how to remedy this. I’d prefer not to have to re-upload and re-process the dataset, so I’m going to look into manually fixing the bucket. I’m happy to hear any thoughts or suggestions on how to fix this.
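
For reference, a minimal sketch of what that manual fix might look like with gcsfs (an assumption on my part, not a confirmed fix: it presumes the stray placeholder really is a zero-byte object whose key ends in "/", and it needs write credentials rather than token="anon"):

import gcsfs

fs = gcsfs.GCSFileSystem()  # authenticated session with write access
path = "global-radiosondes/hires-sonde/"  # note the trailing slash

# If the placeholder shows up as a zero-byte "file", delete it; the implied
# directory will still be inferred from the real object keys beneath it.
info = fs.info(path)
if info.get("type") == "file" and info.get("size") == 0:
    fs.rm(path)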

Kelton.

> On Feb 23, 2022, at 4:20 PM, Micah Kornfield <em...@gmail.com> wrote:
> 
> 
> > You might also try the GCS filesystem (released with 7.0.0) instead of
> going through fsspec.
> 
> I don't think the native GCS filesystem support is complete in 7.0.0, I think if you are willing to compile from the latest commit in the repo it might be useable.
> 
>> On Wed, Feb 23, 2022 at 11:41 AM Weston Pace <we...@gmail.com> wrote:
>> I'm pretty sure GCS is similar to S3 in that there is no such thing as
>> a "directory".  Instead a directory is often emulated by an empty
>> file.  Note that the single file being detected is hires-sonde/ (with
>> a trailing slash).  I'm pretty sure this is the convention for
>> creating mock directories.  I'm guessing, if there were multiple
>> files, we would work ok because we just skip the empty files.
>> 
>> So perhaps this is a problem unique to gcsfs/fsspec and trying to read
>> an "empty directory".
>> 
>> You might also try the GCS filesystem (released with 7.0.0) instead of
>> going through fsspec.
>> 
>> On Wed, Feb 23, 2022 at 2:23 AM Joris Van den Bossche
>> <jo...@gmail.com> wrote:
>> >
>> >
>> > On Mon, 21 Feb 2022 at 00:04, Kelton Halbert <kt...@wxbyte.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I’ve been learning and working with PyArrow recently for a project to store some atmospheric science data as part of a partitioned dataset, and recently the dataset class with the  fsspec/gcsfs filesystem has started producing a new error.
>> >
>> >
>> > Hi Kelton,
>> >
>> > One more question: you say that this started producing a new error, so I suppose this worked a while ago? Do you know if you updated some packages (eg gcsfs or fsspec) since then? Or something else that might have changed?
>> >
>> > Joris
>> >

Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Micah Kornfield <em...@gmail.com>.
> You might also try the GCS filesystem (released with 7.0.0) instead of
going through fsspec.

I don't think the native GCS filesystem support is complete in 7.0.0. I
think if you are willing to compile from the latest commit in the repo it
might be usable.

On Wed, Feb 23, 2022 at 11:41 AM Weston Pace <we...@gmail.com> wrote:

> I'm pretty sure GCS is similar to S3 in that there is no such thing as
> a "directory".  Instead a directory is often emulated by an empty
> file.  Note that the single file being detected is hires-sonde/ (with
> a trailing slash).  I'm pretty sure this is the convention for
> creating mock directories.  I'm guessing, if there were multiple
> files, we would work ok because we just skip the empty files.
>
> So perhaps this is a problem unique to gcsfs/fsspec and trying to read
> an "empty directory".
>
> You might also try the GCS filesystem (released with 7.0.0) instead of
> going through fsspec.
>
> On Wed, Feb 23, 2022 at 2:23 AM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> >
> > On Mon, 21 Feb 2022 at 00:04, Kelton Halbert <kt...@wxbyte.com>
> wrote:
> >>
> >> Hello,
> >>
> >> I’ve been learning and working with PyArrow recently for a project to
> store some atmospheric science data as part of a partitioned dataset, and
> recently the dataset class with the  fsspec/gcsfs filesystem has started
> producing a new error.
> >
> >
> > Hi Kelton,
> >
> > One more question: you say that this started producing a new error, so I
> suppose this worked a while ago? Do you know if you updated some packages
> (eg gcsfs or fsspec) since then? Or something else that might have changed?
> >
> > Joris
> >
>

Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Weston Pace <we...@gmail.com>.
I'm pretty sure GCS is similar to S3 in that there is no such thing as
a "directory".  Instead a directory is often emulated by an empty
file.  Note that the single file being detected is hires-sonde/ (with
a trailing slash).  I'm pretty sure this is the convention for
creating mock directories.  I'm guessing that, if there were multiple
files, it would work OK because we just skip the empty files.

So perhaps this is a problem unique to gcsfs/fsspec and trying to read
an "empty directory".

You might also try the GCS filesystem (released with 7.0.0) instead of
going through fsspec.
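
For anyone who wants to try that route, a minimal sketch (this assumes the
7.0.0 Python bindings already expose pyarrow.fs.GcsFileSystem with an
anonymous option; as Micah notes elsewhere in the thread, the support may
be incomplete in that release):

import pyarrow.dataset as ds
from pyarrow import fs

# Native GCS filesystem, bypassing the fsspec/gcsfs layer entirely.
gcs = fs.GcsFileSystem(anonymous=True)
data = ds.dataset("global-radiosondes/hires-sonde", filesystem=gcs,
                  format="parquet")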

On Wed, Feb 23, 2022 at 2:23 AM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
>
> On Mon, 21 Feb 2022 at 00:04, Kelton Halbert <kt...@wxbyte.com> wrote:
>>
>> Hello,
>>
>> I’ve been learning and working with PyArrow recently for a project to store some atmospheric science data as part of a partitioned dataset, and recently the dataset class with the  fsspec/gcsfs filesystem has started producing a new error.
>
>
> Hi Kelton,
>
> One more question: you say that this started producing a new error, so I suppose this worked a while ago? Do you know if you updated some packages (eg gcsfs or fsspec) since then? Or something else that might have changed?
>
> Joris
>

Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Mon, 21 Feb 2022 at 00:04, Kelton Halbert <kt...@wxbyte.com> wrote:

> Hello,
>
> I’ve been learning and working with PyArrow recently for a project to
> store some atmospheric science data as part of a partitioned dataset, and
> recently the dataset class with the  fsspec/gcsfs filesystem has started
> producing a new error.
>

Hi Kelton,

One more question: you say that this started producing a new error, so I
suppose this worked a while ago? Do you know if you updated some packages
(eg gcsfs or fsspec) since then? Or something else that might have changed?

Joris

Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Kelton,

I was looking into it a bit, and this seems to be some kind of bug in the
gcsfs package (or fsspec).

When looking at the dataset object that gets created with your initial
example, we can see:

>>> data.files
['global-radiosondes/hires-sonde']

So this indicates that for some reason, it is seeing this top-level
directory as the single file that makes up this partitioned dataset (the
original error message you showed also indicates it is trying to open that
path as if it was a file). Normally, this property should give a list of
all discovered files in the partitioned dataset.
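For a healthy dataset you would expect entries like
'global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451/<file>.parquet'
(hypothetical name), one per discovered Parquet file.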

The dataset discovery gets this information from the filesystem, and it
seems that this is behaving a bit strangely. Under the hood, it is calling
the `info()` method of an fsspec-like filesystem. If I do this manually, I
get a different result for the first vs subsequent call:

In [1]: import gcsfs

In [2]: fs = gcsfs.GCSFileSystem(token="anon")

In [3]: fs.info("global-radiosondes/hires-sonde")
Out[3]:
{'kind': 'storage#object',
 'id': 'global-radiosondes/hires-sonde//1644725282206197',
 'selfLink': 'https://www.googleapis.com/storage/v1/b/global-radiosondes/o/hires-sonde%2F',
 'mediaLink': 'https://storage.googleapis.com/download/storage/v1/b/global-radiosondes/o/hires-sonde%2F?generation=1644725282206197&alt=media',
 'name': 'global-radiosondes/hires-sonde/',
 'bucket': 'global-radiosondes',
 'generation': '1644725282206197',
 'metageneration': '1',
 'contentType': 'text/plain',
 'storageClass': 'STANDARD',
 'size': 0,
 'md5Hash': '1B2M2Y8AsgTpgAmY7PhCfg==',
 'crc32c': 'AAAAAA==',
 'etag': 'CPXjy5Hn+/UCEAE=',
 'temporaryHold': False,
 'eventBasedHold': False,
 'timeCreated': '2022-02-13T04:08:02.226Z',
 'updated': '2022-02-13T04:08:02.226Z',
 'timeStorageClassUpdated': '2022-02-13T04:08:02.226Z',
 'type': 'file'}

In [4]: fs.info("global-radiosondes/hires-sonde")
Out[4]:
{'bucket': 'global-radiosondes',
 'name': 'global-radiosondes/hires-sonde',
 'size': 0,
 'storageClass': 'DIRECTORY',
 'type': 'directory'}

(to see this, you need to run it in a fresh Python session, not after
already trying the ds.dataset(..) call, because that will already have
called info() a first time)

So for this reason, the `ds.dataset(..)` discovery thinks it is dealing
with a single file, and thus subsequently reading data fails.

As a quick workaround, can you test doing this
`fs.info("global-radiosondes/hires-sonde")` call first, before calling
`ds.dataset(..)`? Does it then work? (A sketch is below.)
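
Concretely, something like this (just a sketch of that workaround; it
assumes the second info() call, which correctly reports a directory, is
what the dataset discovery will then see):

import gcsfs
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(token="anon")
# Prime gcsfs: the first info() on this prefix mis-reports it as a
# zero-byte file, while subsequent calls report a directory.
fs.info("global-radiosondes/hires-sonde")
data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
                  format="parquet")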

I would report this to https://github.com/fsspec/gcsfs, as that seems like
a bug in gcsfs or fsspec.

Best,
Joris

On Mon, 21 Feb 2022 at 19:32, Kelton Halbert <kt...@wxbyte.com> wrote:

> Hi Alenka,
>
> Here is the code snippet that loads a single Parquet file. I can also
> confirm that it appears to be with the function call “fs.isfile” on the
> root directory… calling this function myself returns False, as I would
> expect it should: fs.isfile("global-radiosondes/hires-sonde”)
>
> fs = gcsfs.GCSFileSystem(token="anon")
>
> partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
> )
>
> schema = pyarrow.schema([
>     pyarrow.field('lon', pyarrow.float32()),
>     pyarrow.field('lat', pyarrow.float32()),
>     pyarrow.field('pres', pyarrow.float32()),
>     pyarrow.field('hght', pyarrow.float32()),
>     pyarrow.field('gpht', pyarrow.float32()),
>     pyarrow.field('tmpc', pyarrow.float32()),
>     pyarrow.field('dwpc', pyarrow.float32()),
>     pyarrow.field('relh', pyarrow.float32()),
>     pyarrow.field('uwin', pyarrow.float32()),
>     pyarrow.field('vwin', pyarrow.float32()),
>     pyarrow.field('wspd', pyarrow.float32()),
>     pyarrow.field('wdir', pyarrow.float32()),
>     pyarrow.field('year', pyarrow.int32()),
>     pyarrow.field('month', pyarrow.int32()),
>     pyarrow.field('day', pyarrow.int32()),
>     pyarrow.field('hour', pyarrow.int32()),
>     pyarrow.field('WMO', pyarrow.string())
> ])
>
> data =
> ds.dataset("global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451",
> filesystem=fs, format="parquet", \
>                         schema=schema, partitioning=partitioning)
>
> batches = data.to_batches(columns=["pres", "gpht", "hght", "tmpc", "wspd",
> "wdir"], \
>                 use_threads=True)
>
> batches = list(batches)
> print(batches[0].to_pandas().head())
>
> Kelton.
>
>
> On Feb 21, 2022, at 3:07 AM, Alenka Frim <al...@voltrondata.com> wrote:
>
> Hi Kelton,
>
> I can reproduce the same error if I try to load all the data with data =
> ds.dataset("global-radiosondes/hires-sonde", filesystem=fs) or data =
> pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs,
> use_legacy_dataset=False).
>
> Could you share your code, where you read a specific parquet file?
>
> Best,
> Alenka
>
> On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <kt...@wxbyte.com>
> wrote:
>
>> Hello,
>>
>> I’ve been learning and working with PyArrow recently for a project to
>> store some atmospheric science data as part of a partitioned dataset, and
>> recently the dataset class with the  fsspec/gcsfs filesystem has started
>> producing a new error. Unfortunately I cannot seem to track down what
>> changed or if it’s an error on my end or not. I’m using PyArrow 7.0.0 and
>> python 3.8.
>>
>> If I specify a specific parquet file, everything is fine - but if I give
>> it any of the directory partitions, the same issue occurs. Any guidance
>> here would be appreciated!
>>
>> The code:
>> fs = gcsfs.GCSFileSystem(token="anon")
>>
>> partitioning = ds.HivePartitioning(
>>         pyarrow.schema([
>>             pyarrow.field('year', pyarrow.int32()),
>>             pyarrow.field('month', pyarrow.int32()),
>>             pyarrow.field('day', pyarrow.int32()),
>>             pyarrow.field('hour', pyarrow.int32()),
>>             pyarrow.field('WMO', pyarrow.string())
>>         ])
>> )
>>
>> schema = pyarrow.schema([
>>     pyarrow.field('lon', pyarrow.float32()),
>>     pyarrow.field('lat', pyarrow.float32()),
>>     pyarrow.field('pres', pyarrow.float32()),
>>     pyarrow.field('hght', pyarrow.float32()),
>>     pyarrow.field('gpht', pyarrow.float32()),
>>     pyarrow.field('tmpc', pyarrow.float32()),
>>     pyarrow.field('dwpc', pyarrow.float32()),
>>     pyarrow.field('relh', pyarrow.float32()),
>>     pyarrow.field('uwin', pyarrow.float32()),
>>     pyarrow.field('vwin', pyarrow.float32()),
>>     pyarrow.field('wspd', pyarrow.float32()),
>>     pyarrow.field('wdir', pyarrow.float32()),
>>     pyarrow.field('year', pyarrow.int32()),
>>     pyarrow.field('month', pyarrow.int32()),
>>     pyarrow.field('day', pyarrow.int32()),
>>     pyarrow.field('hour', pyarrow.int32()),
>>     pyarrow.field('WMO', pyarrow.string())
>> ])
>>
>> data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
>> format="parquet", \
>>                         partitioning=partitioning, schema=schema)
>>
>> subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
>>
>> batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd",
>> "wdir", "year", "month", "day", "hour"], \
>>                 use_threads=True)
>>
>> batches = list(batches)
>>
>> The error:
>>
>>     391 from pyarrow import PythonFile
>>     393 if not self.fs.isfile(path):
>> --> 394     raise FileNotFoundError(path)
>>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
>> FileNotFoundError: global-radiosondes/hires-sonde/
>>
>>
>>
>

Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Kelton Halbert <kt...@wxbyte.com>.
Hi Alenka,

Here is the code snippet that loads a single Parquet file. I can also confirm that the failure appears to be in the function call “fs.isfile” on the root directory… calling this function myself returns False, as I would expect it should: fs.isfile("global-radiosondes/hires-sonde")

import gcsfs
import pyarrow
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(token="anon")

partitioning = ds.HivePartitioning(
        pyarrow.schema([
            pyarrow.field('year', pyarrow.int32()),
            pyarrow.field('month', pyarrow.int32()),
            pyarrow.field('day', pyarrow.int32()),
            pyarrow.field('hour', pyarrow.int32()),
            pyarrow.field('WMO', pyarrow.string())
        ])
)

schema = pyarrow.schema([
    pyarrow.field('lon', pyarrow.float32()),
    pyarrow.field('lat', pyarrow.float32()),
    pyarrow.field('pres', pyarrow.float32()),
    pyarrow.field('hght', pyarrow.float32()),
    pyarrow.field('gpht', pyarrow.float32()),
    pyarrow.field('tmpc', pyarrow.float32()),
    pyarrow.field('dwpc', pyarrow.float32()),
    pyarrow.field('relh', pyarrow.float32()),
    pyarrow.field('uwin', pyarrow.float32()),
    pyarrow.field('vwin', pyarrow.float32()),
    pyarrow.field('wspd', pyarrow.float32()),
    pyarrow.field('wdir', pyarrow.float32()),
    pyarrow.field('year', pyarrow.int32()),
    pyarrow.field('month', pyarrow.int32()),
    pyarrow.field('day', pyarrow.int32()),
    pyarrow.field('hour', pyarrow.int32()),
    pyarrow.field('WMO', pyarrow.string())
])

data = ds.dataset("global-radiosondes/hires-sonde/year=2016/month=5/day=24/hour=19/WMO=72451",
                        filesystem=fs, format="parquet", schema=schema, partitioning=partitioning)

batches = data.to_batches(columns=["pres", "gpht", "hght", "tmpc", "wspd", "wdir"],
                use_threads=True)

batches = list(batches)
print(batches[0].to_pandas().head())

Kelton.


> On Feb 21, 2022, at 3:07 AM, Alenka Frim <al...@voltrondata.com> wrote:
> 
> Hi Kelton,
> 
> I can reproduce the same error if I try to load all the data with data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs) or data = pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs, use_legacy_dataset=False).
> 
> Could you share your code, where you read a specific parquet file?
> 
> Best,
> Alenka 
> 
> On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <kt...@wxbyte.com> wrote:
> Hello,
> 
> I’ve been learning and working with PyArrow recently for a project to store some atmospheric science data as part of a partitioned dataset, and recently the dataset class with the  fsspec/gcsfs filesystem has started producing a new error. Unfortunately I cannot seem to track down what changed or if it’s an error on my end or not. I’m using PyArrow 7.0.0 and python 3.8.
> 
> If I specify a specific parquet file, everything is fine - but if I give it any of the directory partitions, the same issue occurs. Any guidance here would be appreciated!
> 
> The code: 
> fs = gcsfs.GCSFileSystem(token="anon")
> 
> partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
> )
> 
> schema = pyarrow.schema([
>     pyarrow.field('lon', pyarrow.float32()),
>     pyarrow.field('lat', pyarrow.float32()),
>     pyarrow.field('pres', pyarrow.float32()),
>     pyarrow.field('hght', pyarrow.float32()),
>     pyarrow.field('gpht', pyarrow.float32()),
>     pyarrow.field('tmpc', pyarrow.float32()),
>     pyarrow.field('dwpc', pyarrow.float32()),
>     pyarrow.field('relh', pyarrow.float32()),
>     pyarrow.field('uwin', pyarrow.float32()),
>     pyarrow.field('vwin', pyarrow.float32()),
>     pyarrow.field('wspd', pyarrow.float32()),
>     pyarrow.field('wdir', pyarrow.float32()),
>     pyarrow.field('year', pyarrow.int32()),
>     pyarrow.field('month', pyarrow.int32()),
>     pyarrow.field('day', pyarrow.int32()),
>     pyarrow.field('hour', pyarrow.int32()),
>     pyarrow.field('WMO', pyarrow.string())
> ])
> 
> data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs, format="parquet", \
>                         partitioning=partitioning, schema=schema)
> 
> subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
> 
> batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir", "year", "month", "day", "hour"], \
>                 use_threads=True)
> 
> batches = list(batches)
> 
> The error:
>     391 from pyarrow import PythonFile
>     393 if not self.fs.isfile(path):
> --> 394     raise FileNotFoundError(path)
>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
> 
> FileNotFoundError: global-radiosondes/hires-sonde/
> 


Re: FileNotFound Error on root directory with fsspec partitioned dataset

Posted by Alenka Frim <al...@voltrondata.com>.
Hi Kelton,

I can reproduce the same error if I try to load all the data with data =
ds.dataset("global-radiosondes/hires-sonde", filesystem=fs) or data =
pq.ParquetDataset("global-radiosondes/hires-sonde", filesystem=fs,
use_legacy_dataset=False).

Could you share the code where you read a specific Parquet file?

Best,
Alenka

On Mon, Feb 21, 2022 at 12:04 AM Kelton Halbert <kt...@wxbyte.com>
wrote:

> Hello,
>
> I’ve been learning and working with PyArrow recently for a project to
> store some atmospheric science data as part of a partitioned dataset, and
> recently the dataset class with the  fsspec/gcsfs filesystem has started
> producing a new error. Unfortunately I cannot seem to track down what
> changed or if it’s an error on my end or not. I’m using PyArrow 7.0.0 and
> python 3.8.
>
> If I specify a specific parquet file, everything is fine - but if I give
> it any of the directory partitions, the same issue occurs. Any guidance
> here would be appreciated!
>
> The code:
> fs = gcsfs.GCSFileSystem(token="anon")
>
> partitioning = ds.HivePartitioning(
>         pyarrow.schema([
>             pyarrow.field('year', pyarrow.int32()),
>             pyarrow.field('month', pyarrow.int32()),
>             pyarrow.field('day', pyarrow.int32()),
>             pyarrow.field('hour', pyarrow.int32()),
>             pyarrow.field('WMO', pyarrow.string())
>         ])
> )
>
> schema = pyarrow.schema([
>     pyarrow.field('lon', pyarrow.float32()),
>     pyarrow.field('lat', pyarrow.float32()),
>     pyarrow.field('pres', pyarrow.float32()),
>     pyarrow.field('hght', pyarrow.float32()),
>     pyarrow.field('gpht', pyarrow.float32()),
>     pyarrow.field('tmpc', pyarrow.float32()),
>     pyarrow.field('dwpc', pyarrow.float32()),
>     pyarrow.field('relh', pyarrow.float32()),
>     pyarrow.field('uwin', pyarrow.float32()),
>     pyarrow.field('vwin', pyarrow.float32()),
>     pyarrow.field('wspd', pyarrow.float32()),
>     pyarrow.field('wdir', pyarrow.float32()),
>     pyarrow.field('year', pyarrow.int32()),
>     pyarrow.field('month', pyarrow.int32()),
>     pyarrow.field('day', pyarrow.int32()),
>     pyarrow.field('hour', pyarrow.int32()),
>     pyarrow.field('WMO', pyarrow.string())
> ])
>
> data = ds.dataset("global-radiosondes/hires-sonde", filesystem=fs,
> format="parquet", \
>                         partitioning=partitioning, schema=schema)
>
> subset = (ds.field("year") == 2016) & (ds.field("WMO") == "72451")
>
> batches = data.to_batches(columns=["pres", "gpht", "tmpc", "wspd", "wdir",
> "year", "month", "day", "hour"], \
>                 use_threads=True)
>
> batches = list(batches)
>
> The error:
>
>     391 from pyarrow import PythonFile
>     393 if not self.fs.isfile(path):
> --> 394     raise FileNotFoundError(path)
>     396 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
> FileNotFoundError: global-radiosondes/hires-sonde/
>
>
>