Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2020/11/08 17:14:00 UTC
[jira] [Created] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
Lance Dacey created ARROW-10517:
-----------------------------------
Summary: [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
Key: ARROW-10517
URL: https://issues.apache.org/jira/browse/ARROW-10517
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Environment: Ubuntu 18.04
Reporter: Lance Dacey
If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 raises errors with the `detail` kwarg, so I need a newer version), writing a dataset fails:
{code:java}
pa.dataset.write_dataset(
    data=table,
    base_dir="test/test7",
    basename_template=None,
    format="parquet",
    partitioning=DirectoryPartitioning(
        pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])
    ),
    schema=table.schema,
    filesystem=blob_fs,
){code}
{code:java}
    226 def create_dir(self, path, recursive):
    227     # mkdir also raises FileNotFoundError when base directory is not found
--> 228     self.fs.mkdir(path, create_parents=recursive){code}
It does not look like this version of adlfs implements mkdir. However, the output of fs.find() returns a dictionary as expected:
{code:java}
selected_files = blob_fs.find(
    "test/test6", maxdepth=None, withdirs=True, detail=True
){code}
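For reference, with `detail=True` the older adlfs returns a mapping of path to info dict, which is the shape pyarrow's fsspec handler iterates over. A minimal sketch with made-up paths and info fields (not actual adlfs output):

```python
# Sketch of the {path: info} mapping fs.find(..., detail=True) returns on
# older adlfs; the paths and info values below are illustrative only.
selected_files = {
    "test/test6/year=2020": {
        "name": "test/test6/year=2020",
        "type": "directory",
        "size": 0,
    },
    "test/test6/year=2020/month=11/day=1/part-0.parquet": {
        "name": "test/test6/year=2020/month=11/day=1/part-0.parquet",
        "type": "file",
        "size": 1024,
    },
}

# pyarrow's handler walks this mapping with .items(), so the dict form works:
for path, info in selected_files.items():
    assert info["name"] == path
```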
Now if I install the latest version of adlfs it upgrades my blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires 2.1, so this is only for testing purposes):
{code:java}
Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
Now fs.find() returns a list, but I am able to use fs.mkdir().
{code:java}
['test/test6/year=2020',
'test/test6/year=2020/month=11',
'test/test6/year=2020/month=11/day=1',
'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
 'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet']{code}
This causes issues later when I try to read a dataset (pyarrow's code still expects a dictionary):
{code:java}
dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
{code:java}
--> 221 for path, info in selected_files.items():
    222     infos.append(self._create_file_info(path, info))
    223

AttributeError: 'list' object has no attribute 'items'{code}
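One possible workaround (a sketch only, not tested against pyarrow internals) is to normalize whatever fs.find() returns back into the `{path: info}` dict form before pyarrow sees it, fetching info per path when given the newer list form. The `FakeFS` class below is a stand-in for illustration; a real fsspec filesystem such as adlfs provides `.info()` with the same contract:

```python
# Hedged sketch: accept either fs.find() shape and return the dict form
# that pyarrow iterates with .items().
def normalize_find_output(fs, found):
    if isinstance(found, dict):
        return found  # older adlfs: already {path: info}
    # newer adlfs: plain list of paths; look up info per path
    # (extra round-trips against real blob storage, but correct shape)
    return {path: fs.info(path) for path in found}


# Minimal stand-in filesystem for illustration only.
class FakeFS:
    def info(self, path):
        return {"name": path, "type": "file", "size": 0}


files = normalize_find_output(FakeFS(), ["a/b.parquet", "a/c.parquet"])
```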
I am still able to read individual files:
{code:java}
dataset = ds.dataset("test/test4/year=2020/month=11/2020-11.parquet", filesystem=blob_fs, format="parquet"){code}
And I can read the dataset if I pass in a list of blob names "manually":
{code:java}
blobs = wasb.list_blobs(container_name="test", prefix="test4")
dataset = ds.dataset(
    source=["test/" + blob.name for blob in blobs],
    format="parquet",
    partitioning="hive",
    filesystem=blob_fs,
)
{code}
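When building the source list manually like this, it may help to keep only the parquet blobs, since directory placeholder entries in the listing would not parse as files. A hypothetical helper (names are illustrative, not from the SDK):

```python
# Hypothetical helper: turn list_blobs() names into explicit ds.dataset()
# sources, prefixing the container and skipping non-parquet entries.
def parquet_sources(blob_names, container="test"):
    return [
        f"{container}/{name}"
        for name in blob_names
        if name.endswith(".parquet")
    ]


sources = parquet_sources(
    ["test4/year=2020/2020-11.parquet", "test4/year=2020"]
)
```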
For all of my examples, blob_fs is defined by:
{code:java}
blob_fs = fsspec.filesystem(
    protocol="abfs", account_name=base.login, account_key=base.password
){code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)