Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2020/11/08 17:24:00 UTC

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

    [ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228257#comment-17228257 ] 

Lance Dacey commented on ARROW-10517:
-------------------------------------

+ [~mdurant] and [~jorisvandenbossche]

You both helped me with a similar issue before. There seems to be an incompatibility between fsspec and the new pyarrow.dataset feature. If I upgrade adlfs and the Azure Blob SDK, then fs.find() returns a list instead of the dictionary that pyarrow expects. If I downgrade adlfs to use SDK v2.1, I get the dictionary pyarrow expects, but there is no mkdir method (which pyarrow requires when writing). Is there a way for me to get this to work? I tried various combinations of fsspec, adlfs, and azure-storage-blob versions but could not find one that worked.
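In case it helps anyone reproduce the mismatch without Azure credentials: newer adlfs returns a bare list of paths from find(), while pyarrow iterates the result with .items(). A rough sketch of a normalizing shim is below — normalize_find is a hypothetical helper of mine, and the file-vs-directory heuristic is just a guess based on the paths; the real fix belongs in adlfs or pyarrow, not user code:

```python
def normalize_find(entries):
    """Coerce fsspec find() output into the {path: info} dict pyarrow expects.

    Newer adlfs returns a plain list of paths; older adlfs (SDK 2.1) returns
    a {path: info} dict when detail=True. This shim makes both look alike.
    """
    if isinstance(entries, dict):
        return entries  # already the shape pyarrow iterates with .items()
    # List of bare paths: synthesize minimal info dicts. Guessing the entry
    # type from the presence of a file extension -- a heuristic, not adlfs logic.
    return {
        path: {
            "name": path,
            "type": "file" if "." in path.rsplit("/", 1)[-1] else "directory",
        }
        for path in entries
    }


paths = [
    "test/test6/year=2020",
    "test/test6/year=2020/month=11/day=1/8ee6c663.parquet",
]
# This .items() call is exactly what raises AttributeError on the raw list:
for path, info in normalize_find(paths).items():
    print(path, info["type"])
```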

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> ------------------------------------------------------------------------
>
>                 Key: ARROW-10517
>                 URL: https://issues.apache.org/jira/browse/ARROW-10517
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: azureblob, dataset, dataset-parquet-read, dataset-parquet-write, fsspec
>
>  
>  
> If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 errors on a detail kwarg, so it needs upgrading):
>  
> {code:java}
> pa.dataset.write_dataset(
>     data=table,
>     base_dir="test/test7",
>     basename_template=None,
>     format="parquet",
>     partitioning=DirectoryPartitioning(
>         pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])
>     ),
>     schema=table.schema,
>     filesystem=blob_fs){code}
>  
> {code:java}
>     226 def create_dir(self, path, recursive):
>     227     # mkdir also raises FileNotFoundError when base directory is not found
> --> 228     self.fs.mkdir(path, create_parents=recursive){code}
>  
> It does not look like there is a mkdir method. However, fs.find() returns a dictionary, as expected:
> {code:java}
> selected_files = blob_fs.find(
>     "test/test6", maxdepth=None, withdirs=True, detail=True
> ){code}
>  
> Now if I install the latest version of adlfs, it upgrades the blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires 2.1, so this is for testing purposes only):
> {code:java}
> Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
>  
> Now fs.find() returns a list, but I am able to use fs.mkdir().
> {code:java}
> ['test/test6/year=2020',
>  'test/test6/year=2020/month=11',
>  'test/test6/year=2020/month=11/day=1',
>  'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
>  'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet',]{code}
>  
> This causes issues later when I try to read a dataset (the code still expects a dictionary):
> {code:java}
> dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
> {code:java}
> --> 221 for path, info in selected_files.items():
>     222     infos.append(self._create_file_info(path, info))
>
> AttributeError: 'list' object has no attribute 'items'{code}
>  
> I am still able to read individual files:
> {code:java}
> dataset = ds.dataset("test/test4/year=2020/month=11/2020-11.parquet", filesystem=blob_fs, format="parquet"){code}
> And I can read the dataset if I pass in a list of blob names "manually":
> {code:java}
> blobs = wasb.list_blobs(container_name="test", prefix="test4")
> dataset = ds.dataset(
>     source=["test/" + blob.name for blob in blobs],
>     format="parquet",
>     partitioning="hive",
>     filesystem=blob_fs){code}
>  
> For all of my examples, blob_fs is defined by:
> {code:java}
> blob_fs = fsspec.filesystem(
>     protocol="abfs", account_name=base.login, account_key=base.password
> ){code}
>  
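A note on the mkdir gap in the quoted traceback: pyarrow's create_dir ends up calling fs.mkdir(path, create_parents=recursive), which the adlfs build pinned to SDK 2.1 does not provide. Since directories on blob storage are virtual anyway, a delegating wrapper that supplies a tolerant mkdir is one way to unblock writes while the version conflict gets sorted out. Everything below (MkdirShim, FakeOldAdlfs) is a hypothetical sketch of mine, not adlfs or pyarrow API:

```python
class MkdirShim:
    """Wrap an fsspec-style filesystem that lacks mkdir.

    On blob stores, directories are virtual, so creating one can safely be a
    no-op. This is a sketch of a workaround, not the real adlfs fix.
    """

    def __init__(self, fs):
        self._fs = fs

    def __getattr__(self, name):
        # Delegate every attribute the shim doesn't define to the wrapped fs.
        return getattr(self._fs, name)

    def mkdir(self, path, create_parents=True):
        if hasattr(self._fs, "mkdir"):
            return self._fs.mkdir(path, create_parents=create_parents)
        if hasattr(self._fs, "makedirs") and create_parents:
            return self._fs.makedirs(path, exist_ok=True)
        return None  # virtual directories: nothing to create


class FakeOldAdlfs:
    """Stand-in for the SDK 2.1 adlfs build: find() works, mkdir is missing."""

    def find(self, path, **kwargs):
        return {path + "/a.parquet": {"type": "file"}}


fs = MkdirShim(FakeOldAdlfs())
fs.mkdir("test/test7")          # no longer raises AttributeError
print(fs.find("test/test7"))    # delegated to the wrapped filesystem
```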



--
This message was sent by Atlassian Jira
(v8.3.4#803005)