Posted to jira@arrow.apache.org by "Lance Dacey (Jira)" <ji...@apache.org> on 2020/11/08 17:14:00 UTC

[jira] [Created] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

Lance Dacey created ARROW-10517:
-----------------------------------

             Summary: [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
                 Key: ARROW-10517
                 URL: https://issues.apache.org/jira/browse/ARROW-10517
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: Ubuntu 18.04
            Reporter: Lance Dacey



If I downgrade adlfs to 0.2.5 and azure-storage-blob to 2.1, and then upgrade fsspec (0.6.2 has errors with the detail kwarg, so I need to upgrade it), writing a dataset fails:

 
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

pa.dataset.write_dataset(
    data=table,
    base_dir="test/test7",
    basename_template=None,
    format="parquet",
    partitioning=DirectoryPartitioning(pa.schema(
        [("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])),
    schema=table.schema,
    filesystem=blob_fs,
){code}
 
{code:java}
    226 def create_dir(self, path, recursive):
    227     # mkdir also raises FileNotFoundError when base directory is not found
--> 228     self.fs.mkdir(path, create_parents=recursive){code}
 

It does not look like this version of adlfs implements mkdir at all (a possible workaround is sketched after the next example). However, the output of fs.find() returns a dictionary as expected:
{code:java}
selected_files = blob_fs.find(
    "test/test6", maxdepth=None, withdirs=True, detail=True
){code}
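Since there is no mkdir to call here, a possible workaround (my own untested sketch, assuming a no-op is acceptable because Azure Blob storage is flat and "directories" are only key prefixes) is to patch one onto the instance:
{code:java}
# Untested workaround sketch: patch a no-op mkdir onto the fsspec instance
# so that pyarrow's create_dir() call succeeds. Assumes a no-op is safe
# because blob "directories" come into existence when blobs are written.
def _noop_mkdir(path, create_parents=True, **kwargs):
    pass

blob_fs.mkdir = _noop_mkdir{code}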
 

Now, if I install the latest version of adlfs, it upgrades my blob SDK to 12.5 (unfortunately, I cannot use this in production since Airflow requires azure-storage-blob 2.1, so this is for testing purposes only):
{code:java}
Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
 

Now fs.find() returns a list instead of a dictionary, but I am able to use fs.mkdir():
{code:java}
['test/test6/year=2020',
 'test/test6/year=2020/month=11',
 'test/test6/year=2020/month=11/day=1',
 'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
 'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet',]{code}
 

This causes issues later when I try to read a dataset (pyarrow's code still expects a dictionary):
{code:java}
dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
{code:java}
--> 221 for path, info in selected_files.items():
    222     infos.append(self._create_file_info(path, info))
    223
AttributeError: 'list' object has no attribute 'items'{code}
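Until the list/dict mismatch is resolved, a possible shim (again an untested sketch of mine, not pyarrow or adlfs API) is to wrap find() so that it returns the {path: info} mapping that pyarrow iterates over with .items():
{code:java}
# Untested shim: rebuild the {path: info} mapping pyarrow expects when this
# adlfs version returns a bare list from find() despite detail=True.
_original_find = blob_fs.find

def _find_with_detail(path, maxdepth=None, withdirs=False, detail=False, **kwargs):
    result = _original_find(path, maxdepth=maxdepth, withdirs=withdirs,
                            detail=detail, **kwargs)
    if detail and isinstance(result, list):
        result = {p: blob_fs.info(p) for p in result}
    return result

blob_fs.find = _find_with_detail{code}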
 

I am still able to read individual files:
{code:java}
dataset = ds.dataset("test/test4/year=2020/month=11/2020-11.parquet", filesystem=blob_fs, format="parquet"){code}
And I can read the dataset if I pass in a list of blob names "manually":

 
{code:java}
blobs = wasb.list_blobs(container_name="test", prefix="test4")
dataset = ds.dataset(
    source=["test/" + blob.name for blob in blobs],
    format="parquet",
    partitioning="hive",
    filesystem=blob_fs,
){code}
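The same "manual" listing can be done with fsspec alone instead of the azure SDK client (a sketch, assuming find() returns plain paths on this adlfs version, as shown above):
{code:java}
# Sketch: list the parquet files under the prefix via fsspec itself.
paths = [p for p in blob_fs.find("test/test4") if p.endswith(".parquet")]
dataset = ds.dataset(source=paths, format="parquet",
                     partitioning="hive", filesystem=blob_fs){code}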
 

For all of my examples, blob_fs is defined by:
{code:java}
import fsspec

blob_fs = fsspec.filesystem(
    protocol="abfs", account_name=base.login, account_key=base.password
){code}
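For reference, ds.dataset() wraps an fsspec filesystem automatically, but the wrapping can also be done explicitly with pyarrow 2.0's FSSpecHandler (the class whose create_dir/get_file_info_selector appear in the tracebacks above):
{code:java}
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Explicit form of the wrapping ds.dataset() performs internally when it
# is handed an fsspec-compatible filesystem.
arrow_fs = PyFileSystem(FSSpecHandler(blob_fs))
dataset = ds.dataset("test/test5", filesystem=arrow_fs, format="parquet"){code}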
 
--
This message was sent by Atlassian Jira
(v8.3.4#803005)