
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/10 06:41:04 UTC

[GitHub] [arrow] kcyea commented on issue #1510: Question: Can arrow be used to load a parquet file from an Azure blob store?

kcyea commented on issue #1510:
URL: https://github.com/apache/arrow/issues/1510#issuecomment-724494811


   Can we read a list of parquet files under a subdirectory of a container in Azure blob storage? For example, under the container "test_container", there is a list of files under the subdirectory "test.parquet/department=HR/date=2020-01-01/".
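   
   To make the layout concrete, here is a minimal sketch of how such partitioned blob names could be filtered before reading them. It uses only the Python standard library and does not contact Azure. The helper names `partition_values` and `select_blobs`, the file name `part-0.parquet`, and the extra `department=IT` and `date=2020-01-02` entries are hypothetical and only there for illustration.

```python
def partition_values(blob_name):
    """Return the key=value path segments of a blob name as a dict."""
    values = {}
    for segment in blob_name.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain "=" are partition segments
            values[key] = value
    return values


def select_blobs(blob_names, **wanted):
    """Keep the .parquet blobs whose partition values match `wanted`."""
    selected = []
    for name in blob_names:
        if not name.endswith(".parquet"):
            continue
        found = partition_values(name)
        if all(found.get(key) == value for key, value in wanted.items()):
            selected.append(name)
    return selected


# Hypothetical blob names following the layout described above.
blob_names = [
    "test.parquet/department=HR/date=2020-01-01/part-0.parquet",
    "test.parquet/department=HR/date=2020-01-02/part-0.parquet",
    "test.parquet/department=IT/date=2020-01-01/part-0.parquet",
]

selected = select_blobs(blob_names, department="HR", date="2020-01-01")
print(selected)
```

   Each selected name could then be opened and read one file at a time; whether the whole subdirectory can be read in a single call is the question here.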
   
   Besides that, I tested the performance of reading a 298 MB parquet file from Azure blob storage on my local machine, selecting 3 columns to return. It took about 2 minutes to return the result. Can this be improved to less than one minute?
   
   from storefact import get_store
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   params = {
       'account_name': 'test',
       'account_key': 'XXXsome_azure_account_keyXXX',
       'container': 'my-azure-container',
       'create_if_missing': False
   }
   store = get_store('azure', **params)
   
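   # `key` (the blob name) and `columns` (the 3 column names to select) are defined elsewhere.
   reader = store.open(key)
   # read_table returns a pyarrow Table; to_pandas() converts it to a DataFrame.
   df = pq.read_table(reader, columns=columns).to_pandas()
   print(df.head(10))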
   
   

