
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/09 17:25:28 UTC

[GitHub] [arrow] westonpace commented on issue #10492: Doc update ? For Reading and Writing the Apache Parquet Format

westonpace commented on issue #10492:
URL: https://github.com/apache/arrow/issues/10492#issuecomment-857888278


   > Cannot submit a bug since it's not especially a direct issue but it's more something not complete or up to date in the documentation
   
   Please do create a JIRA issue.  Arrow uses JIRA to track all changes (bugs, documentation changes, CI improvements, new features), so you don't have to worry about whether this counts as a bug.  These sound like valid concerns, and a JIRA issue would be appropriate.
   
   > There is a chapter for "Reading from Partitioned Datasets", that's great ... but works only with a local storage and adding a Data Lake URL to a recursive folder don't work, missing the ability to read partitioned parquet files from Cloud
   
   That chapter covers the legacy datasets API (ParquetDataset).  You may be better served by reading up on the new datasets API: https://arrow.apache.org/docs/python/dataset.html#dataset .  The new API accepts a URL as a path, although it currently has first-class support only for S3 and HDFS.  To use Azure Data Lake directly, you would need to create a filesystem for it, because the datasets API needs to be able to list files, search for files, create files, etc.
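
For reference, here is a minimal sketch of reading a partitioned dataset with the new datasets API. It assumes the data is laid out hive-style (directories such as `year=2021/`). The helper name `read_partitioned` is hypothetical and not part of pyarrow.

```python
def read_partitioned(source, filter_expr=None):
    """Open a hive-partitioned Parquet dataset and load it as a pyarrow Table.

    `source` may be a local directory or a URI for a filesystem that pyarrow
    supports natively (for example an S3 or HDFS URI).
    """
    # Imported here so that pyarrow is only required when the helper is called.
    import pyarrow.dataset as ds

    # Discover the files under `source` and infer partition columns from
    # directory names of the form key=value.
    dataset = ds.dataset(source, format="parquet", partitioning="hive")

    # An optional filter on a partition column lets pyarrow skip whole directories.
    return dataset.to_table(filter=filter_expr)
```

Something like `read_partitioned("s3://<bucket>/<prefix>")` should then work for S3, where the bucket and prefix are placeholders, and passing `filter_expr=ds.field("year") == 2021` (with `import pyarrow.dataset as ds`) would restrict the read to one partition.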
   
   That being said, you might be able to make something work by using the fsspec filesystem (https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems) and https://github.com/dask/adlfs .
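
A rough sketch of the fsspec route, following the wrapping pattern described in the filesystems documentation linked above. The helper name `open_dataset_on_fsspec` is hypothetical, hive-style partitioning is an assumption, and I have not tested this against Azure.

```python
def open_dataset_on_fsspec(fsspec_fs, path):
    """Open a hive-partitioned Parquet dataset that lives on an fsspec filesystem."""
    # Imported here so that pyarrow is only required when the helper is called.
    import pyarrow.dataset as ds
    from pyarrow.fs import FSSpecHandler, PyFileSystem

    # Wrap the fsspec filesystem so the datasets API can list and open files.
    arrow_fs = PyFileSystem(FSSpecHandler(fsspec_fs))
    return ds.dataset(path, filesystem=arrow_fs, format="parquet", partitioning="hive")


# Possible usage with adlfs (untested; account, key, and path are placeholders):
#
#   import adlfs
#   fs = adlfs.AzureBlobFileSystem(account_name="<account>", account_key="<key>")
#   table = open_dataset_on_fsspec(fs, "<container>/<path>").to_table()
```

Any fsspec-compatible filesystem can be passed in, so the same helper can be tried against a local fsspec filesystem before pointing it at Azure.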

